RE: Question liste solr

2018-03-20 Thread Rahul Singh
Parallel processing of any kind will help, including Spark with a DFS such as S3 or
HDFS. Your three machines could end up being the bottleneck, and you may need more
nodes.
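
As a rough illustration of the "more parallel writers" idea (not Rahul's Spark setup), here is a minimal Java sketch that fans the four chunk files out over a thread pool; indexChunk is a hypothetical placeholder for whatever actually pushes one chunk into Solr:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoad {
    public static void main(String[] args) throws Exception {
        // Hypothetical chunk files produced by splitting the big CSV export.
        List<String> chunks = List.of("chunk1.csv", "chunk2.csv", "chunk3.csv", "chunk4.csv");

        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        for (String chunk : chunks) {
            pool.submit(() -> indexChunk(chunk)); // one worker per chunk
        }
        pool.shutdown();
        pool.awaitTermination(12, TimeUnit.HOURS); // wait for all chunks to finish
    }

    // Placeholder: push one chunk into Solr, e.g. via ConcurrentUpdateSolrClient
    // or the CSV update handler (sketches appear later in the thread).
    static void indexChunk(String path) {
        System.out.println("indexing " + path);
    }
}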

On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext 
<mariano.lopez-cortes-...@pole-emploi.fr>, wrote:
> The CSV file is approximately 5 GB for 29 million rows.
>
> As you say, Christopher, at the beginning we thought that reading chunk by
> chunk from Oracle and writing to Solr was the best strategy.
>
> But from our tests we've noticed:
>
> CSV creation via PL/SQL is really, really fast: 40 minutes for the full
> dataset (with bulk collect).
> Multiple SELECT calls from Java slow down the process. I think Oracle is the
> bottleneck here.
>
> Any other ideas/alternatives?
>
> Some other points to note:
>
> We are going to enable autoCommit every 10 minutes / 1 rows, with no
> commit from the client.
> During indexing, we go through a front-end load balancer all the time, which
> redirects calls to the 3-node cluster.
>
> Thanks in advance!!
>
> ==> Great mailing list and really awesome tool!!
>


RE: Question liste solr

2018-03-20 Thread LOPEZ-CORTES Mariano-ext
The CSV file is approximately 5 GB for 29 million rows.

As you say, Christopher, at the beginning we thought that reading chunk by chunk
from Oracle and writing to Solr was the best strategy.

But from our tests we've noticed:

CSV creation via PL/SQL is really, really fast: 40 minutes for the full dataset
(with bulk collect).
Multiple SELECT calls from Java slow down the process. I think Oracle is the
bottleneck here.

Any other ideas/alternatives?
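
One possible direction, since the CSV export itself is fast: let Solr parse the CSV by posting each chunk to the /update handler with a text/csv content type. A minimal SolrJ sketch, with the URL, file name and separator as assumptions:

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvPost {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and file name; the separator is an assumption about the export format.
        String solrUrl = "http://load-balancer:8983/solr/mycollection";

        try (HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
            req.addFile(new File("chunk1.csv"), "text/csv"); // Solr's CSV loader parses the file
            req.setParam("separator", ";");  // assumed field delimiter
            req.setParam("commit", "false"); // leave committing to server-side autoCommit
            solr.request(req);
        }
    }
}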

Some other points to note:

We are going to enable autoCommit every 10 minutes / 1 rows, with no commit
from the client (see the config sketch after these points).
During indexing, we go through a front-end load balancer all the time, which
redirects calls to the 3-node cluster.
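
A minimal solrconfig.xml sketch of such an autoCommit setting (inside <updateHandler>), assuming 10 minutes = 600000 ms and openSearcher=false for bulk loading; the exact row threshold mentioned above is not reproduced here:

<autoCommit>
  <!-- hard commit every 10 minutes (600000 ms); a maxDocs threshold can be added as well -->
  <maxTime>600000</maxTime>
  <!-- assumption: keep openSearcher=false so the bulk load does not churn searchers -->
  <openSearcher>false</openSearcher>
</autoCommit>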

Thanks in advance!!

==> Great mailing list and really awesome tool!!

-----Original Message-----
From: Christopher Schultz [mailto:ch...@christopherschultz.net]
Sent: Monday, 19 March 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Question liste solr


Mariano,

On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> Hello
>
> We have a Solr index with 3 nodes, 1 shard and 2 replicas.
>
> Our goal is to index 42 million rows. Indexing time is important.
> The data source is an Oracle database.
>
> Our indexing strategy is:
>
> * Reading from Oracle into one big CSV file.
>
> * Reading from 4 files (the big file, chunked) and injecting via
> ConcurrentUpdateSolrClient.
>
> Is this the optimal way of injecting such a mass of data into Solr?
>
> For information, the estimated time for our solution is 6 hours.

How big are the CSV files? If most of the time is taken performing the various 
SELECT operations, then it's probably a good strategy.

However, you may find that using the disk as a buffer slows everything down 
because disk-writes can be very slow.

Why not perform your SELECT(s) and write directly to Solr using one of the APIs 
(either a language-specific API, or through the HTTP API)?

Hope that helps,
- -chris


Re: Question liste solr

2018-03-19 Thread Christopher Schultz

Mariano,

On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> Hello
>
> We have a Solr index with 3 nodes, 1 shard and 2 replicas.
>
> Our goal is to index 42 million rows. Indexing time is important.
> The data source is an Oracle database.
>
> Our indexing strategy is:
>
> * Reading from Oracle into one big CSV file.
>
> * Reading from 4 files (the big file, chunked) and injecting via
> ConcurrentUpdateSolrClient.
>
> Is this the optimal way of injecting such a mass of data into Solr?
>
> For information, the estimated time for our solution is 6 hours.

How big are the CSV files? If most of the time is taken performing the
various SELECT operations, then it's probably a good strategy.

However, you may find that using the disk as a buffer slows everything
down because disk-writes can be very slow.

Why not perform your SELECT(s) and write directly to Solr using one of
the APIs (either a language-specific API, or through the HTTP API)?
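
A minimal SolrJ sketch of that direct approach; the JDBC URL, credentials, query, field names and Solr URL below are placeholders, not details from the thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SelectToSolr {
    public static void main(String[] args) throws Exception {
        // Placeholders only: adjust the JDBC URL, credentials, query and schema fields.
        String jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/SERVICE";
        String solrUrl = "http://load-balancer:8983/solr/mycollection";

        try (Connection con = DriverManager.getConnection(jdbcUrl, "user", "password");
             Statement st = con.createStatement();
             HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {

            st.setFetchSize(5000); // stream rows from Oracle instead of loading them all at once
            List<SolrInputDocument> batch = new ArrayList<>();

            try (ResultSet rs = st.executeQuery("SELECT id, name FROM my_table")) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("name", rs.getString("name"));
                    batch.add(doc);
                    if (batch.size() == 1000) { // send in batches to limit round trips
                        solr.add(batch);
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            // No explicit commit: server-side autoCommit handles it.
        }
    }
}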

Hope that helps,
- -chris


RE: Question liste solr

2018-03-19 Thread LOPEZ-CORTES Mariano-ext
Sorry. Thanks in advance!!

From: LOPEZ-CORTES Mariano-ext
Sent: Monday, 19 March 2018 16:50
To: 'solr-user@lucene.apache.org'
Subject: RE: Question liste solr

Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data
source is an Oracle database.

Our indexing strategy is:

· Reading from Oracle into one big CSV file.

· Reading from 4 files (the big file, chunked) and injecting via
ConcurrentUpdateSolrClient.

Is this the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6 hours.


RE: Question liste solr

2018-03-19 Thread LOPEZ-CORTES Mariano-ext
Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data
source is an Oracle database.

Our indexing strategy is:

* Reading from Oracle into one big CSV file.

* Reading from 4 files (the big file, chunked) and injecting via
ConcurrentUpdateSolrClient.

Is this the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6 hours.
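
A minimal sketch of the second step above (one reader per chunk feeding ConcurrentUpdateSolrClient); the URL, file name, delimiter and field names are assumptions, not the real schema:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ChunkIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder collection URL behind the front-end load balancer.
        String solrUrl = "http://load-balancer:8983/solr/mycollection";

        try (ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(10000)  // documents buffered before being flushed
                     .withThreadCount(4)    // background threads sending update requests
                     .build();
             BufferedReader in = Files.newBufferedReader(Paths.get("chunk1.csv"))) {

            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(";", -1); // assumed delimiter
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", cols[0]);
                doc.addField("name", cols[1]);
                solr.add(doc); // queued and sent asynchronously by the client
            }
            // No client-side commit: the server-side autoCommit discussed earlier in the thread applies.
        }
    }
}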