RE: Question liste solr
Parallel processing in any way will help, including Spark with a DFS like S3 or HDFS. Your three machines could end up being a bottleneck and you may need more nodes.

On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext <mariano.lopez-cortes-...@pole-emploi.fr>, wrote:
> CSV file is approx. 5 GB for 29 million rows.
> [...]
> Any other ideas/alternatives?
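Besides Spark, even plain client-side threading over the pre-split chunk files spreads the load across the cluster. A minimal sketch of that idea; `indexChunk` is a stand-in for the real work (streaming one chunk to Solr via ConcurrentUpdateSolrClient or an HTTP POST), not an actual API:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelIndexer {

    // Placeholder for the real work: streaming one chunk file to Solr.
    // Here it only counts rows so the skeleton is self-contained.
    static void indexChunk(List<String> rows, AtomicLong indexed) {
        indexed.addAndGet(rows.size());
    }

    // Index all chunks from a fixed-size thread pool; returns rows indexed.
    public static long indexAll(List<List<String>> chunks, int threads)
            throws InterruptedException {
        AtomicLong indexed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> chunk : chunks) {
            pool.submit(() -> indexChunk(chunk, indexed));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Four chunks, as in the strategy described in the thread.
        List<List<String>> chunks = List.of(
                List.of("row1", "row2"), List.of("row3"),
                List.of("row4", "row5", "row6"), List.of("row7"));
        System.out.println("indexed rows: " + indexAll(chunks, 4));
    }
}
```

One thread per chunk file is the simplest mapping; more threads than Solr can absorb just moves the bottleneck client-side.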
RE: Question liste solr
CSV file is approx. 5 GB for 29 million rows.

As you say, Christopher, at the beginning we thought that reading chunk by chunk from Oracle and writing to Solr was the best strategy.

But from our tests we've noticed:

* CSV creation via PL/SQL is really, really fast: 40 minutes for the full dataset (with bulk collect).
* Multiple SELECT calls from Java slow down the process. I think Oracle is the bottleneck here.

Any other ideas/alternatives?

Some other points worth noting:

* We are going to enable autoCommit every 10 minutes / 1 rows. No commit from the client.
* During indexing, we always call a front-end load balancer that redirects calls to the 3-node cluster.

Thanks in advance!!

==> Great mailing list and a really awesome tool!!

-----Original Message-----
From: Christopher Schultz [mailto:ch...@christopherschultz.net]
Sent: Monday, 19 March 2018 18:05
To: solr-user@lucene.apache.org
Subject: Re: Question liste solr

[...]
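For reference, the autoCommit policy mentioned above lives in solrconfig.xml, not in the client. A sketch of a time- plus document-count-based hard commit; the maxDocs value here is an arbitrary example, not taken from the thread:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit at most every 10 minutes (value is in ms) -->
    <maxTime>600000</maxTime>
    <!-- example document-count threshold; tune for your ingest rate -->
    <maxDocs>1000000</maxDocs>
    <!-- flush to disk without opening a new searcher during bulk load -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

With openSearcher=false, commits bound the transaction-log size during the bulk load without paying for searcher reopens on every commit.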
Re: Question liste solr
Mariano,

On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> Hello
>
> We have a Solr index with 3 nodes, 1 shard and 2 replicas.
>
> Our goal is to index 42 million rows. Indexing time is important.
> The data source is an Oracle database.
>
> Our indexing strategy is:
>
> * Reading from Oracle into a big CSV file.
> * Reading from 4 files (the big file chunked) and injecting via
>   ConcurrentUpdateSolrClient
>
> Is it the optimal way of injecting such a mass of data into Solr?
>
> For information, the estimated time for our solution is 6h.

How big are the CSV files? If most of the time is taken performing the various SELECT operations, then it's probably a good strategy.

However, you may find that using the disk as a buffer slows everything down, because disk writes can be very slow.

Why not perform your SELECT(s) and write directly to Solr using one of the APIs (either a language-specific API, or through the HTTP API)?

Hope that helps,
-chris
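Chris's suggestion amounts to streaming the JDBC cursor straight into batched update requests, skipping the disk round-trip. A sketch of that pipeline shape, under stated assumptions: the row source would be a ResultSet wrapped as an iterator, and `sender` stands in for the actual SolrJ `add()` or HTTP POST:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class DirectPipeline {

    // Drain a row source (e.g. a JDBC ResultSet wrapped as an Iterator)
    // into fixed-size batches, handing each batch to the sender, which in
    // a real run would POST to Solr's /update or call SolrJ's add().
    // Returns the total number of rows handed off.
    public static long stream(Iterator<String[]> rows, int batchSize,
                              Consumer<List<String[]>> sender) {
        List<String[]> batch = new ArrayList<>(batchSize);
        long total = 0;
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                sender.accept(List.copyOf(batch));
                total += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {          // flush the final partial batch
            sender.accept(List.copyOf(batch));
            total += batch.size();
        }
        return total;
    }
}
```

Batching is what makes the direct route competitive with the CSV approach: one update request per few thousand rows instead of one per row.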
RE: Question liste solr
Sorry. Thanks in advance!!

From: LOPEZ-CORTES Mariano-ext
Sent: Monday, 19 March 2018 16:50
To: 'solr-user@lucene.apache.org'
Subject: RE: Question liste solr

Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data source is an Oracle database.

Our indexing strategy is:

* Reading from Oracle into a big CSV file.
* Reading from 4 files (the big file chunked) and injecting via ConcurrentUpdateSolrClient

Is it the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6h.
RE: Question liste solr
Hello

We have a Solr index with 3 nodes, 1 shard and 2 replicas.

Our goal is to index 42 million rows. Indexing time is important. The data source is an Oracle database.

Our indexing strategy is:

* Reading from Oracle into a big CSV file.
* Reading from 4 files (the big file chunked) and injecting via ConcurrentUpdateSolrClient

Is it the optimal way of injecting such a mass of data into Solr?

For information, the estimated time for our solution is 6h.
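The chunking step in this strategy just needs to split on line boundaries so no CSV record is cut in half. A simplistic sketch (it assumes no embedded newlines inside quoted fields, and works on in-memory lines rather than the real 5 GB file):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvChunker {

    // Split a list of CSV lines into n nearly equal chunks on line
    // boundaries, so every record stays whole. Earlier chunks get the
    // remainder when the line count does not divide evenly.
    public static List<List<String>> split(List<String> lines, int n) {
        List<List<String>> chunks = new ArrayList<>();
        int size = lines.size(), base = size / n, rem = size % n, from = 0;
        for (int i = 0; i < n; i++) {
            int len = base + (i < rem ? 1 : 0);
            chunks.add(lines.subList(from, from + len));
            from += len;
        }
        return chunks;
    }
}
```

For the real file you would apply the same boundary rule while streaming, writing out a new chunk file every size/4 bytes at the next newline.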