Re: Importing large datasets
What's the relation between the items and item_descriptions tables? I.e., is there only one item_descriptions record for every id? If it's 1-1, then you can merge all your data into a single database and use the following query.

HTH,
Alex

On Thu, Jun 3, 2010 at 6:34 AM, Blargy wrote:
> Erik Hatcher-4 wrote:
>> One thing that might help indexing speed - create a *single* SQL query
>> to grab all the data you need without using DIH's sub-entities, at
>> least the non-cached ones.
>
> I can't find any example of creating a massive SQL query. Are any out there?
> Will batching still work with this massive query?
Re: Importing large datasets
On Jun 2, 2010, at 10:30 PM, Blargy wrote:
> What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why
> it's so slow, because I am using 2 different datasources?

By batch size, I meant the number of docs sent from the client to Solr.

MySQL batch size is broken. The only thing that will work is -1, or not specifying it at all. If you don't specify it, it materializes all rows into memory.

Does your data really need to be in two different databases? That is undoubtedly your bottleneck.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
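[Editor's note] For reference, a sketch of what that advice looks like in a DIH data-config.xml (connection details here are placeholders, not from the thread). With batchSize="-1", DIH sets fetchSize=Integer.MIN_VALUE on the JDBC Statement, which is MySQL Connector/J's signal to stream rows one at a time:

```xml
<dataConfig>
  <!-- batchSize="-1" becomes fetchSize=Integer.MIN_VALUE on the JDBC
       Statement, which makes the MySQL driver stream rows instead of
       materializing the whole result set in memory. -->
  <dataSource name="ds-items" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db1.example.com/items"
              user="solr" password="***" batchSize="-1"/>
</dataConfig>
```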
Re: Importing large datasets
Frankly, if you can create a script that'll turn your data into valid CSV, that might be the easiest, quickest way to ingest your data. Pragmatic, at least. It avoids the complexity of DIH, allows you to script the export from your DB in the most efficient manner you can, and so on. Solr's CSV update handler is FAST!

	Erik

On Jun 3, 2010, at 2:56 AM, David Stuart wrote:
> On 3 Jun 2010, at 03:51, Blargy wrote:
>> Would dumping the databases to a local file help at all?
>
> I would suspect not, especially with the size of your data. But it would
> be good to know how long that takes, i.e. creating a SQL script that just
> pulls that data out - how long does that take? Also, how many fields are
> you indexing per document: 10, 50, 100?
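[Editor's note] A minimal sketch of the kind of export script Erik suggests, in Java. The table/column values are invented for illustration; the escaping (wrap in double quotes, double any embedded quotes) matches what Solr's CSV handler accepts by default:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;

public class CsvExport {
    // Quote a field for CSV: wrap in double quotes and double any embedded
    // quotes, so commas and quotes in item descriptions survive intact.
    static String csvEscape(String field) {
        if (field == null) return "";
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Write one comma-separated row of escaped fields.
    static void writeRow(PrintWriter out, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(csvEscape(fields.get(i)));
        }
        out.println(sb);
    }

    public static void main(String[] args) {
        StringWriter sw = new StringWriter();
        PrintWriter out = new PrintWriter(sw);
        writeRow(out, List.of("id", "name", "description")); // header row
        writeRow(out, List.of("1", "widget", "A \"large\" item, with HTML"));
        out.flush();
        System.out.print(sw);
        // The resulting file can then be posted to Solr's CSV handler, e.g.:
        // curl 'http://localhost:8983/solr/update/csv?commit=true' \
        //      --data-binary @items.csv -H 'Content-type: text/plain; charset=utf-8'
    }
}
```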
Re: Importing large datasets
On 3 Jun 2010, at 03:51, Blargy wrote:
> Would dumping the databases to a local file help at all?

I would suspect not, especially with the size of your data. But it would be good to know how long that takes, i.e. creating a SQL script that just pulls that data out - how long does that take? Also, how many fields are you indexing per document: 10, 50, 100?
Re: Importing large datasets
On 3 Jun 2010, at 02:51, Dennis Gearon wrote:
> Well, I hope to have around 5 million datasets/documents within 1 year,
> so this is good info. BUT if I DO have that many, then the market I am
> aiming at will end up giving me 100 times more than that within 2 years.
>
> Are there good references/books on using Solr/Lucene/(linux/nginx) for
> 500 million plus documents?

As far as I'm aware, there aren't any books yet that cover this for Solr. The wiki, this mailing list, and Nabble are your best sources, and there have been some quite in-depth conversations on the matter in this list in the past.

> The data is easily shardable geographically, as one given.
>
> Dennis Gearon
>
> Signature Warning
> EARTH has a Right To Life, otherwise we all die.
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
Re: Importing large datasets
On 3 Jun 2010, at 02:58, Dennis Gearon wrote:
> When adding data continuously, that data is available after committing
> and is indexed, right?

Yes.

> If so, how often does reindexing do some good?

You should only need to reindex if the data changes or you change your schema. The DIH in Solr 1.4 supports delta imports, so you should only really be adding or updating (which is actually deleting and adding) items when necessary.
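[Editor's note] To make the delta-import point concrete, a sketch of the relevant entity attributes in data-config.xml (table and column names are invented; DIH records the last run time in dataimport.properties and exposes it as ${dataimporter.last_index_time}):

```xml
<!-- deltaQuery finds the ids changed since the last import;
     deltaImportQuery re-fetches the full row for each changed id. -->
<entity name="item" pk="id"
        query="SELECT id, name, description FROM items"
        deltaQuery="SELECT id FROM items
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, name, description FROM items
                          WHERE id = '${dataimporter.delta.id}'"/>
```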
Re: Importing large datasets
Would dumping the databases to a local file help at all?

--
View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Erik Hatcher-4 wrote:
> One thing that might help indexing speed - create a *single* SQL query
> to grab all the data you need without using DIH's sub-entities, at
> least the non-cached ones.

I can't find any example of creating a massive SQL query. Are any out there? Will batching still work with this massive query?
Re: Importing large datasets
Lance Norskog-2 wrote:
> Wait! You're fetching records from one database and then doing lookups
> against another DB? That makes this a completely different problem.
>
> The DIH does not, to my knowledge, have the ability to "pool" these
> queries. That is, it will not build a batch of 1000 keys from
> datasource1 and then do a query against datasource2 with:
> select foo where key_field IN (key1, key2, ... key1000);
>
> This is the efficient way to do what you want. You'll have to write
> your own client to do this.

What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why it's so slow, because I am using 2 different datasources?

Say I am using just one datasource: should I still be seeing "Creating a connection for entity" for each sub-entity in the document, or should it just be using one connection?
Re: Importing large datasets
That's promising!!! That's how I have been designing my project. It must be all the joins that are causing the problems for him?

Dennis Gearon

--- On Wed, 6/2/10, David Stuart wrote:
> How long does it take to do a grab of all the data via SQL? I found
> that denormalizing the data into a lookup table meant that I was able
> to index about 300k rows of similar data size, with DIH regex splitting
> on some fields, in about 8 mins. I know it's not quite the same scale,
> but with batching...
Re: Importing large datasets
When adding data continuously, that data is available after committing and is indexed, right? If so, how often does reindexing do some good?

Dennis Gearon

--- On Wed, 6/2/10, Andrzej Bialecki wrote:
>> If the bottleneck is the DB, will that do much?
>
> Nope. But the workflow could be set up so that during night hours a DB
> export takes place that results in a CSV or SolrXML file (there you
> could measure the time it takes to do this export), and then indexing
> can work from this file.
Re: Importing large datasets
Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end up giving me 100 times more than that within 2 years.

Are there good references/books on using Solr/Lucene/(linux/nginx) for 500 million plus documents?

The data is easily shardable geographically, as one given.

Dennis Gearon

--- On Wed, 6/2/10, Grant Ingersoll wrote:
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> We have around 5 million items in our index and each item has a
>> description located on a separate physical database. These item
>> descriptions vary in size and for the most part are quite large.
>> Currently we are only indexing items and not their corresponding
>> description, and a full import takes around 4 hours. Ideally we want to
>> index both our items and their descriptions, but after some quick
>> profiling I determined that a full import would take in excess of 24
>> hours.
>>
>> - How would I profile the indexing process to determine if the
>> bottleneck is Solr or our database?
>
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).
>
> When you say "quite large", what do you mean? Are we talking books here
> or maybe a couple pages of text or just a couple KB of data?
>
> How long does it take you to get that data out (and, from the sounds of
> it, merge it with your item) w/o going to Solr?
>
>> - In either case, how would one speed up this process? Is there a way
>> to run parallel import processes and then merge them together at the
>> end? Possibly use some sort of distributed computing?
>
> DataImportHandler now supports multiple threads. The absolute fastest
> way that I know of to index is via multiple threads sending batches of
> documents at a time (at least 100). Often, from DBs one can split up the
> table via SQL statements that can then be fetched separately. You may
> want to write your own multithreaded client to index.
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
Re: Importing large datasets
Wait! You're fetching records from one database and then doing lookups against another DB? That makes this a completely different problem.

The DIH does not, to my knowledge, have the ability to "pool" these queries. That is, it will not build a batch of 1000 keys from datasource1 and then do a query against datasource2 with:

select foo where key_field IN (key1, key2, ... key1000);

This is the efficient way to do what you want. You'll have to write your own client to do this.

On Wed, Jun 2, 2010 at 12:00 PM, David Stuart wrote:
> How long does it take to do a grab of all the data via SQL? I found
> that denormalizing the data into a lookup table meant that I was able
> to index about 300k rows of similar data size in about 8 mins.

--
Lance Norskog
goks...@gmail.com
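[Editor's note] A sketch of the keyed batch lookup Lance describes: collect up to N ids from datasource1, then fetch their descriptions from datasource2 with one IN query per batch. The table/column names are invented for illustration, and the ids should be bound through PreparedStatement placeholders, never concatenated into the SQL:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedLookup {
    /** Split ids into chunks of at most batchSize (the last may be smaller). */
    static List<List<Long>> batches(List<Long> ids, int batchSize) {
        List<List<Long>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            out.add(ids.subList(i, Math.min(i + batchSize, ids.size())));
        }
        return out;
    }

    /** Build "SELECT ... WHERE item_id IN (?,?,...)" with one placeholder
     *  per id; the ids are then bound via PreparedStatement.setLong. */
    static String inQuery(int n) {
        StringBuilder sb = new StringBuilder(
            "SELECT item_id, description FROM item_descriptions WHERE item_id IN (");
        for (int i = 0; i < n; i++) sb.append(i == 0 ? "?" : ",?");
        return sb.append(')').toString();
    }
}
```

Each batch of rows from datasource1 yields one round trip to datasource2, instead of one lookup per item.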
Re: Importing large datasets
How long does it take to do a grab of all the data via SQL? I found that denormalizing the data into a lookup table meant that I was able to index about 300k rows of similar data size, with DIH regex splitting on some fields, in about 8 mins. I know it's not quite the same scale, but with batching...

David Stuart

On 2 Jun 2010, at 17:58, Blargy wrote:
>> One thing that might help indexing speed - create a *single* SQL query
>> to grab all the data you need without using DIH's sub-entities, at
>> least the non-cached ones.
>
> Not sure how much that would help. As I mentioned, without the item
> description import the full process takes 4 hours, which is bearable.
> However, once I started to import the item description, which is located
> on a separate machine/database, the import process exploded to over 24
> hours.
Re: Importing large datasets
> One thing that might help indexing speed - create a *single* SQL query
> to grab all the data you need without using DIH's sub-entities, at
> least the non-cached ones.

Not sure how much that would help. As I mentioned, without the item description import the full process takes 4 hours, which is bearable. However, once I started to import the item description, which is located on a separate machine/database, the import process exploded to over 24 hours.
Re: Importing large datasets
One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones.

	Erik

On Jun 2, 2010, at 12:21 PM, Blargy wrote:
> Also wanted to add that our main entity (item) consists of 5
> sub-entities (i.e., joins). 2 of those 5 are fairly small so I am using
> CachedSqlEntityProcessor for them, but the other 3 (which includes
> item_description) are normal.
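[Editor's note] A sketch of what collapsing sub-entities into one query might look like in data-config.xml. All table and column names here are invented for illustration; the point is that a single JOINed SELECT replaces the per-row sub-entity queries:

```xml
<document>
  <!-- One denormalized SELECT instead of an <entity> per join: DIH issues
       a single query for the whole import rather than one sub-entity
       query per item row. -->
  <entity name="item" query="
      SELECT i.id, i.name, c.name AS category, s.name AS seller
      FROM items i
      JOIN categories c ON c.id = i.category_id
      JOIN sellers s   ON s.id = i.seller_id">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
    <field column="category" name="category"/>
    <field column="seller" name="seller"/>
  </entity>
</document>
```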
Re: Importing large datasets
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).

Also wanted to add that our main entity (item) consists of 5 sub-entities (i.e., joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them, but the other 3 (which includes item_description) are normal.

All the entities minus the item_description connect to datasource1. They currently point to one physical machine, although we do have a pool of 3 DBs that could be used if it helps. The other entity, item_description, uses a datasource2 which has a pool of 2 DBs that could potentially be used. Not sure if that would help or not.

I might as well add that the item description will have indexed, stored and term vectors set to true.
Re: Importing large datasets
Andrzej Bialecki wrote:
> SOLR-1301 is also an option if you are familiar with Hadoop ...

I haven't worked with Hadoop before, but I'm willing to try anything to cut down this full import time. I see this currently uses the embedded Solr server for indexing... would I have to scrap my DIH importing then?
Re: Importing large datasets
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).

Our master Solr machine is running 64-bit RHEL 5.4 on a dedicated machine with 4 cores and 16G RAM, so I think we are good on the hardware. Our DB is MySQL version 5.0.67 (exact stats I don't know off the top of my head).

> When you say "quite large", what do you mean? Are we talking books here
> or maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an eBay listing and can include HTML. We are talking about a couple of pages of text.

> How long does it take you to get that data out (and, from the sounds of
> it, merge it with your item) w/o going to Solr?

I'll have to get back to you on that one.

> DataImportHandler now supports multiple threads.

When you say "now", what do you mean? I am running version 1.4.

> The absolute fastest way that I know of to index is via multiple threads
> sending batches of documents at a time (at least 100).

Is there a wiki explaining how this multiple-thread process works? Which batch size would work best? I am currently using a -1 batch size.

> You may want to write your own multithreaded client to index.

This sounds like a viable option. Can you point me in the right direction on where to begin (what classes to look at, prior examples, etc)?

Here is my field type I am using for the item description. Maybe it's not the best? Here is an overview of my data-config.xml. Thoughts?

...

I appreciate the help.
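[Editor's note] Since no concrete example is given in the thread, here is a minimal sketch of the kind of multithreaded client Grant describes: split the id range into chunks that separate SQL statements can fetch, and have worker threads send batches of at least 100 docs each. Everything below is plain JDK and the numbers are illustrative; the SolrJ call is shown only as a comment:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {
    /** Split [minId, maxId] into `threads` contiguous ranges, one per worker,
     *  so each can run its own "WHERE id BETWEEN lo AND hi" SELECT. */
    static List<long[]> idRanges(long minId, long maxId, int threads) {
        List<long[]> ranges = new ArrayList<>();
        long span = (maxId - minId + 1 + threads - 1) / threads; // ceiling division
        for (long lo = minId; lo <= maxId; lo += span) {
            ranges.add(new long[]{lo, Math.min(lo + span - 1, maxId)});
        }
        return ranges;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (long[] r : idRanges(1, 5_000_000, 4)) {
            pool.submit(() -> {
                // For each range: fetch rows with
                //   SELECT ... FROM items WHERE id BETWEEN r[0] AND r[1]
                // accumulate SolrInputDocuments, and for every batch of
                // >= 100 docs call solrServer.add(batch)
                // (CommonsHttpSolrServer in SolrJ 1.4), committing once at
                // the very end rather than per batch.
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The range-splitting is the part worth getting right: the chunks must cover the whole id range exactly once, or documents get skipped or indexed twice.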
Re: Importing large datasets
On 2010-06-02 13:12, Grant Ingersoll wrote: > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > >> On 2010-06-02 12:42, Grant Ingersoll wrote: >>> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >>> We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. >>> >>> As a data point, I routinely see clients index 5M items on normal >>> hardware in approx. 1 hour (give or take 30 minutes). >>> >>> When you say "quite large", what do you mean? Are we talking books here or >>> maybe a couple pages of text or just a couple KB of data? >>> >>> How long does it take you to get that data out (and, from the sounds of it, >>> merge it with your item) w/o going to Solr? >>> - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? >>> >>> DataImportHandler now supports multiple threads. The absolute fastest way >>> that I know of to index is via multiple threads sending batches of >>> documents at a time (at least 100). Often, from DBs one can split up the >>> table via SQL statements that can then be fetched separately. You may want >>> to write your own multithreaded client to index. >> >> SOLR-1301 is also an option if you are familiar with Hadoop ... >> > > If the bottleneck is the DB, will that do much? > Nope. 
But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/|   Information Retrieval, Semantic Web
___|||__||  \|  ||  |   Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com
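The nightly-export idea above could be sketched like this (the table layout, column names, and file path are made up for illustration; Solr 1.4 ships a CSV handler at `/update/csv` that can ingest such a file):

```python
# Sketch: dump items plus their descriptions to a CSV file during off-hours,
# then feed that file to Solr separately. The (id, name, description) tuple
# shape is an assumption standing in for the real items/item_descriptions join.
import csv

def export_rows_to_csv(rows, path):
    """Write (id, name, description) tuples to a CSV file Solr can ingest."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row doubles as the Solr field names for the CSV handler.
        writer.writerow(["id", "name", "description"])
        writer.writerows(rows)

# The resulting file could then be streamed to Solr, e.g.:
#   curl "http://localhost:8983/solr/update/csv?commit=true" \
#        --data-binary @items.csv -H "Content-Type: text/csv; charset=utf-8"
```

Timing the export step alone is what tells you whether the DB or Solr is the bottleneck, which is the point of doing it this way.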
Re: Importing large datasets
On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>
>>> We have around 5 million items in our index and each item has a
>>> description located on a separate physical database. These item
>>> descriptions vary in size and for the most part are quite large.
>>> Currently we are only indexing items and not their corresponding
>>> description and a full import takes around 4 hours. Ideally we want to
>>> index both our items and their descriptions but after some quick
>>> profiling I determined that a full import would take in excess of 24
>>> hours.
>>>
>>> - How would I profile the indexing process to determine if the
>>> bottleneck is Solr or our Database.
>>
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).
>>
>> When you say "quite large", what do you mean? Are we talking books here
>> or maybe a couple pages of text or just a couple KB of data?
>>
>> How long does it take you to get that data out (and, from the sounds of
>> it, merge it with your item) w/o going to Solr?
>>
>>> - In either case, how would one speed up this process? Is there a way
>>> to run parallel import processes and then merge them together at the
>>> end? Possibly use some sort of distributed computing?
>>
>> DataImportHandler now supports multiple threads. The absolute fastest
>> way that I know of to index is via multiple threads sending batches of
>> documents at a time (at least 100). Often, from DBs one can split up
>> the table via SQL statements that can then be fetched separately. You
>> may want to write your own multithreaded client to index.
>
> SOLR-1301 is also an option if you are familiar with Hadoop ...

If the bottleneck is the DB, will that do much?
Re: Importing large datasets
On 2010-06-02 12:42, Grant Ingersoll wrote:
>
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>
>> We have around 5 million items in our index and each item has a
>> description located on a separate physical database. These item
>> descriptions vary in size and for the most part are quite large.
>> Currently we are only indexing items and not their corresponding
>> description and a full import takes around 4 hours. Ideally we want to
>> index both our items and their descriptions but after some quick
>> profiling I determined that a full import would take in excess of 24
>> hours.
>>
>> - How would I profile the indexing process to determine if the
>> bottleneck is Solr or our Database.
>
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).
>
> When you say "quite large", what do you mean? Are we talking books here
> or maybe a couple pages of text or just a couple KB of data?
>
> How long does it take you to get that data out (and, from the sounds of
> it, merge it with your item) w/o going to Solr?
>
>> - In either case, how would one speed up this process? Is there a way
>> to run parallel import processes and then merge them together at the
>> end? Possibly use some sort of distributed computing?
>
> DataImportHandler now supports multiple threads. The absolute fastest
> way that I know of to index is via multiple threads sending batches of
> documents at a time (at least 100). Often, from DBs one can split up the
> table via SQL statements that can then be fetched separately. You may
> want to write your own multithreaded client to index.

SOLR-1301 is also an option if you are familiar with Hadoop ...

--
Best regards,
Andrzej Bialecki
Re: Importing large datasets
On Jun 1, 2010, at 9:54 PM, Blargy wrote:

> We have around 5 million items in our index and each item has a
> description located on a separate physical database. These item
> descriptions vary in size and for the most part are quite large.
> Currently we are only indexing items and not their corresponding
> description and a full import takes around 4 hours. Ideally we want to
> index both our items and their descriptions but after some quick
> profiling I determined that a full import would take in excess of 24
> hours.
>
> - How would I profile the indexing process to determine if the
> bottleneck is Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

When you say "quite large", what do you mean? Are we talking books here
or maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of
it, merge it with your item) w/o going to Solr?

> - In either case, how would one speed up this process? Is there a way to
> run parallel import processes and then merge them together at the end?
> Possibly use some sort of distributed computing?

DataImportHandler now supports multiple threads. The absolute fastest
way that I know of to index is via multiple threads sending batches of
documents at a time (at least 100). Often, from DBs one can split up the
table via SQL statements that can then be fetched separately. You may
want to write your own multithreaded client to index.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
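The "split up the table via SQL statements that can then be fetched separately" idea can be sketched as follows (table name, primary-key column, and slice count are all illustrative assumptions; each generated query would be handed to one fetch/index thread):

```python
# Sketch: partition a numeric primary-key range into N non-overlapping
# slices, one SELECT per slice, so N threads can fetch in parallel.
def range_queries(table, pk, min_id, max_id, slices):
    """Return one SELECT per slice covering [min_id, max_id] with no overlap."""
    span = max_id - min_id + 1
    step = -(-span // slices)  # ceiling division so the tail isn't dropped
    queries = []
    for lo in range(min_id, max_id + 1, step):
        hi = min(lo + step - 1, max_id)
        queries.append(
            "SELECT * FROM %s WHERE %s BETWEEN %d AND %d" % (table, pk, lo, hi))
    return queries
```

For example, `range_queries("items", "id", 1, 5000000, 4)` yields four SELECTs over 1.25M-row slices, one per worker. BETWEEN-range scans on the primary key keep each slice a cheap, index-friendly query.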