Re: Importing large datasets

2010-06-07 Thread Alexey Serba
What's the relation between the items and item_descriptions tables? I.e., is
there only one item_descriptions record for every id?

If it's 1-1, then you can merge all your data into a single database and use
a query along the following lines:
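(A rough sketch; I'm assuming items has a primary key id and item_descriptions
has a matching item_id column plus a description column - adjust the names to
your schema.)

SELECT i.*, d.description
FROM items i
LEFT JOIN item_descriptions d ON d.item_id = i.id;
-- one row per item; the description comes back in the same result set,
-- so no DIH sub-entity (and no second datasource) is needed

The same idea extends to the other sub-entities: turn each of them into a JOIN
so the whole document comes back from a single query.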

 
 

HTH,
Alex

On Thu, Jun 3, 2010 at 6:34 AM, Blargy  wrote:
>
>
> Erik Hatcher-4 wrote:
>>
>> One thing that might help indexing speed - create a *single* SQL query
>> to grab all the data you need without using DIH's sub-entities, at
>> least the non-cached ones.
>>
>>       Erik
>>
>> On Jun 2, 2010, at 12:21 PM, Blargy wrote:
>>
>>>
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware
>>> in approx. 1 hour (give or take 30 minutes).
>>>
>>> Also wanted to add that our main entity (item) consists of 5 sub-
>>> entities
>>> (ie, joins). 2 of those 5 are fairly small so I am using
>>> CachedSqlEntityProcessor for them but the other 3 (which includes
>>> item_description) are normal.
>>>
>>> All the entites minus the item_description connect to datasource1.
>>> They
>>> currently point to one physical machine although we do have a pool
>>> of 3 DB's
>>> that could be used if it helps. The other entity, item_description
>>> uses a
>>> datasource2 which has a pool of 2 DB's that could potentially be
>>> used. Not
>>> sure if that would help or not.
>>>
>>> I might as well that the item description will have indexed, stored
>>> and term
>>> vectors set to true.
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>
> I can't find any example of creating a massive sql query. Any out there?
> Will batching still work with this massive query?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Importing large datasets

2010-06-03 Thread Grant Ingersoll

On Jun 2, 2010, at 10:30 PM, Blargy wrote:
> What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why it's
> so slow, because I am using 2 different datasources?
> 

By batch size, I meant the number of docs sent from the client to Solr.  MySQL 
Batch Size is broken.  The only thing that will work is -1 or not specifying it 
at all.  If you don't specify it, it materializes all rows into memory.

Does your data really need to be in two different databases?  That is 
undoubtedly your bottleneck.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Importing large datasets

2010-06-03 Thread Erik Hatcher
Frankly, if you can create a script that'll turn your data into valid  
CSV, that might be the easiest, quickest way to ingest your data.   
Pragmatic, at least.  Avoids the complexity of DIH, allows you to  
script the export from your DB in the most efficient manner you can,  
and so on.
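As a sketch of that export (assuming MySQL and the items/item_descriptions
tables discussed in this thread; the column names are only illustrative):

SELECT i.id, i.title, d.description
INTO OUTFILE '/tmp/items.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM items i
JOIN item_descriptions d ON d.item_id = i.id;
-- INTO OUTFILE writes the file on the database server and does not emit a
-- header row, so pass the column names to Solr via the fieldnames parameter
-- when posting the file to the /update/csv handler.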


Solr's CSV update handler is FAST!

Erik

On Jun 3, 2010, at 2:56 AM, David Stuart wrote:




On 3 Jun 2010, at 03:51, Blargy wrote:

Would dumping the databases to a local file help at all?

I would suspect not, especially with the size of your data. But it would be
good to know how long that takes, i.e., if you create a SQL script that just
pulls that data out, how long does that take?

Also, how many fields are you indexing per document: 10, 50, 100?
--
View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Importing large datasets

2010-06-02 Thread David Stuart



On 3 Jun 2010, at 03:51, Blargy  wrote:



Would dumping the databases to a local file help at all?


I would suspect not, especially with the size of your data. But it would be
good to know how long that takes, i.e., if you create a SQL script that just
pulls that data out, how long does that take?


Also, how many fields are you indexing per document: 10, 50, 100?
--
View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html

Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread David Stuart



On 3 Jun 2010, at 02:51, Dennis Gearon  wrote:

Well, I hope to have around 5 million datasets/documents within 1 year, so
this is good info. BUT if I DO have that many, then the market I am aiming at
will end up giving me 100 times more than that within 2 years.


Are there good references/books on using Solr/Lucene/(Linux/nginx) for 500
million plus documents?


As far as I'm aware there aren't any books yet that cover this for Solr. The
wiki, this mailing list, and Nabble are your best sources, and there have been
some quite in-depth conversations on the matter in this list in the past.

The data is easily shardable geographically, as one given.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll  wrote:


From: Grant Ingersoll 
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 3:42 AM

On Jun 1, 2010, at 9:54 PM, Blargy wrote:



> We have around 5 million items in our index and each item has a description
> located on a separate physical database. These item descriptions vary in
> size and for the most part are quite large. Currently we are only indexing
> items and not their corresponding description and a full import takes around
> 4 hours. Ideally we want to index both our items and their descriptions but
> after some quick profiling I determined that a full import would take in
> excess of 24 hours.
>
> - How would I profile the indexing process to determine if the bottleneck is
> Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

When you say "quite large", what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr?

> - In either case, how would one speed up this process? Is there a way to run
> parallel import processes and then merge them together at the end? Possibly
> use some sort of distributed computing?

DataImportHandler now supports multiple threads.  The absolute fastest way that
I know of to index is via multiple threads sending batches of documents at a
time (at least 100).  Often, from DBs one can split up the table via SQL
statements that can then be fetched separately.  You may want to write your own
multithreaded client to index.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search




Re: Importing large datasets

2010-06-02 Thread David Stuart



On 3 Jun 2010, at 02:58, Dennis Gearon  wrote:

When adding data continuously, that data is available after  
committing and is indexed, right?

Yes


If so, how often does reindexing do any good?

You should only need to reindex if the data changes or you change your schema.
The DIH in Solr 1.4 supports delta imports, so you should only really be adding
or updating (which is actually deleting and adding) items when necessary.
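For example, the deltaQuery in data-config.xml is just another SQL statement; a
minimal sketch, assuming the items table carries an updated_at timestamp
column:

SELECT id FROM items
WHERE updated_at > '${dataimporter.last_index_time}';
-- DIH substitutes the timestamp of the last import and then feeds each
-- returned id to the entity's deltaImportQuery to re-fetch just those rows.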


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
 otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki  wrote:


From: Andrzej Bialecki 
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 4:52 AM
On 2010-06-02 13:12, Grant Ingersoll wrote:
>
> On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
>
>> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>>
>>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>
>>>> We have around 5 million items in our index and each item has a description
>>>> located on a separate physical database. These item descriptions vary in
>>>> size and for the most part are quite large. Currently we are only indexing
>>>> items and not their corresponding description and a full import takes around
>>>> 4 hours. Ideally we want to index both our items and their descriptions but
>>>> after some quick profiling I determined that a full import would take in
>>>> excess of 24 hours.
>>>>
>>>> - How would I profile the indexing process to determine if the bottleneck is
>>>> Solr or our Database.
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware in approx. 1 hour (give or take 30 minutes).
>>>
>>> When you say "quite large", what do you mean?  Are we talking books here or
>>> maybe a couple pages of text or just a couple KB of data?
>>>
>>> How long does it take you to get that data out (and, from the sounds of it,
>>> merge it with your item) w/o going to Solr?
>>>
>>>> - In either case, how would one speed up this process? Is there a way to run
>>>> parallel import processes and then merge them together at the end? Possibly
>>>> use some sort of distributed computing?
>>>
>>> DataImportHandler now supports multiple threads.  The absolute fastest way
>>> that I know of to index is via multiple threads sending batches of documents
>>> at a time (at least 100).  Often, from DBs one can split up the table via
>>> SQL statements that can then be fetched separately.  You may want to write
>>> your own multithreaded client to index.
>>
>> SOLR-1301 is also an option if you are familiar with Hadoop ...
>>
>
> If the bottleneck is the DB, will that do much?
>

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Importing large datasets

2010-06-02 Thread Blargy

Would dumping the databases to a local file help at all?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Erik Hatcher-4 wrote:
> 
> One thing that might help indexing speed - create a *single* SQL query  
> to grab all the data you need without using DIH's sub-entities, at  
> least the non-cached ones.
> 
>   Erik
> 
> On Jun 2, 2010, at 12:21 PM, Blargy wrote:
> 
>>
>>
>> As a data point, I routinely see clients index 5M items on normal  
>> hardware
>> in approx. 1 hour (give or take 30 minutes).
>>
>> Also wanted to add that our main entity (item) consists of 5 sub- 
>> entities
>> (ie, joins). 2 of those 5 are fairly small so I am using
>> CachedSqlEntityProcessor for them but the other 3 (which includes
>> item_description) are normal.
>>
>> All the entites minus the item_description connect to datasource1.  
>> They
>> currently point to one physical machine although we do have a pool  
>> of 3 DB's
>> that could be used if it helps. The other entity, item_description  
>> uses a
>> datasource2 which has a pool of 2 DB's that could potentially be  
>> used. Not
>> sure if that would help or not.
>>
>> I might as well that the item description will have indexed, stored  
>> and term
>> vectors set to true.
>> -- 
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

I can't find any example of creating a massive sql query. Any out there?
Will batching still work with this massive query?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Lance Norskog-2 wrote:
> 
> Wait! You're fetching records from one database and then doing lookups
> against another DB? That makes this a completely different problem.
> 
> The DIH does not to my knowledge have the ability to "pool" these
> queries. That is, it will not build a batch of 1000 keys from
> datasource1 and then do a query against datasource2 with:
> select foo where key_field IN (key1, key2,... key1000);
> 
> This is the efficient way to do what you want. You'll have to write
> your own client to do this.
> 
> On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
>  wrote:
>> How long does it take to do a grab of all the data via SQL? I found by
>> denormalizing the data into a lookup table meant that I was able to index
>> about 300k rows of similar data size with dih regex spilting on some
>> fields
>> in about 8mins I know it's not quite the scale bit with batching...
>>
>> David Stuar
>>
>> On 2 Jun 2010, at 17:58, Blargy  wrote:
>>
>>>
>>>
>>>
 One thing that might help indexing speed - create a *single* SQL query
 to grab all the data you need without using DIH's sub-entities, at
 least the non-cached ones.

>>>
>>> Not sure how much that would help. As I mentioned that without the item
>>> description import the full process takes 4 hours which is bearable.
>>> However
>>> once I started to import the item description which is located on a
>>> separate
>>> machine/database the import process exploded to over 24 hours.
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 

What's more efficient: a batch size of 1000 or -1 for MySQL? Is this why it's
so slow, because I am using 2 different datasources?

Say I am using just one datasource: should I still be seeing "Creating a
connection for entity " for each sub-entity in the document, or should it
just be using one connection?




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
That's promising!!! That's how I have been designing my project. It must be
all the joins that are causing the problems for him?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, David Stuart  wrote:

> From: David Stuart 
> Subject: Re: Importing large datasets
> To: "solr-user@lucene.apache.org" 
> Date: Wednesday, June 2, 2010, 12:00 PM
> How long does it take to do a grab of
> all the data via SQL? I found by denormalizing the data into
> a lookup table meant that I was able to index about 300k
> rows of similar data size with dih regex spilting on some
> fields in about 8mins I know it's not quite the scale bit
> with batching...
> 
> David Stuar
> 
> On 2 Jun 2010, at 17:58, Blargy 
> wrote:
> 
> > 
> > 
> > 
> >> One thing that might help indexing speed - create
> a *single* SQL query
> >> to grab all the data you need without using DIH's
> sub-entities, at
> >> least the non-cached ones.
> >> 
> > 
> > Not sure how much that would help. As I mentioned that
> without the item
> > description import the full process takes 4 hours
> which is bearable. However
> > once I started to import the item description which is
> located on a separate
> > machine/database the import process exploded to over
> 24 hours.
> > 
> > --View this message in context: 
> > http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
> > Sent from the Solr - User mailing list archive at
> Nabble.com.
> 


Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
When adding data continuously, that data is available after committing and is 
indexed, right?

If so, how often does reindexing do any good?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki  wrote:

> From: Andrzej Bialecki 
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 4:52 AM
> On 2010-06-02 13:12, Grant Ingersoll
> wrote:
> > 
> > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> > 
> >> On 2010-06-02 12:42, Grant Ingersoll wrote:
> >>>
> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> >>>
> >>>>
> >>>> We have around 5 million items in our
> index and each item has a description
> >>>> located on a separate physical database.
> These item descriptions vary in
> >>>> size and for the most part are quite
> large. Currently we are only indexing
> >>>> items and not their corresponding
> description and a full import takes around
> >>>> 4 hours. Ideally we want to index both our
> items and their descriptions but
> >>>> after some quick profiling I determined
> that a full import would take in
> >>>> excess of 24 hours. 
> >>>>
> >>>> - How would I profile the indexing process
> to determine if the bottleneck is
> >>>> Solr or our Database.
> >>>
> >>> As a data point, I routinely see clients index
> 5M items on normal
> >>> hardware in approx. 1 hour (give or take 30
> minutes).  
> >>>
> >>> When you say "quite large", what do you
> mean?  Are we talking books here or maybe a couple
> pages of text or just a couple KB of data?
> >>>
> >>> How long does it take you to get that data out
> (and, from the sounds of it, merge it with your item) w/o
> going to Solr?
> >>>
> >>>> - In either case, how would one speed up
> this process? Is there a way to run
> >>>> parallel import processes and then merge
> them together at the end? Possibly
> >>>> use some sort of distributed computing?
> >>>
> >>> DataImportHandler now supports multiple
> threads.  The absolute fastest way that I know of to
> index is via multiple threads sending batches of documents
> at a time (at least 100).  Often, from DBs one can
> split up the table via SQL statements that can then be
> fetched separately.  You may want to write your own
> multithreaded client to index.
> >>
> >> SOLR-1301 is also an option if you are familiar
> with Hadoop ...
> >>
> > 
> > If the bottleneck is the DB, will that do much?
> > 
> 
> Nope. But the workflow could be set up so that during night
> hours a DB
> export takes place that results in a CSV or SolrXML file
> (there you
> could measure the time it takes to do this export), and
> then indexing
> can work from this file.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _
> _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic
> Web
> ___|||__||  \|  ||  |  Embedded Unix,
> System Integration
> http://www.sigram.com  Contact: info at sigram dot
> com
> 
>


Re: Importing large datasets

2010-06-02 Thread Dennis Gearon
Well, I hope to have around 5 million datasets/documents within 1 year, so this
is good info. BUT if I DO have that many, then the market I am aiming at will
end up giving me 100 times more than that within 2 years.

Are there good references/books on using Solr/Lucene/(Linux/nginx) for 500
million plus documents? The data is easily shardable geographically, as one
given.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll  wrote:

> From: Grant Ingersoll 
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 3:42 AM
> 
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> 
> > 
> > We have around 5 million items in our index and each
> item has a description
> > located on a separate physical database. These item
> descriptions vary in
> > size and for the most part are quite large. Currently
> we are only indexing
> > items and not their corresponding description and a
> full import takes around
> > 4 hours. Ideally we want to index both our items and
> their descriptions but
> > after some quick profiling I determined that a full
> import would take in
> > excess of 24 hours. 
> > 
> > - How would I profile the indexing process to
> determine if the bottleneck is
> > Solr or our Database.
> 
> As a data point, I routinely see clients index 5M items on
> normal
> hardware in approx. 1 hour (give or take 30 minutes). 
> 
> 
> When you say "quite large", what do you mean?  Are we
> talking books here or maybe a couple pages of text or just a
> couple KB of data?
> 
> How long does it take you to get that data out (and, from
> the sounds of it, merge it with your item) w/o going to
> Solr?
> 
> > - In either case, how would one speed up this process?
> Is there a way to run
> > parallel import processes and then merge them together
> at the end? Possibly
> > use some sort of distributed computing?
> 
> DataImportHandler now supports multiple threads.  The
> absolute fastest way that I know of to index is via multiple
> threads sending batches of documents at a time (at least
> 100).  Often, from DBs one can split up the table via
> SQL statements that can then be fetched separately. 
> You may want to write your own multithreaded client to
> index.
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
> 
>


Re: Importing large datasets

2010-06-02 Thread Lance Norskog
Wait! You're fetching records from one database and then doing lookups
against another DB? That makes this a completely different problem.

The DIH does not to my knowledge have the ability to "pool" these
queries. That is, it will not build a batch of 1000 keys from
datasource1 and then do a query against datasource2 with:
select foo where key_field IN (key1, key2,... key1000);

This is the efficient way to do what you want. You'll have to write
your own client to do this.

On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
 wrote:
> How long does it take to do a grab of all the data via SQL? I found by
> denormalizing the data into a lookup table meant that I was able to index
> about 300k rows of similar data size with dih regex spilting on some fields
> in about 8mins I know it's not quite the scale bit with batching...
>
> David Stuar
>
> On 2 Jun 2010, at 17:58, Blargy  wrote:
>
>>
>>
>>
>>> One thing that might help indexing speed - create a *single* SQL query
>>> to grab all the data you need without using DIH's sub-entities, at
>>> least the non-cached ones.
>>>
>>
>> Not sure how much that would help. As I mentioned that without the item
>> description import the full process takes 4 hours which is bearable.
>> However
>> once I started to import the item description which is located on a
>> separate
>> machine/database the import process exploded to over 24 hours.
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Importing large datasets

2010-06-02 Thread David Stuart
How long does it take to do a grab of all the data via SQL? I found that
denormalizing the data into a lookup table meant that I was able to
index about 300k rows of similar data size, with DIH regex splitting on
some fields, in about 8 mins. I know it's not quite the same scale, but with
batching...
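As a sketch of that denormalization step (assuming the items/item_descriptions
tables from this thread; column names are only illustrative):

-- build a flat lookup table before the import so DIH reads a single table
CREATE TABLE item_search AS
SELECT i.id, i.title, d.description
FROM items i
LEFT JOIN item_descriptions d ON d.item_id = i.id;

-- an index on the key keeps per-document and delta lookups cheap
CREATE INDEX idx_item_search_id ON item_search (id);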


David Stuar

On 2 Jun 2010, at 17:58, Blargy  wrote:





> One thing that might help indexing speed - create a *single* SQL query
> to grab all the data you need without using DIH's sub-entities, at
> least the non-cached ones.

Not sure how much that would help. As I mentioned, without the item
description import the full process takes 4 hours, which is bearable. However,
once I started to import the item description, which is located on a separate
machine/database, the import process exploded to over 24 hours.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy



> One thing that might help indexing speed - create a *single* SQL query  
> to grab all the data you need without using DIH's sub-entities, at  
> least the non-cached ones.
> 

Not sure how much that would help. As I mentioned, without the item
description import the full process takes 4 hours, which is bearable. However,
once I started to import the item description, which is located on a separate
machine/database, the import process exploded to over 24 hours.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Erik Hatcher
One thing that might help indexing speed - create a *single* SQL query  
to grab all the data you need without using DIH's sub-entities, at  
least the non-cached ones.


Erik

On Jun 2, 2010, at 12:21 PM, Blargy wrote:




As a data point, I routinely see clients index 5M items on normal hardware
in approx. 1 hour (give or take 30 minutes).

Also wanted to add that our main entity (item) consists of 5 sub-entities
(i.e., joins). 2 of those 5 are fairly small so I am using
CachedSqlEntityProcessor for them but the other 3 (which includes
item_description) are normal.

All the entities minus the item_description connect to datasource1. They
currently point to one physical machine although we do have a pool of 3 DBs
that could be used if it helps. The other entity, item_description, uses
datasource2, which has a pool of 2 DBs that could potentially be used. Not
sure if that would help or not.

I might as well add that the item description will have indexed, stored and
term vectors set to true.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Importing large datasets

2010-06-02 Thread Blargy


As a data point, I routinely see clients index 5M items on normal hardware
in approx. 1 hour (give or take 30 minutes). 

Also wanted to add that our main entity (item) consists of 5 sub-entities
(i.e., joins). 2 of those 5 are fairly small so I am using
CachedSqlEntityProcessor for them but the other 3 (which includes
item_description) are normal.

All the entities minus the item_description connect to datasource1. They
currently point to one physical machine although we do have a pool of 3 DBs
that could be used if it helps. The other entity, item_description, uses
datasource2, which has a pool of 2 DBs that could potentially be used. Not
sure if that would help or not.

I might as well add that the item description will have indexed, stored and
term vectors set to true.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


Andrzej Bialecki wrote:
> 
> On 2010-06-02 12:42, Grant Ingersoll wrote:
>> 
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> 
>>>
>>> We have around 5 million items in our index and each item has a
>>> description
>>> located on a separate physical database. These item descriptions vary in
>>> size and for the most part are quite large. Currently we are only
>>> indexing
>>> items and not their corresponding description and a full import takes
>>> around
>>> 4 hours. Ideally we want to index both our items and their descriptions
>>> but
>>> after some quick profiling I determined that a full import would take in
>>> excess of 24 hours. 
>>>
>>> - How would I profile the indexing process to determine if the
>>> bottleneck is
>>> Solr or our Database.
>> 
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).  
>> 
>> When you say "quite large", what do you mean?  Are we talking books here
>> or maybe a couple pages of text or just a couple KB of data?
>> 
>> How long does it take you to get that data out (and, from the sounds of
>> it, merge it with your item) w/o going to Solr?
>> 
>>> - In either case, how would one speed up this process? Is there a way to
>>> run
>>> parallel import processes and then merge them together at the end?
>>> Possibly
>>> use some sort of distributed computing?
>> 
>> DataImportHandler now supports multiple threads.  The absolute fastest
>> way that I know of to index is via multiple threads sending batches of
>> documents at a time (at least 100).  Often, from DBs one can split up the
>> table via SQL statements that can then be fetched separately.  You may
>> want to write your own multithreaded client to index.
> 
> SOLR-1301 is also an option if you are familiar with Hadoop ...
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 

I haven't worked with Hadoop before but I'm willing to try anything to cut
down this full import time. I see this currently uses the embedded Solr
server for indexing... would I have to scrap my DIH importing then?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Blargy


As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

Our master Solr machine is running 64-bit RHEL 5.4 on a dedicated machine with
4 cores and 16G of RAM, so I think we are good on the hardware. Our DB is MySQL
version 5.0.67 (exact stats I don't know off the top of my head).


When you say "quite large", what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an ebay listing and can include
HTML. We are talking about a couple of pages of text.


How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr? 

I'll have to get back to you on that one.


DataImportHandler now supports multiple threads. 

When you say "now", what do you mean? I am running version 1.4.


The absolute fastest way that I know of to index is via multiple threads
sending batches of documents at a time (at least 100)

 Is there a wiki explaining how this multiple thread process works? Which
batch size would work best? I am currently using a -1 batch size. 


You may want to write your own multithreaded client to index. 

This sounds like a viable option. Can you point me in the right direction on
where to begin (what classes to look at, prior examples, etc)?

Here is the field type I am using for the item description. Maybe it's not the
best?

  
  





  


Here is an overview of my data-config.xml. Thoughts?

 
 ...

 

I appreciate the help.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 13:12, Grant Ingersoll wrote:
> 
> On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> 
>> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>>
>>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>

 We have around 5 million items in our index and each item has a description
 located on a separate physical database. These item descriptions vary in
 size and for the most part are quite large. Currently we are only indexing
 items and not their corresponding description and a full import takes 
 around
 4 hours. Ideally we want to index both our items and their descriptions but
 after some quick profiling I determined that a full import would take in
 excess of 24 hours. 

 - How would I profile the indexing process to determine if the bottleneck 
 is
 Solr or our Database.
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware in approx. 1 hour (give or take 30 minutes).  
>>>
>>> When you say "quite large", what do you mean?  Are we talking books here or 
>>> maybe a couple pages of text or just a couple KB of data?
>>>
>>> How long does it take you to get that data out (and, from the sounds of it, 
>>> merge it with your item) w/o going to Solr?
>>>
 - In either case, how would one speed up this process? Is there a way to 
 run
 parallel import processes and then merge them together at the end? Possibly
 use some sort of distributed computing?
>>>
>>> DataImportHandler now supports multiple threads.  The absolute fastest way 
>>> that I know of to index is via multiple threads sending batches of 
>>> documents at a time (at least 100).  Often, from DBs one can split up the 
>>> table via SQL statements that can then be fetched separately.  You may want 
>>> to write your own multithreaded client to index.
>>
>> SOLR-1301 is also an option if you are familiar with Hadoop ...
>>
> 
> If the bottleneck is the DB, will that do much?
> 

Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.


-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll

On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

> On 2010-06-02 12:42, Grant Ingersoll wrote:
>> 
>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> 
>>> 
>>> We have around 5 million items in our index and each item has a description
>>> located on a separate physical database. These item descriptions vary in
>>> size and for the most part are quite large. Currently we are only indexing
>>> items and not their corresponding description and a full import takes around
>>> 4 hours. Ideally we want to index both our items and their descriptions but
>>> after some quick profiling I determined that a full import would take in
>>> excess of 24 hours. 
>>> 
>>> - How would I profile the indexing process to determine if the bottleneck is
>>> Solr or our Database.
>> 
>> As a data point, I routinely see clients index 5M items on normal
>> hardware in approx. 1 hour (give or take 30 minutes).  
>> 
>> When you say "quite large", what do you mean?  Are we talking books here or 
>> maybe a couple pages of text or just a couple KB of data?
>> 
>> How long does it take you to get that data out (and, from the sounds of it, 
>> merge it with your item) w/o going to Solr?
>> 
>>> - In either case, how would one speed up this process? Is there a way to run
>>> parallel import processes and then merge them together at the end? Possibly
>>> use some sort of distributed computing?
>> 
>> DataImportHandler now supports multiple threads.  The absolute fastest way 
>> that I know of to index is via multiple threads sending batches of documents 
>> at a time (at least 100).  Often, from DBs one can split up the table via 
>> SQL statements that can then be fetched separately.  You may want to write 
>> your own multithreaded client to index.
> 
> SOLR-1301 is also an option if you are familiar with Hadoop ...
> 

If the bottleneck is the DB, will that do much?

Re: Importing large datasets

2010-06-02 Thread Andrzej Bialecki
On 2010-06-02 12:42, Grant Ingersoll wrote:
> 
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> 
>>
>> We have around 5 million items in our index and each item has a description
>> located on a separate physical database. These item descriptions vary in
>> size and for the most part are quite large. Currently we are only indexing
>> items and not their corresponding description and a full import takes around
>> 4 hours. Ideally we want to index both our items and their descriptions but
>> after some quick profiling I determined that a full import would take in
>> excess of 24 hours. 
>>
>> - How would I profile the indexing process to determine if the bottleneck is
>> Solr or our Database.
> 
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).  
> 
> When you say "quite large", what do you mean?  Are we talking books here or 
> maybe a couple pages of text or just a couple KB of data?
> 
> How long does it take you to get that data out (and, from the sounds of it, 
> merge it with your item) w/o going to Solr?
> 
>> - In either case, how would one speed up this process? Is there a way to run
>> parallel import processes and then merge them together at the end? Possibly
>> use some sort of distributed computing?
> 
> DataImportHandler now supports multiple threads.  The absolute fastest way 
> that I know of to index is via multiple threads sending batches of documents 
> at a time (at least 100).  Often, from DBs one can split up the table via SQL 
> statements that can then be fetched separately.  You may want to write your 
> own multithreaded client to index.

SOLR-1301 is also an option if you are familiar with Hadoop ...



-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Importing large datasets

2010-06-02 Thread Grant Ingersoll

On Jun 1, 2010, at 9:54 PM, Blargy wrote:

> 
> We have around 5 million items in our index and each item has a description
> located on a separate physical database. These item descriptions vary in
> size and for the most part are quite large. Currently we are only indexing
> items and not their corresponding description and a full import takes around
> 4 hours. Ideally we want to index both our items and their descriptions but
> after some quick profiling I determined that a full import would take in
> excess of 24 hours. 
> 
> - How would I profile the indexing process to determine if the bottleneck is
> Solr or our Database.

As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).  

When you say "quite large", what do you mean?  Are we talking books here or 
maybe a couple pages of text or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds of it, 
merge it with your item) w/o going to Solr?

> - In either case, how would one speed up this process? Is there a way to run
> parallel import processes and then merge them together at the end? Possibly
> use some sort of distributed computing?

DataImportHandler now supports multiple threads.  The absolute fastest way that 
I know of to index is via multiple threads sending batches of documents at a 
time (at least 100).  Often, from DBs one can split up the table via SQL 
statements that can then be fetched separately.  You may want to write your own 
multithreaded client to index.
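As a sketch of that kind of split (assuming a numeric primary key named id),
each thread or process can fetch a disjoint slice of the table and post its own
batches of documents:

-- thread 0 of 4
SELECT * FROM items WHERE MOD(id, 4) = 0;
-- thread 1 of 4
SELECT * FROM items WHERE MOD(id, 4) = 1;
-- ...and so on; range predicates (WHERE id BETWEEN x AND y) work just as well.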

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search