RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Thank you everyone for your patience and suggestions.

It turns out I was doing something really unreasonable in my schema. I 
mistakenly set the EdgeNGram filter's maximum gram size to 512, when I meant 
to set the LengthFilter max to 512. I brought the gram size down to a more 
reasonable number, and my estimated import time is now down to 4 hours. Based 
on the size of my record set, this time is more consistent with Walter's 
observations in his own project.
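
For comparison, here is a rough sketch of the average indexing rates implied by
the figures quoted in this thread (160 million documents in an estimated 18
hours, then 4 hours; Walter's 20 million documents in 70 minutes). The function
names are illustrative only, and the rates are plain averages that ignore
warm-up, commits and merges.

```python
# Average indexing rates implied by the numbers quoted in this thread.
DOCS_DEVON = 160_000_000   # documents in Devon's import
DOCS_WALTER = 20_000_000   # documents in Walter's load


def docs_per_second(docs: int, hours: float) -> float:
    """Average rate for a load of `docs` documents that takes `hours`."""
    return docs / (hours * 3600)


def hours_at_rate(docs: int, rate: float) -> float:
    """Hours needed to load `docs` documents at `rate` documents per second."""
    return docs / rate / 3600


original = docs_per_second(DOCS_DEVON, 18)       # first estimate
fixed = docs_per_second(DOCS_DEVON, 4)           # after the schema fix
walter = docs_per_second(DOCS_WALTER, 70 / 60)   # Walter's 70-minute load

print(f"original estimate: {original:,.0f} docs/s")
print(f"after schema fix:  {fixed:,.0f} docs/s")
print(f"Walter's load:     {walter:,.0f} docs/s")
print(f"160M docs at Walter's rate: {hours_at_rate(DOCS_DEVON, walter):.1f} h")
```

On these averages the original estimate was roughly half of Walter's rate, and
the corrected schema is in the same range as his.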

Thanks again for your help,

Devon Baumgarten

-Original Message-
From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com] 
Sent: Wednesday, February 22, 2012 12:42 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unusually long data import time?

Ahmet,

I do not. I commented autoCommit out.

Devon Baumgarten



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

> Would it be unusual for an import of 160 million documents
> to take 18 hours?  Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
> 
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?


Re: Unusually long data import time?

2012-02-22 Thread eks dev
Devon, you ought to try updating from many threads (I do not know whether
DIH can do it, so check), because Lucene does a great job when fed from many
update threads.

It depends on where your time gets lost, but it is usually a) the analysis
chain or b) the database.

If it is a) and your server has spare CPU cores, you can scale throughput by
roughly the number of cores.
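
A minimal sketch of feeding updates from several threads, assuming the
documents are sent by a client program rather than by DIH. The `send_batch`
callable is hypothetical: it stands in for whatever actually posts a batch to
Solr and returns the number of documents it sent. The thread count and batch
size are placeholder values, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List

Doc = dict


def batches(docs: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Split a stream of documents into lists of at most `size` items."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def feed(docs: Iterable[Doc],
         send_batch: Callable[[List[Doc]], int],
         threads: int = 4,
         batch_size: int = 1000) -> int:
    """Send batches from `threads` worker threads; return documents sent.

    Note: Executor.map submits every batch up front, so a very large load
    would need a bounded queue instead of this simple form.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(send_batch, batches(docs, batch_size)))


if __name__ == "__main__":
    # Stand-in sender that only counts documents (no network involved).
    sent = feed(({"id": i} for i in range(2500)), send_batch=len)
    print(sent)
```

Whether this helps depends on where the time goes: extra threads only pay off
if the analysis chain is the bottleneck and there are idle cores.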

On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten
 wrote:
> Ahmet,
>
> I do not. I commented autoCommit out.
>
> Devon Baumgarten
>
>
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Wednesday, February 22, 2012 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Unusually long data import time?
>
>> Would it be unusual for an import of 160 million documents
>> to take 18 hours?  Each document is less than 1kb and I
>> have the DataImportHandler using the jdbc driver to connect
>> to SQL Server 2008. The full-import query calls a stored
>> procedure that contains only a select from my target table.
>>
>> Is there any way I can speed this up? I saw recently someone
>> on this list suggested a new user could get all their Solr
>> data imported in under an hour. I sure hope that's true!
>
> Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Ahmet,

I do not. I commented autoCommit out.

Devon Baumgarten



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

> Would it be unusual for an import of 160 million documents
> to take 18 hours?  Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
> 
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?


Re: Unusually long data import time?

2012-02-22 Thread Ahmet Arslan
> Would it be unusual for an import of 160 million documents
> to take 18 hours?  Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
> 
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Walter,

Do you mean sub-entities in your database, or something else?

The data I am feeding DIH is from a select * (no joins or WHERE clause) on a 
table with:

int, int, varchar(32), varchar(32), varchar(512) (this one is the Name), 
varchar(512), datetime

If it matters, the select * is happening in a stored procedure.

Devon Baumgarten


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, February 22, 2012 11:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

In my first try with the DIH, I had several sub-entities and it was making six 
queries per document. My 20M doc load was going to take many hours, most of a 
day. I re-wrote it to eliminate those, and now it makes a single query for the 
whole load and takes 70 minutes. These are small documents, just the metadata 
for each book.

wunder
Search Guy
Chegg

On Feb 22, 2012, at 9:41 AM, Devon Baumgarten wrote:

> I changed the heap size (Xmx1582m was as high as I could go). The import is 
> at about 5% now, and from that I now estimate about 13 hours. It's hard to 
> say though.. it keeps going up little by little.
> 
> If I get approval to use Solr for this project, I'll have them install a 
> 64bit jvm instead, but is there anything else I can do?
> 
> 
> Devon Baumgarten
> Application Developer
> 
> 
> -Original Message-
> From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com] 
> Sent: Wednesday, February 22, 2012 10:32 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Unusually long data import time?
> 
> Oh sure! As best as I can, anyway.
> 
> I have not set the Java heap size, or really configured it at all. 
> 
> The server running both the SQL Server and Solr has:
> * 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
> * 64 GB RAM
> * One Solr instance (no shards)
> 
> I'm not using faceting.
> My schema has these fields:
>   
>   
>   
>   termVectors="true" /> 
>   termVectors="true" /> 
>   
>  
> 
> Custom types:
> 
> *LikeText
>   PatternReplaceCharFilterFactory ("\W+" => "")
>   KeywordTokenizerFactory 
>   StopFilterFactory (~40 words in stoplist)
>   ASCIIFoldingFilterFactory
>   LowerCaseFilterFactory
>   EdgeNGramFilterFactory
>   LengthFilterFactory (min:3, max:512)
> 
> *FuzzyText
>   PatternReplaceCharFilterFactory ("\W+" => "")
>   KeywordTokenizerFactory 
>   StopFilterFactory (~40 words in stoplist)
>   ASCIIFoldingFilterFactory
>   LowerCaseFilterFactory
>   NGramFilterFactory
>   LengthFilterFactory (min:3, max:512)
> 
> Devon Baumgarten
> 
> 
> -Original Message-
> From: Glen Newton [mailto:glen.new...@gmail.com] 
> Sent: Wednesday, February 22, 2012 9:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Unusually long data import time?
> 
> Import times will depend on:
> - hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
> - Java configuration (heap size, etc)
> - Lucene/Solr configuration (many ...)
> - Index configuration - how many fields, indexed how; faceting, etc
> - OS configuration (this usually to a lesser degree; _usually_)
> - Network issues if non-local
> - DB configuration (driver, etc)
> 
> If you can give more information about the above, people on this list
> should be able to better indicate whether 18 hours sounds right for
> your situation.
> 
> -Glen Newton
> 
> On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
>  wrote:
>> Hello,
>> 
>> Would it be unusual for an import of 160 million documents to take 18 hours? 
>>  Each document is less than 1kb and I have the DataImportHandler using the 
>> jdbc driver to connect to SQL Server 2008. The full-import query calls a 
>> stored procedure that contains only a select from my target table.
>> 
>> Is there any way I can speed this up? I saw recently someone on this list 
>> suggested a new user could get all their Solr data imported in under an 
>> hour. I sure hope that's true!
>> 
>> 
>> Devon Baumgarten
>> 
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -







Re: Unusually long data import time?

2012-02-22 Thread Walter Underwood
In my first try with the DIH, I had several sub-entities and it was making six 
queries per document. My 20M doc load was going to take many hours, most of a 
day. I re-wrote it to eliminate those, and now it makes a single query for the 
whole load and takes 70 minutes. These are small documents, just the metadata 
for each book.
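
To illustrate the difference, here is a small sketch using an in-memory SQLite
database. The `book` and `author` tables are made up for the example, and this
is not DIH itself; it only counts how many queries each loading style issues.
At six queries per document, a 20M document load means about 120 million round
trips to the database, against one for a single joined query.

```python
import sqlite3


class QueryCounter:
    """Wraps a connection and counts the queries run through it."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.count = 0

    def run(self, sql: str, params: tuple = ()) -> list:
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def make_db() -> sqlite3.Connection:
    """Create a tiny hypothetical schema with three books."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE author (book_id INTEGER, name TEXT);
        INSERT INTO book VALUES (1, 'A'), (2, 'B'), (3, 'C');
        INSERT INTO author VALUES (1, 'x'), (2, 'y'), (3, 'z');
    """)
    return conn


def load_with_sub_queries(db: QueryCounter) -> list:
    """One query for the parent rows, plus one extra query per row."""
    docs = []
    for book_id, title in db.run("SELECT id, title FROM book ORDER BY id"):
        name = db.run("SELECT name FROM author WHERE book_id = ?",
                      (book_id,))[0][0]
        docs.append((book_id, title, name))
    return docs


def load_with_join(db: QueryCounter) -> list:
    """A single joined query for the whole load."""
    return db.run(
        "SELECT b.id, b.title, a.name FROM book b "
        "JOIN author a ON a.book_id = b.id ORDER BY b.id"
    )


if __name__ == "__main__":
    for loader in (load_with_sub_queries, load_with_join):
        db = QueryCounter(make_db())
        docs = loader(db)
        print(loader.__name__, len(docs), "docs,", db.count, "queries")
```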

wunder
Search Guy
Chegg


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
I changed the heap size (Xmx1582m was as high as I could go). The import is at 
about 5% now, and from that I now estimate about 13 hours. It's hard to say 
though.. it keeps going up little by little.

If I get approval to use Solr for this project, I'll have them install a 64bit 
jvm instead, but is there anything else I can do?


Devon Baumgarten
Application Developer


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Oh sure! As best as I can, anyway.

I have not set the Java heap size, or really configured it at all. 

The server running both the SQL Server and Solr has:
* 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting.
My schema has these fields:
   
   
   
   
   
   
  

Custom types:

*LikeText
PatternReplaceCharFilterFactory ("\W+" => "")
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
EdgeNGramFilterFactory
LengthFilterFactory (min:3, max:512)

*FuzzyText
PatternReplaceCharFilterFactory ("\W+" => "")
KeywordTokenizerFactory 
StopFilterFactory (~40 words in stoplist)
ASCIIFoldingFilterFactory
LowerCaseFilterFactory
NGramFilterFactory
LengthFilterFactory (min:3, max:512)
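
Because both chains use KeywordTokenizerFactory, the whole field value is one
token, and the n-gram filters then expand it. The sketch below is a simplified
model of that expansion, not Lucene's implementation. It assumes a minimum
gram size of 1 and compares a maximum of 512 with a hypothetical smaller
maximum of 15, on a made-up 100-character value.

```python
def edge_ngrams(text: str, min_size: int, max_size: int) -> list:
    """Prefixes of `text` with lengths from min_size to max_size."""
    longest = min(max_size, len(text))
    return [text[:n] for n in range(min_size, longest + 1)]


def ngrams(text: str, min_size: int, max_size: int) -> list:
    """All substrings of `text` with lengths from min_size to max_size."""
    longest = min(max_size, len(text))
    return [text[i:i + n]
            for n in range(min_size, longest + 1)
            for i in range(len(text) - n + 1)]


def length_filter(tokens: list, min_len: int, max_len: int) -> list:
    """Keep tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


if __name__ == "__main__":
    value = "x" * 100  # stand-in for one 100-character field value
    for max_size in (15, 512):
        edge = length_filter(edge_ngrams(value, 1, max_size), 3, 512)
        full = length_filter(ngrams(value, 1, max_size), 3, 512)
        print(f"max gram {max_size}: {len(edge)} edge n-grams, "
              f"{len(full)} n-grams after the length filter")
```

The length filter runs after the n-gram filter, so it only drops tokens that
have already been generated; it does not limit how many the n-gram filter
produces.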

Devon Baumgarten


Re: Unusually long data import time?

2012-02-22 Thread Glen Newton
Import times will depend on:
- hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
- Java configuration (heap size, etc)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)

If you can give more information about the above, people on this list
should be able to better indicate whether 18 hours sounds right for
your situation.

-Glen Newton

On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
 wrote:
> Hello,
>
> Would it be unusual for an import of 160 million documents to take 18 hours?  
> Each document is less than 1kb and I have the DataImportHandler using the 
> jdbc driver to connect to SQL Server 2008. The full-import query calls a 
> stored procedure that contains only a select from my target table.
>
> Is there any way I can speed this up? I saw recently someone on this list 
> suggested a new user could get all their Solr data imported in under an hour. 
> I sure hope that's true!
>
>
> Devon Baumgarten
>
>



-- 
-
http://zzzoot.blogspot.com/
-