RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Thank you everyone for your patience and suggestions.

It turns out I was doing something really unreasonable in my schema. I 
mistakenly edited the max EdgeNgram size to 512, when I meant to set the 
lengthFilter max to 512. I brought this to a more reasonable number, and my 
estimated time to import is now down to 4 hours. Based on the size of my record 
set, this time is more consistent with Walter's observations in his own project.
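In schema.xml terms, the mix-up looked roughly like this (minGramSize and the corrected maxGramSize are illustrative; my exact values aren't in this message):

```xml
<!-- before: the length cap accidentally went on the n-gram filter -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="512"/>

<!-- after: modest n-gram ceiling; 512 moves to LengthFilter where it belongs -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
<filter class="solr.LengthFilterFactory" min="3" max="512"/>
```

With maxGramSize at 512, every token was expanding into up to 510 edge n-grams, which explains the import time.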

Thanks again for your help,

Devon Baumgarten

-Original Message-
From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com] 
Sent: Wednesday, February 22, 2012 12:42 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unusually long data import time?

Ahmet,

I do not. I commented autoCommit out.

Devon Baumgarten



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

> Would it be unusual for an import of 160 million documents
> to take 18 hours?  Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
> 
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml? 
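For reference, an autoCommit block in solrconfig.xml looks roughly like this (the values are only a starting point; whether enabling it helps depends on your setup and Solr version):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit after this many pending docs... -->
    <maxDocs>100000</maxDocs>
    <!-- ...or after this many milliseconds, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

During very large imports, periodic commits bound the amount of uncommitted work Solr has to hold.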


RE: Unusually long data import time?

2012-02-22 Thread Devon Baumgarten
Walter,

Do you mean sub-entities in your database, or something else?

The data I am feeding DIH is from a select * (no joins or WHERE clause) on a 
table with:

int, int, varchar(32), varchar(32), varchar(512) (this one is the Name), 
varchar(512), datetime

If it matters, the select * is happening in a stored procedure.

Devon Baumgarten


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, February 22, 2012 11:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

In my first try with the DIH, I had several sub-entities and it was making six 
queries per document. My 20M doc load was going to take many hours, most of a 
day. I re-wrote it to eliminate those, and now it makes a single query for the 
whole load and takes 70 minutes. These are small documents, just the metadata 
for each book.
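The before/after is easiest to see in data-config.xml; this is a made-up example, not my actual config:

```xml
<!-- before: a sub-entity fires an extra query per parent row -->
<entity name="book" query="SELECT id, title, author_id FROM books">
  <entity name="author"
          query="SELECT name FROM authors WHERE id = '${book.author_id}'"/>
</entity>

<!-- after: one denormalized query for the whole load -->
<entity name="book"
        query="SELECT b.id, b.title, a.name AS author
               FROM books b JOIN authors a ON a.id = b.author_id"/>
```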

wunder
Search Guy
Chegg

On Feb 22, 2012, at 9:41 AM, Devon Baumgarten wrote:

> I changed the heap size (Xmx1582m was as high as I could go). The import is 
> at about 5% now, and from that I now estimate about 13 hours. It's hard to 
> say though... it keeps going up little by little.
> 
> If I get approval to use Solr for this project, I'll have them install a 
> 64-bit JVM instead, but is there anything else I can do?
> 
> 
> Devon Baumgarten
> Application Developer
> 
> 
> -----Original Message-
> From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com] 
> Sent: Wednesday, February 22, 2012 10:32 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Unusually long data import time?
> 
> Oh sure! As best as I can, anyway.
> 
> I have not set the Java heap size, or really configured it at all. 
> 
> The server running both the SQL Server and Solr has:
> * 2 Intel Xeon X5660 (each one is 2.8 GHz, 6 cores, 12 logical processors)
> * 64 GB RAM
> * One Solr instance (no shards)
> 
> I'm not using faceting.
> My schema has these fields:
>   [field definitions stripped by the list archive; two of the text 
>   fields carried termVectors="true"]
> 
> Custom types:
> 
> *LikeText
>   PatternReplaceCharFilterFactory ("\W+" => "")
>   KeywordTokenizerFactory 
>   StopFilterFactory (~40 words in stoplist)
>   ASCIIFoldingFilterFactory
>   LowerCaseFilterFactory
>   EdgeNGramFilterFactory
>   LengthFilterFactory (min:3, max:512)
> 
> *FuzzyText
>   PatternReplaceCharFilterFactory ("\W+" => "")
>   KeywordTokenizerFactory 
>   StopFilterFactory (~40 words in stoplist)
>   ASCIIFoldingFilterFactory
>   LowerCaseFilterFactory
>   NGramFilterFactory
>   LengthFilterFactory (min:3, max:512)
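Reconstructed as a schema.xml fieldType, the LikeText chain above would look roughly like this (attribute values not shown in the listing are guesses; per the thread's eventual resolution, the EdgeNGram max here was mistakenly 512):

```xml
<fieldType name="LikeText" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\W+" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 512 here was the bug; the length cap belongs on LengthFilter instead -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="512"/>
    <filter class="solr.LengthFilterFactory" min="3" max="512"/>
  </analyzer>
</fieldType>
```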
> 
> Devon Baumgarten
> 
> 
> -Original Message-
> From: Glen Newton [mailto:glen.new...@gmail.com] 
> Sent: Wednesday, February 22, 2012 9:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Unusually long data import time?
> 
> Import times will depend on:
> - hardware (speed of disks, cpu, # of cpus, amount of memory, etc)
> - Java configuration (heap size, etc)
> - Lucene/Solr configuration (many ...)
> - Index configuration - how many fields, indexed how; faceting, etc
> - OS configuration (this usually matters to a lesser degree; _usually_)
> - Network issues if non-local
> - DB configuration (driver, etc)
> 
> If you can give more information about the above, people on this list
> should be able to better indicate whether 18 hours sounds right for
> your situation.
> 
> -Glen Newton
> 
> On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
>  wrote:
>> Hello,
>> 
>> Would it be unusual for an import of 160 million documents to take 18 hours? 
>>  Each document is less than 1kb and I have the DataImportHandler using the 
>> jdbc driver to connect to SQL Server 2008. The full-import query calls a 
>> stored procedure that contains only a select from my target table.
>> 
>> Is there any way I can speed this up? I saw recently someone on this list 
>> suggested a new user could get all their Solr data imported in under an 
>> hour. I sure hope that's true!
>> 
>> 
>> Devon Baumgarten
>> 
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -



RE: Solr, SQL Server's LIKE

2012-01-04 Thread Devon Baumgarten
Great suggestion! Thanks for keeping it simple for a complete Solr newbie.

I'm going to go try this right now.

Thanks!
Devon Baumgarten


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, January 02, 2012 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results 
> for this particular search to be fuzzy. How can I prevent the fuzzy matches 
> from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather 
> than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.
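A sketch of what that asymmetric analysis might look like (the type name and gram sizes are mine, not from your schema):

```xml
<fieldType name="substring_match" class="solr.TextField">
  <!-- index side: break each value into n-grams -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="32"/>
  </analyzer>
  <!-- query side: no n-grams, so the query term must match a stored gram exactly -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```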

Thanks,
Shawn



RE: Solr, SQL Server's LIKE

2011-12-30 Thread Devon Baumgarten
Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for 
instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that 
I didn't need n-grams to use the wildcard. You asking for me to clarify what I 
meant made me realize that the n-grams are the source of all my current 
problems. :)

Thanks!

Devon Baumgarten


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, December 29, 2011 7:00 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr, SQL Server's LIKE


: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
: rather than having a low score.

please be specific about the types of queries you want. ie: we need more 
then one example of the type of input you want to provide, the type of 
matches you want to see for that input, and the type of matches you want 
to get back.

in your first message you said you need to match company titles "pretty 
exactly" but then seem to contradict yourself by saying that SQL's LIKE 
command fits the bill -- even though the SQL LIKE command exists 
specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything 
special: don't use ngrams, don't use stemming, don't use fuzzy anything -- 
just search for "Albatross" and it will match "Albatross" but not 
"Albert".  if you want "Albatross" to match "Albatross Road" use some 
basic tokenization.

If all you really care about is prefix searching (which seems suggested by 
your "LIKE%" comment above, which I'm guessing is shorthand for something 
similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both 
match "abcdef" and "abcd" but neither of them matches "xabcd", then just 
use prefix queries (ie: "abcd*") -- they should be plenty 
efficient for your purposes.  you only need to worry about ngrams when you 
want to efficiently match in the middle of a string. (ie: "TITLE LIKE 
'%ABC%'")
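a minimal sketch of such a field (names illustrative) -- note there is no ngram filter anywhere:

```xml
<fieldType name="name_prefix" class="solr.TextField">
  <analyzer>
    <!-- keep the whole name as one lowercased token, so "abcd*"
         matches only from the start of the string -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

then query it with something like q=name:abcd* .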


-Hoss


RE: Solr, SQL Server's LIKE

2011-12-29 Thread Devon Baumgarten
Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching 
capabilities in other search types in this project. The product manager wants 
this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results 
for this particular search to be fuzzy. How can I prevent the fuzzy matches 
from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather 
than having a low score.

Devon Baumgarten


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, December 29, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

SQL's "like" is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are "interesting"
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot,
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant  wrote:
> for a simple, hackish (albeit inefficient) approach, look up wildcard searches
>
> e.g. foo*, *bar
>
>
>
> On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten
>  wrote:
>> I have been tinkering with Solr for a few weeks, and I am convinced that it 
>> could be very helpful in many of my upcoming projects. I am trying to decide 
>> whether Solr is appropriate for this one, and I haven't had luck looking for 
>> answers on Google.
>>
>> I need to search a list of names of companies and individuals pretty 
>> exactly. T-SQL's LIKE operator does this with decent performance, but I have 
>> a feeling there is a way to configure Solr to do this better. I've tried 
>> using an edge N-gram tokenizer, but it feels like it might be more 
>> complicated than necessary. What would you suggest?
>>
>> I know this sounds kind of 'Golden Hammer,' but there has been talk of 
>> other, more complicated (magic) searches that I don't think SQL Server can 
>> handle, since its tokens (as far as I know) can't be smaller than one word.
>>
>> Thanks,
>>
>> Devon Baumgarten
>>




RE: Removing whitespace

2011-12-12 Thread Devon Baumgarten
Thanks Alireza, Steven and Koji for the quick responses!

I'll read up on those and give it a shot.

Devon Baumgarten

-Original Message-
From: Alireza Salimi [mailto:alireza.sal...@gmail.com] 
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section, maybe it helps.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
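For the "bobdole" matching "Bob Dole" case, a hypothetical fieldType using a CharFilter (the pattern and gram sizes are illustrative):

```xml
<fieldType name="squashed_name" class="solr.TextField">
  <analyzer>
    <!-- strip whitespace and punctuation before tokenizing:
         "Bo B. Dole" becomes "BoBDole" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\W+" replacement=""/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="10"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

CharFilters run before the tokenizer, which is why no custom Tokenizer is needed here.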


On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten <
dbaumgar...@nationalcorp.com> wrote:

> Hello,
>
> I am having trouble finding how to remove/ignore whitespace when indexing.
> The only answer I have found suggested that it is necessary to write my own
> tokenizer. Is this true? I want to remove whitespace and special characters
> from the phrase and create N-grams from the result.
>
> Ultimately, the effect I am after is that searching "bobdole" would match
> "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way...
> can anyone lend some assistance?
>
> Thanks!
>
> Dev B
>
>


-- 
Alireza Salimi
Java EE Developer

