Re: Importing large datasets
On 3 Jun 2010, at 03:51, Blargy wrote: Would dumping the databases to a local file help at all? I would suspect not, especially with the size of your data. But it would be good to know how long that takes, i.e. if you create a SQL script that just pulls that data out, how long does that take? Also, how many fields are you indexing per document: 10, 50, 100? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Search problem; cannot search the existing word in the index content
Hi Yandong, You are right. It works!!! You are the best. Thanks, Mint 2010/6/3 Zero Yao > Modify all settings in solrconfig.xml and try again, by > default solr will only index the first 1 fields. > > Best Regards, > Yandong > > -Original Message- > From: Mint o_O! [mailto:mint@gmail.com] > Sent: 2010年6月3日 13:58 > To: solr-user@lucene.apache.org > Subject: Re: Solr Search problem; cannot search the existing word in the > index content > > Thanks for you advice. I did as you said and i still cannot search my > content. > > One thing i notice here i can search for only the words within first 100 > rows or maybe bigger than this not sure but not all. So is it the > limitation > of the index it self? When I create another sample content with only small > amount of data. It's working great!!! > My content is around 1.2M. I stored it as the text field as in the > schema.xml sample file. > > Anyone has the same issue with me? > > thanks, > > Mint > > On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > > > backslash*rhode > > \*rhode may work. > > > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > > > wrote: > > > A couple of things: > > > 1> try searching with &debugQuery=on attached to your URL, that'll > > > give you some clues. > > > 2> It's really worthwhile exploring the admin pages for a while, it'll > > also > > > give you a world of information. It takes a while to understand what > the > > > various pages are telling you, but you'll come to rely on them. > > > 3> Are you really searching with leading and trailing wildcards or is > > that > > > just the mail changing bolding? Because this is tricky, very tricky. > > Search > > > the mail archives for "leading wildcard" to see lots of discussion of > > this > > > topic. > > > > > > You might back off a bit and try building up to wildcards if that's > what > > > you're doing > > > > > > HTH > > > Erick > > > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > > > >> Hi, > > >> > > >> I'm working on the index/search project recently and i found solr > which > > is > > >> very fascinating to me. > > >> > > >> I followed the test successful from the tutorial page. Starting up > jetty > > >> and > > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar > post.jar > > >> *.xml*) so far so good at this stage. > > >> > > >> Now i have create my own testing westpac.xml file with real data I > > intend > > >> to > > >> implement, putting in exampledocs and again ran the command > > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > > >> Everything went on very well however when i searched for "*rhode*" > which > > is > > >> in the content. And Index returned nothing. > > >> > > >> Could anyone guide me what I did wrong why i couldn't search for that > > word > > >> even though that word is in my index content. > > >> > > >> thanks, > > >> > > >> Mint > > >> > > > > > > > > > > > -- > > Lance Norskog > > goks...@gmail.com > > >
Re: Importing large datasets
On 3 Jun 2010, at 02:51, Dennis Gearon wrote: Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end giving me 100 times more than than within 2 years. Are there good references/books on using Solr/Lucen/(linux/nginx) for 500 million plus documents? As far as I'm aware there aren't any books yet that cover this for solr. The wiki, this mailing list, nabble are your best sources and there have been some quite indepth conversations on the matter in this list in the past The data is easily shardible geographially, as one given. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Grant Ingersoll wrote: From: Grant Ingersoll Subject: Re: Importing large datasets To: solr-user@lucene.apache.org Date: Wednesday, June 2, 2010, 3:42 AM On Jun 1, 2010, at 9:54 PM, Blargy wrote: We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
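Grant's suggestion quoted above (multiple threads, each sending batches of at least 100 documents) might look roughly like the following SolrJ sketch. This is only a minimal illustration, not anyone's actual setup: the URL, field names, thread count and the fabricated documents are placeholders, and a real client would read its rows from the database instead.

import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // One shared server instance; CommonsHttpSolrServer can be shared across threads.
    final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    final int threads = 4;        // placeholder thread count
    final int batchSize = 100;    // "at least 100" docs per add, as suggested above
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      final int slice = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);
            // In a real client this loop would read one "slice" of the item table
            // (e.g. WHERE MOD(item_id, threads) = slice) instead of fabricating docs.
            for (int i = slice; i < 1000; i += threads) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", i);
              doc.addField("description", "placeholder description " + i);
              batch.add(doc);
              if (batch.size() >= batchSize) {
                server.add(batch);   // send a whole batch in one request
                batch.clear();
              }
            }
            if (!batch.isEmpty()) server.add(batch);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    server.commit();   // one commit at the end rather than per batch
  }
}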
Re: Importing large datasets
On 3 Jun 2010, at 02:58, Dennis Gearon wrote: When adding data continuously, that data is available after committing and is indexed, right? Yes If so, how often is reindexing do some good? You should only need to reindex if the data changes or you change your schema. The DIH in solr 1.4 supports delta imports so you should only really be adding of updating (which is actually deleting and adding) items when necessary. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Andrzej Bialecki wrote: From: Andrzej Bialecki Subject: Re: Importing large datasets To: solr-user@lucene.apache.org Date: Wednesday, June 2, 2010, 4:52 AM On 2010-06-02 13:12, Grant Ingersoll wrote: On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: On 2010-06-02 12:42, Grant Ingersoll wrote: On Jun 1, 2010, at 9:54 PM, Blargy wrote: We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. SOLR-1301 is also an option if you are familiar with Hadoop ... If the bottleneck is the DB, will that do much? Nope. But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
RE: Solr Search problem; cannot search the existing word in the index content
Modify all maxFieldLength settings in solrconfig.xml and try again; by default solr will only index the first 10000 tokens of a field. Best Regards, Yandong -Original Message- From: Mint o_O! [mailto:mint@gmail.com] Sent: 2010年6月3日 13:58 To: solr-user@lucene.apache.org Subject: Re: Solr Search problem; cannot search the existing word in the index content Thanks for you advice. I did as you said and i still cannot search my content. One thing i notice here i can search for only the words within first 100 rows or maybe bigger than this not sure but not all. So is it the limitation of the index it self? When I create another sample content with only small amount of data. It's working great!!! My content is around 1.2M. I stored it as the text field as in the schema.xml sample file. Anyone has the same issue with me? thanks, Mint On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > backslash*rhode > \*rhode may work. > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > wrote: > > A couple of things: > > 1> try searching with &debugQuery=on attached to your URL, that'll > > give you some clues. > > 2> It's really worthwhile exploring the admin pages for a while, it'll > also > > give you a world of information. It takes a while to understand what the > > various pages are telling you, but you'll come to rely on them. > > 3> Are you really searching with leading and trailing wildcards or is > that > > just the mail changing bolding? Because this is tricky, very tricky. > Search > > the mail archives for "leading wildcard" to see lots of discussion of > this > > topic. > > > > You might back off a bit and try building up to wildcards if that's what > > you're doing > > > > HTH > > Erick > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > >> Hi, > >> > >> I'm working on the index/search project recently and i found solr which > is > >> very fascinating to me. > >> > >> I followed the test successful from the tutorial page. Starting up jetty > >> and > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar post.jar > >> *.xml*) so far so good at this stage. > >> > >> Now i have create my own testing westpac.xml file with real data I > intend > >> to > >> implement, putting in exampledocs and again ran the command > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > >> Everything went on very well however when i searched for "*rhode*" which > is > >> in the content. And Index returned nothing. > >> > >> Could anyone guide me what I did wrong why i couldn't search for that > word > >> even though that word is in my index content. > >> > >> thanks, > >> > >> Mint > >> > > > > > > -- > Lance Norskog > goks...@gmail.com >
Re: Solr Search problem; cannot search the existing word in the index content
Thanks for you advice. I did as you said and i still cannot search my content. One thing i notice here i can search for only the words within first 100 rows or maybe bigger than this not sure but not all. So is it the limitation of the index it self? When I create another sample content with only small amount of data. It's working great!!! My content is around 1.2M. I stored it as the text field as in the schema.xml sample file. Anyone has the same issue with me? thanks, Mint On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > backslash*rhode > \*rhode may work. > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > wrote: > > A couple of things: > > 1> try searching with &debugQuery=on attached to your URL, that'll > > give you some clues. > > 2> It's really worthwhile exploring the admin pages for a while, it'll > also > > give you a world of information. It takes a while to understand what the > > various pages are telling you, but you'll come to rely on them. > > 3> Are you really searching with leading and trailing wildcards or is > that > > just the mail changing bolding? Because this is tricky, very tricky. > Search > > the mail archives for "leading wildcard" to see lots of discussion of > this > > topic. > > > > You might back off a bit and try building up to wildcards if that's what > > you're doing > > > > HTH > > Erick > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > >> Hi, > >> > >> I'm working on the index/search project recently and i found solr which > is > >> very fascinating to me. > >> > >> I followed the test successful from the tutorial page. Starting up jetty > >> and > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar post.jar > >> *.xml*) so far so good at this stage. > >> > >> Now i have create my own testing westpac.xml file with real data I > intend > >> to > >> implement, putting in exampledocs and again ran the command > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > >> Everything went on very well however when i searched for "*rhode*" which > is > >> in the content. And Index returned nothing. > >> > >> Could anyone guide me what I did wrong why i couldn't search for that > word > >> even though that word is in my index content. > >> > >> thanks, > >> > >> Mint > >> > > > > > > -- > Lance Norskog > goks...@gmail.com >
Error loading class 'solr.HTMLStripStandardTokenizerFactory'
Hi, I'm trying to use the field collapsing feature. For that I need to take a checkout of the trunk and apply the patch available at https://issues.apache.org/jira/browse/SOLR-236 When I take a checkout and run the example-DIH, I get following error in browser on doing dataimport?command=full-import org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer:Error loading class 'solr.HTMLStripStandardTokenizerFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168) at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:904) at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:445) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:435) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:480) at org.apache.solr.schema.IndexSchema.(IndexSchema.java:122) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:429) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:286) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:198) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:123) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:662) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1250) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:467) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.mortbay.start.Main.invokeMain(Main.java:194) at org.mortbay.start.Main.start(Main.java:534) at org.mortbay.start.Main.start(Main.java:441) at org.mortbay.start.Main.main(Main.java:119) Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.HTMLStripStandardTokenizerFactory' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:403) at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:85) at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) ... 37 more Caused by: java.lang.ClassNotFoundException: solr.HTMLStripStandardTokenizerFactory at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:372) ... 40 more Because of this error I cannot proceed with applying the patch and trying out the field collapsing feature. Appreciate any help. Thanks, Terance. --
Re: Array of arguments in URL?
Ah! Thank you. On Wed, Jun 2, 2010 at 9:52 AM, Chris Hostetter wrote: > > : In the "/spell" declaration in the example solrconfig.xml, we find > : these lines among the default parameters: > > as grant pointed out: these aren't in the default params > > : How does one supply such an array of strings in HTTP parameters? Does > : Solr have a parsing option for this? > > in general, ignoring for a moment hte question of wether you are asking > about changing the component list in a param (you can't) and addressing > just the question of specifing an array of strings in HTTP params: if the > param supports multiple values, then you can specify multiple values just > be repeating hte key... > > q=foo&fq=firstValue&fq=secondValue&fq=thirdValue > > ...this results in a SolrParams instance where the "value" of "fq" is an > array of [firstValue, secondValue] > > > > > -Hoss > > -- Lance Norskog goks...@gmail.com
Re: Importing large datasets
Would dumping the databases to a local file help at all? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Erik Hatcher-4 wrote: > > One thing that might help indexing speed - create a *single* SQL query > to grab all the data you need without using DIH's sub-entities, at > least the non-cached ones. > > Erik > > On Jun 2, 2010, at 12:21 PM, Blargy wrote: > >> >> >> As a data point, I routinely see clients index 5M items on normal >> hardware >> in approx. 1 hour (give or take 30 minutes). >> >> Also wanted to add that our main entity (item) consists of 5 sub- >> entities >> (ie, joins). 2 of those 5 are fairly small so I am using >> CachedSqlEntityProcessor for them but the other 3 (which includes >> item_description) are normal. >> >> All the entites minus the item_description connect to datasource1. >> They >> currently point to one physical machine although we do have a pool >> of 3 DB's >> that could be used if it helps. The other entity, item_description >> uses a >> datasource2 which has a pool of 2 DB's that could potentially be >> used. Not >> sure if that would help or not. >> >> I might as well that the item description will have indexed, stored >> and term >> vectors set to true. >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html >> Sent from the Solr - User mailing list archive at Nabble.com. > > > I can't find any example of creating a massive sql query. Any out there? Will batching still work with this massive query? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Lance Norskog-2 wrote: > > Wait! You're fetching records from one database and then doing lookups > against another DB? That makes this a completely different problem. > > The DIH does not to my knowledge have the ability to "pool" these > queries. That is, it will not build a batch of 1000 keys from > datasource1 and then do a query against datasource2 with: > select foo where key_field IN (key1, key2,... key1000); > > This is the efficient way to do what you want. You'll have to write > your own client to do this. > > On Wed, Jun 2, 2010 at 12:00 PM, David Stuart > wrote: >> How long does it take to do a grab of all the data via SQL? I found by >> denormalizing the data into a lookup table meant that I was able to index >> about 300k rows of similar data size with dih regex spilting on some >> fields >> in about 8mins I know it's not quite the scale bit with batching... >> >> David Stuar >> >> On 2 Jun 2010, at 17:58, Blargy wrote: >> >>> >>> >>> One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. >>> >>> Not sure how much that would help. As I mentioned that without the item >>> description import the full process takes 4 hours which is bearable. >>> However >>> once I started to import the item description which is located on a >>> separate >>> machine/database the import process exploded to over 24 hours. >>> >>> -- >>> View this message in context: >>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Lance Norskog > goks...@gmail.com > Whats more efficient a batch size of 1000 or -1 for MySQL? Is this why its so slow because I am using 2 different datasources? Say I am using just one datasource should I still be seing "Creating a connection for entity " for each sub entity in the document or should it just be using one connection? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html Sent from the Solr - User mailing list archive at Nabble.com.
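Regarding the batchSize question above: as far as I understand, DIH's batchSize="-1" is a special case for MySQL that makes the JDBC driver stream rows one at a time instead of buffering the whole result set in memory. A rough plain-JDBC equivalent, assuming the MySQL Connector/J driver (the URL, credentials and query are placeholders), would be:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingFetch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost/items", "user", "password");   // placeholder connection
    // Forward-only, read-only statement with fetch size Integer.MIN_VALUE:
    // this is the Connector/J convention for row-by-row streaming, and is
    // roughly what DIH appears to do when batchSize is set to -1.
    Statement stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    ResultSet rs = stmt.executeQuery("SELECT item_id, description FROM item_description");
    while (rs.next()) {
      // hand each row to the indexer here
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}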
Re: Importing large datasets
That's promising!!! That's how I have been desigining my project. It must be all the joins that are causing the problems for him? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, David Stuart wrote: > From: David Stuart > Subject: Re: Importing large datasets > To: "solr-user@lucene.apache.org" > Date: Wednesday, June 2, 2010, 12:00 PM > How long does it take to do a grab of > all the data via SQL? I found by denormalizing the data into > a lookup table meant that I was able to index about 300k > rows of similar data size with dih regex spilting on some > fields in about 8mins I know it's not quite the scale bit > with batching... > > David Stuar > > On 2 Jun 2010, at 17:58, Blargy > wrote: > > > > > > > > >> One thing that might help indexing speed - create > a *single* SQL query > >> to grab all the data you need without using DIH's > sub-entities, at > >> least the non-cached ones. > >> > > > > Not sure how much that would help. As I mentioned that > without the item > > description import the full process takes 4 hours > which is bearable. However > > once I started to import the item description which is > located on a separate > > machine/database the import process exploded to over > 24 hours. > > > > --View this message in context: > > http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html > > Sent from the Solr - User mailing list archive at > Nabble.com. >
Re: Importing large datasets
When adding data continuously, that data is available after committing and is indexed, right? If so, how often is reindexing do some good? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Andrzej Bialecki wrote: > From: Andrzej Bialecki > Subject: Re: Importing large datasets > To: solr-user@lucene.apache.org > Date: Wednesday, June 2, 2010, 4:52 AM > On 2010-06-02 13:12, Grant Ingersoll > wrote: > > > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > > > >> On 2010-06-02 12:42, Grant Ingersoll wrote: > >>> > >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: > >>> > > We have around 5 million items in our > index and each item has a description > located on a separate physical database. > These item descriptions vary in > size and for the most part are quite > large. Currently we are only indexing > items and not their corresponding > description and a full import takes around > 4 hours. Ideally we want to index both our > items and their descriptions but > after some quick profiling I determined > that a full import would take in > excess of 24 hours. > > - How would I profile the indexing process > to determine if the bottleneck is > Solr or our Database. > >>> > >>> As a data point, I routinely see clients index > 5M items on normal > >>> hardware in approx. 1 hour (give or take 30 > minutes). > >>> > >>> When you say "quite large", what do you > mean? Are we talking books here or maybe a couple > pages of text or just a couple KB of data? > >>> > >>> How long does it take you to get that data out > (and, from the sounds of it, merge it with your item) w/o > going to Solr? > >>> > - In either case, how would one speed up > this process? Is there a way to run > parallel import processes and then merge > them together at the end? Possibly > use some sort of distributed computing? > >>> > >>> DataImportHandler now supports multiple > threads. The absolute fastest way that I know of to > index is via multiple threads sending batches of documents > at a time (at least 100). Often, from DBs one can > split up the table via SQL statements that can then be > fetched separately. You may want to write your own > multithreaded client to index. > >> > >> SOLR-1301 is also an option if you are familiar > with Hadoop ... > >> > > > > If the bottleneck is the DB, will that do much? > > > > Nope. But the workflow could be set up so that during night > hours a DB > export takes place that results in a CSV or SolrXML file > (there you > could measure the time it takes to do this export), and > then indexing > can work from this file. > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ > _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic > Web > ___|||__|| \| || | Embedded Unix, > System Integration > http://www.sigram.com Contact: info at sigram dot > com > >
Re: Importing large datasets
Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end giving me 100 times more than than within 2 years. Are there good references/books on using Solr/Lucen/(linux/nginx) for 500 million plus documents? The data is easily shardible geographially, as one given. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Grant Ingersoll wrote: > From: Grant Ingersoll > Subject: Re: Importing large datasets > To: solr-user@lucene.apache.org > Date: Wednesday, June 2, 2010, 3:42 AM > > On Jun 1, 2010, at 9:54 PM, Blargy wrote: > > > > > We have around 5 million items in our index and each > item has a description > > located on a separate physical database. These item > descriptions vary in > > size and for the most part are quite large. Currently > we are only indexing > > items and not their corresponding description and a > full import takes around > > 4 hours. Ideally we want to index both our items and > their descriptions but > > after some quick profiling I determined that a full > import would take in > > excess of 24 hours. > > > > - How would I profile the indexing process to > determine if the bottleneck is > > Solr or our Database. > > As a data point, I routinely see clients index 5M items on > normal > hardware in approx. 1 hour (give or take 30 minutes). > > > When you say "quite large", what do you mean? Are we > talking books here or maybe a couple pages of text or just a > couple KB of data? > > How long does it take you to get that data out (and, from > the sounds of it, merge it with your item) w/o going to > Solr? > > > - In either case, how would one speed up this process? > Is there a way to run > > parallel import processes and then merge them together > at the end? Possibly > > use some sort of distributed computing? > > DataImportHandler now supports multiple threads. The > absolute fastest way that I know of to index is via multiple > threads sending batches of documents at a time (at least > 100). Often, from DBs one can split up the table via > SQL statements that can then be fetched separately. > You may want to write your own multithreaded client to > index. > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > >
Re: Importing large datasets
Wait! You're fetching records from one database and then doing lookups against another DB? That makes this a completely different problem. The DIH does not to my knowledge have the ability to "pool" these queries. That is, it will not build a batch of 1000 keys from datasource1 and then do a query against datasource2 with: select foo where key_field IN (key1, key2,... key1000); This is the efficient way to do what you want. You'll have to write your own client to do this. On Wed, Jun 2, 2010 at 12:00 PM, David Stuart wrote: > How long does it take to do a grab of all the data via SQL? I found by > denormalizing the data into a lookup table meant that I was able to index > about 300k rows of similar data size with dih regex spilting on some fields > in about 8mins I know it's not quite the scale bit with batching... > > David Stuar > > On 2 Jun 2010, at 17:58, Blargy wrote: > >> >> >> >>> One thing that might help indexing speed - create a *single* SQL query >>> to grab all the data you need without using DIH's sub-entities, at >>> least the non-cached ones. >>> >> >> Not sure how much that would help. As I mentioned that without the item >> description import the full process takes 4 hours which is bearable. >> However >> once I started to import the item description which is located on a >> separate >> machine/database the import process exploded to over 24 hours. >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html >> Sent from the Solr - User mailing list archive at Nabble.com. > -- Lance Norskog goks...@gmail.com
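A minimal sketch of the "pooled lookup" Lance describes, assuming JDBC and made-up table and column names (items on datasource1, item_descriptions on datasource2). A real client would merge each returned description into the corresponding Solr document before sending the batch:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DescriptionLookup {
  // Fetch descriptions for one batch of item ids (e.g. 1000 at a time)
  // with a single IN (...) query against the second database.
  public static Map<Long, String> fetchDescriptions(Connection db2, List<Long> itemIds)
      throws Exception {
    StringBuilder sql = new StringBuilder(
        "SELECT item_id, description FROM item_descriptions WHERE item_id IN (");
    for (int i = 0; i < itemIds.size(); i++) {
      sql.append(i == 0 ? "?" : ",?");
    }
    sql.append(")");

    PreparedStatement ps = db2.prepareStatement(sql.toString());
    for (int i = 0; i < itemIds.size(); i++) {
      ps.setLong(i + 1, itemIds.get(i));
    }

    Map<Long, String> descriptions = new HashMap<Long, String>();
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
      descriptions.put(rs.getLong("item_id"), rs.getString("description"));
    }
    rs.close();
    ps.close();
    return descriptions;   // caller merges these into the item documents
  }
}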
Some basics
Hi, I'm new to SOLR and have some basic questions that will hopefully steer me in the right direction. - I want my search to "auto" spell check - that is, if someone types "restarant" I'd like the system to automatically search for restaurant. I've seen the SpellCheckComponent but that doesn't seem to have a simple way to automatically do the "near" type comparison. Is the SpellCheckComponent the wrong one or do I just need to manually handle the situation in my client code? - Also, what is the proper analyzer if I want a search for "thai food" or "thai restaurant" to actually match on Thai? I can't totally ignore words like food and restaurant, but I want to ignore more general terms and look for specific terms first (or I should say score them higher). Any tips on what I should be reading up on will be greatly appreciated. Thanks.
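On the first question, one common pattern is to handle the correction in client code: ask Solr for spellcheck suggestions alongside the normal query and, if the original query returned nothing, re-run the search with the collated suggestion. A rough SolrJ sketch of that idea, assuming a "/spell" request handler with the spellcheck component configured and a SolrJ version that exposes getSpellCheckResponse() (the URL and handler name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class AutoCorrectSearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("restarant");
    query.setQueryType("/spell");              // handler with the spellcheck component
    query.set("spellcheck", "true");
    query.set("spellcheck.collate", "true");   // ask for a whole corrected query

    QueryResponse rsp = server.query(query);
    if (rsp.getResults().getNumFound() == 0) {
      SpellCheckResponse spell = rsp.getSpellCheckResponse();
      if (spell != null && spell.getCollatedResult() != null) {
        // Re-run the search with the corrected query, e.g. "restaurant".
        QueryResponse corrected = server.query(new SolrQuery(spell.getCollatedResult()));
        System.out.println("Corrected hits: " + corrected.getResults().getNumFound());
      }
    }
  }
}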
Re: DataImportHandler and running out of disk space
: I ran through some more failure scenarios (scenarios and results below). The : concerning ones in my deployment are when data does not get updated, but the : DIH's .properties file does. I could only simulate that scenario when I ran : out of disk space (all all disk space issues behaved consistently). Is this : worthy of a JIRA issue? I don't know that it's DIH's responsibility to be specificly aware of disk space issues -- but it definitely sounds like a bug if Exceptions/Errors like running out of space (or file permissions errors) are occuring but DIH is still reporting success (and still updating hte properties file with the lsat updated timestamp) by all means: please open issues for these types of things. : Successful import : : all dates updated in .properties (title date updated, each [entity : name].last_index_time updated to its own update time. last_index_time set to : earliest entity update time) : : : : : Running out of disk space during import (in data directory only, conf : directory still has space) : : no data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during import (in both data directory and conf : directory) : : some data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during commit/optimize (in data directory only, : conf directory still has space) : : no data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during commit/optimize (in both data directory and : conf directory) : : no data updated, but dataimport.properties updated as in 1 : : : : : File permissions prevent writing (on index directory) : : data not updated, failure reported, properties file not updated : : : : : File permissions prevent writing (on segment files) : : data updated, failure reported, properties file not updated : : : : : File permissions prevent writing (on .properties file) : : data updated, failure reported, properties file not updated : : : : : Shutting down Solr during import (killing process) : : data not updated, .properties not updated, no result reported : : : : : Shutting down Solr during import (issuing shutdown message) : : Some data updated, .properties not updated, no result reported : : : : : DB connection lost (unplugging network cable) : : data not updated, .properties not updated, failure reported : : : : : Updating single entity fails (first one) : : data not updated, .properties not updated, failure reported : : : : : Updating single entity fails (after another one succeeds) : : data not updated, .properties not updated, failure reported : : : : : : -- : View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-and-running-out-of-disk-space-tp835125p835368.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss
Help in facet query
Hi, Can I restrict faceting to a subset of the result count? Example: A total of 100 documents were fetched for a given query x, and faceting worked on these 100 documents. I want faceting to work only on the first 10 documents fetched from query x. Regards, Sushan Rungta
Re: Luke browser does not show non-String Solr fields?
Thank you Chris. I'm clear now. I'll give Luke's latest version a try when it's out. On Wed, Jun 2, 2010 at 9:47 AM, Chris Hostetter wrote: > > : I see. It's still a little confusing to me but I'm fine as long as > : this is the expected behavior. I also tried the "example" index > : with data that come with the solr distribution and observe the > : same behavior - only String fields are displayed. So Lucene is > : sharing _some_ types with Solr but not all. It's still a bit puzzling > : to me that Lucene is not able to understand the simple types > : such as long. But I'm OK as long as there is a reason. Thanks > : for the explanations! > > The key is that there are *no* types in Lucene ... older > versions of Lucene only supported "Strin" and clinets that wanted to index > other types had to encode those types in some way as needed. More > recently lucene has started moving away from even dealing with Strings, > and towards just indexing/searching raw byte[] ... all concepts of "field > types" in Solr are specific to Solr > > (the caveat being that Lucene has, over the years, added utilities to help > people make smart choices about how to encode some data types -- and in > the case of the Trie numeric fields SOlr uses those utilites. But that > data isn't stored anywhere in the index files themselves, so Luke has no > way of knowing that it should attempt to "decode" the binary data of a > field using the Trie utilities. That said: aparently Andrzej is working > on making it possible to tell Luke "oh BTW, i indexed this field using > this solr fieldType" ... i think he said it was on the Luke trunk) > > > -Hoss
Re: Not able to access Solr Admin
When you access from another machine what message error do you get ? Check your remote access with Telnet to see if the server respond On Wed, Jun 2, 2010 at 10:26 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Thank you so much for the reply. > > I am using Jetty which comes with Solr installation. > > http://localhost:8983/solr/ > > The above URL works fine. > > The below URL does not work: > > http://177.44.9.119:8983/solr/ > > > -Original Message- > From: Abdelhamid ABID [mailto:aeh.a...@gmail.com] > Sent: Wednesday, June 02, 2010 5:07 PM > To: solr-user@lucene.apache.org > Subject: Re: Not able to access Solr Admin > > details... detailseverybody let's say details ! > > Which app server are you using ? > What is the error message that you get when trying to access solr admin > from > another machine ? > > > > On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < > murali.krishna.bond...@hmhpub.com> wrote: > > > Hi, > > > > I installed Solr Server on my machine and able to access with localhost. > I > > tried accessing from a different machine with IP Address but not able to > > access it. What do I need to do to be able to access the Solr instance > from > > any machine within the network? > > > > Thanks, > > Murali > > > > > > -- > Abdelhamid ABID > Software Engineer- J2EE / WEB > -- Abdelhamid ABID Software Engineer- J2EE / WEB
RE: Not able to access Solr Admin
Thank you so much for the reply. I am using Jetty which comes with Solr installation. http://localhost:8983/solr/ The above URL works fine. The below URL does not work: http://177.44.9.119:8983/solr/ -Original Message- From: Abdelhamid ABID [mailto:aeh.a...@gmail.com] Sent: Wednesday, June 02, 2010 5:07 PM To: solr-user@lucene.apache.org Subject: Re: Not able to access Solr Admin details... detailseverybody let's say details ! Which app server are you using ? What is the error message that you get when trying to access solr admin from another machine ? On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Hi, > > I installed Solr Server on my machine and able to access with localhost. I > tried accessing from a different machine with IP Address but not able to > access it. What do I need to do to be able to access the Solr instance from > any machine within the network? > > Thanks, > Murali > -- Abdelhamid ABID Software Engineer- J2EE / WEB
Re: Not able to access Solr Admin
details... details... everybody, let's say details! Which app server are you using? What is the error message that you get when trying to access solr admin from another machine? On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Hi, > > I installed Solr Server on my machine and able to access with localhost. I > tried accessing from a different machine with IP Address but not able to > access it. What do I need to do to be able to access the Solr instance from > any machine within the network? > > Thanks, > Murali > -- Abdelhamid ABID Software Engineer- J2EE / WEB
Not able to access Solr Admin
Hi, I installed Solr Server on my machine and able to access with localhost. I tried accessing from a different machine with IP Address but not able to access it. What do I need to do to be able to access the Solr instance from any machine within the network? Thanks, Murali
RE: Auto-suggest internal terms
I was interested in the same thing and stumbled upon this article: http://www.mattweber.org/2009/05/02/solr-autosuggest-with-termscomponent-and-jquery/ I haven't followed through, but it looked promising to me. Tim -Original Message- From: Jay Hill [mailto:jayallenh...@gmail.com] Sent: Wednesday, June 02, 2010 4:02 PM To: solr-user@lucene.apache.org Subject: Auto-suggest internal terms I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
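For reference, the TermsComponent approach from that article can also be driven from SolrJ rather than jQuery. A minimal sketch, assuming a "/terms" request handler is enabled and an indexed field named "suggest" exists (both are assumptions, not part of the original post). Note that this only returns terms beginning with the typed prefix, so it covers simple prefix auto-complete rather than the mid-phrase matching Jay describes:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsSuggest {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery();
    query.setQueryType("/terms");          // assumed handler exposing the TermsComponent
    query.set("terms", "true");
    query.set("terms.fl", "suggest");      // field to pull terms from
    query.set("terms.prefix", "wine");     // what the user has typed so far
    query.set("terms.limit", "10");

    QueryResponse rsp = server.query(query);
    // The matching terms and their document counts come back under the
    // "terms" section of the response; print the raw structure here.
    System.out.println(rsp.getResponse().get("terms"));
  }
}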
RE: Auto-suggest internal terms
I'm painfully new to Solr so please be gentle if my suggestion is terrible! Could you use highlighting to do this? Take the first n results from a query and show their highlights, customizing the highlights to show the desired number of words. Just a thought. Patrick -Original Message- From: Jay Hill [mailto:jayallenh...@gmail.com] Sent: Wednesday, June 02, 2010 4:02 PM To: solr-user@lucene.apache.org Subject: Auto-suggest internal terms I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
Auto-suggest internal terms
I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
Re: Query related question
: When I query for a word say Tiger woods, and sort results by score... i do : notice that the results are mixed up i.e first 5 results match Tiger woods : the next 2 match either tiger/tigers or wood/woods : the next 2 after that i notice again match tiger woods. : : How do i make sure that when searching for words like above i get all the : results matching whole search term first, followed by individual tokens like : tiger, woods later. for starters, you have to make sense of why exactly those docs are scoring that way -- this is what the param debugQuery=true is for -- look at the score explanations and see why those docs are scoring lower. My guess is that it's because of fieldNorms (ie: longer documents score lower with the same number of matches) but it could also be a term frequency factor (some documents contain "tiger" so many times they score high even w/o "woods") ... you have to understand why your docs score they way they do before you can come up with a general plan for how to change the scoring to better meet your goals. -Hoss
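If it helps, the debugQuery=true check Hoss mentions can also be done from SolrJ, and the per-document score explanations read back programmatically. A minimal sketch, assuming a SolrJ version that exposes getExplainMap() (the URL and query text are placeholders):

import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplainScores {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("tiger woods");
    query.set("debugQuery", "true");   // same as appending &debugQuery=true to the URL
    query.setRows(10);

    QueryResponse rsp = server.query(query);
    // One explanation string per returned document, keyed by uniqueKey;
    // it shows the fieldNorm, tf and idf factors behind each score.
    Map<String, String> explanations = rsp.getExplainMap();
    for (Map.Entry<String, String> e : explanations.entrySet()) {
      System.out.println(e.getKey() + " => " + e.getValue());
    }
  }
}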
Re: Importing large datasets
How long does it take to do a grab of all the data via SQL? I found that denormalizing the data into a lookup table meant that I was able to index about 300k rows of similar data size, with DIH regex splitting on some fields, in about 8 mins. I know it's not quite the same scale, but with batching... David Stuart On 2 Jun 2010, at 17:58, Blargy wrote: One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. Not sure how much that would help. As I mentioned that without the item description import the full process takes 4 hours which is bearable. However once I started to import the item description which is located on a separate machine/database the import process exploded to over 24 hours. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: minpercentage vs. mincount
thx for your reply! On 02.06.2010, at 20:27, Chris Hostetter wrote: > feel free to file a feature request -- truthfully this is kind of a hard > problem to solve in userland, you'd either have to do two queries (the > first to get the numFound, the second with facet.mincount set as an > integer relative numFound) or you'd have to do a single query but ask for > a "big" value for facet.limit and hope that you get enough to prune your > list. well i would probably implement it by just not setting a limit, and then just reducing the facets based on the numRows before sending the facets to the client (aka browser) > Off the top of my head though: i can't relaly think of a sane way to do > this on the server side that would work with distributed search either -- > but go ahead and open an issue and let's see what the folks who are really > smart about the distributed searching stuff have to say. ok i have created it: https://issues.apache.org/jira/browse/SOLR-1937 regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: minpercentage vs. mincount
: Obviously I could implement this in userland (like mincount for : that matter), but I wonder if anyone else sees use in being able to : define that a facet must match a minimum percentage of all documents in : the result set, rather than a hardcoded value? The idea being that while : I might not be interested in a facet that only covers 3 documents in the : result set if there are let's say 1000 documents in the result set, the : situation would be a lot different if I only have 10 documents in the : result set. typically people deal with this type of situation by using facet.limit to ensure they only get the "top" N constraints back -- and they set facet.mincount to something low just to save bandwidth if all the counts are "too low to care about no matter how few results there are" (ie: 0) : I did not yet see such a feature, would it make sense to file it as a : feature request or should stuff like this rather be done in userland (I : have noticed for example that Solr prefers to have users normalize the : scores in userland too)? feel free to file a feature request -- truthfully this is kind of a hard problem to solve in userland, you'd either have to do two queries (the first to get the numFound, the second with facet.mincount set as an integer relative to numFound) or you'd have to do a single query but ask for a "big" value for facet.limit and hope that you get enough to prune your list. Off the top of my head though: i can't really think of a sane way to do this on the server side that would work with distributed search either -- but go ahead and open an issue and let's see what the folks who are really smart about the distributed searching stuff have to say. -Hoss
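The two-query userland workaround Hoss describes might look roughly like this in SolrJ (the 5% threshold, facet field name and URL are arbitrary placeholders, not anything from the original thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MinPercentageFacet {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // First query: only used to find out how many documents match.
    SolrQuery countQuery = new SolrQuery("some query");
    countQuery.setRows(0);
    long numFound = server.query(countQuery).getResults().getNumFound();

    // Second query: translate the desired minimum percentage (here 5%)
    // into an absolute facet.mincount for this particular result set.
    int minCount = (int) Math.ceil(numFound * 0.05);
    SolrQuery facetQuery = new SolrQuery("some query");
    facetQuery.setFacet(true);
    facetQuery.addFacetField("category");       // placeholder facet field
    facetQuery.setFacetMinCount(minCount);
    facetQuery.setFacetLimit(20);

    QueryResponse rsp = server.query(facetQuery);
    System.out.println(rsp.getFacetField("category").getValues());
  }
}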
Re: Importing large datasets
> One thing that might help indexing speed - create a *single* SQL query > to grab all the data you need without using DIH's sub-entities, at > least the non-cached ones. > Not sure how much that would help. As I mentioned that without the item description import the full process takes 4 hours which is bearable. However once I started to import the item description which is located on a separate machine/database the import process exploded to over 24 hours. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Combining index and file spellcheck dictionaries
: Is it possible to combine index and file spellcheck dictionaries? off the top of my head -- i don't think so. however you could add special docs to your index, which only contain the "spell" field you use to build your spellcheck index, based on the contents of your dictionary file. -Hoss
Re: Array of arguments in URL?
: In the "/spell" declaration in the example solrconfig.xml, we find : these lines among the default parameters: as grant pointed out: these aren't in the default params : How does one supply such an array of strings in HTTP parameters? Does : Solr have a parsing option for this? in general, ignoring for a moment hte question of wether you are asking about changing the component list in a param (you can't) and addressing just the question of specifing an array of strings in HTTP params: if the param supports multiple values, then you can specify multiple values just be repeating hte key... q=foo&fq=firstValue&fq=secondValue&fq=thirdValue ...this results in a SolrParams instance where the "value" of "fq" is an array of [firstValue, secondValue] -Hoss
Re: Luke browser does not show non-String Solr fields?
: I see. It's still a little confusing to me but I'm fine as long as : this is the expected behavior. I also tried the "example" index : with data that come with the solr distribution and observe the : same behavior - only String fields are displayed. So Lucene is : sharing _some_ types with Solr but not all. It's still a bit puzzling : to me that Lucene is not able to understand the simple types : such as long. But I'm OK as long as there is a reason. Thanks : for the explanations! The key is that there are *no* types in Lucene ... older versions of Lucene only supported "String" and clients that wanted to index other types had to encode those types in some way as needed. More recently lucene has started moving away from even dealing with Strings, and towards just indexing/searching raw byte[] ... all concepts of "field types" in Solr are specific to Solr (the caveat being that Lucene has, over the years, added utilities to help people make smart choices about how to encode some data types -- and in the case of the Trie numeric fields Solr uses those utilities. But that data isn't stored anywhere in the index files themselves, so Luke has no way of knowing that it should attempt to "decode" the binary data of a field using the Trie utilities. That said: apparently Andrzej is working on making it possible to tell Luke "oh BTW, i indexed this field using this solr fieldType" ... i think he said it was on the Luke trunk) -Hoss
Re: Importing large datasets
One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. Erik On Jun 2, 2010, at 12:21 PM, Blargy wrote: As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). Also wanted to add that our main entity (item) consists of 5 sub- entities (ie, joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them but the other 3 (which includes item_description) are normal. All the entites minus the item_description connect to datasource1. They currently point to one physical machine although we do have a pool of 3 DB's that could be used if it helps. The other entity, item_description uses a datasource2 which has a pool of 2 DB's that could potentially be used. Not sure if that would help or not. I might as well that the item description will have indexed, stored and term vectors set to true. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Luke browser does not show non-String Solr fields?
I see. It's still a little confusing to me but I'm fine as long as this is the expected behavior. I also tried the "example" index with data that come with the solr distribution and observe the same behavior - only String fields are displayed. So Lucene is sharing _some_ types with Solr but not all. It's still a bit puzzling to me that Lucene is not able to understand the simple types such as long. But I'm OK as long as there is a reason. Thanks for the explanations! On Tue, Jun 1, 2010 at 10:38 AM, Chris Hostetter wrote: > > : So it seems like Luke does not understand Solr's long type. This > : is not a native Lucene type? > > No, Lucene has no concept of "types" ... there are utilities to help encode > some data in special ways (particularly numbers) but the underlying lucene > index doesn't keep track of when/how you do this -- so Luke has no way of > knowing what "type" the field is. > > Schema information is specific to Solr. > > > -Hoss > >
Re: Importing large datasets
As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). Also wanted to add that our main entity (item) consists of 5 sub-entities (ie, joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them but the other 3 (which includes item_description) are normal. All the entities minus the item_description connect to datasource1. They currently point to one physical machine although we do have a pool of 3 DB's that could be used if it helps. The other entity, item_description, uses datasource2, which has a pool of 2 DB's that could potentially be used. Not sure if that would help or not. I might as well add that the item description will have indexed, stored and term vectors set to true. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Andrzej Bialecki wrote: > > On 2010-06-02 12:42, Grant Ingersoll wrote: >> >> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >> >>> >>> We have around 5 million items in our index and each item has a >>> description >>> located on a separate physical database. These item descriptions vary in >>> size and for the most part are quite large. Currently we are only >>> indexing >>> items and not their corresponding description and a full import takes >>> around >>> 4 hours. Ideally we want to index both our items and their descriptions >>> but >>> after some quick profiling I determined that a full import would take in >>> excess of 24 hours. >>> >>> - How would I profile the indexing process to determine if the >>> bottleneck is >>> Solr or our Database. >> >> As a data point, I routinely see clients index 5M items on normal >> hardware in approx. 1 hour (give or take 30 minutes). >> >> When you say "quite large", what do you mean? Are we talking books here >> or maybe a couple pages of text or just a couple KB of data? >> >> How long does it take you to get that data out (and, from the sounds of >> it, merge it with your item) w/o going to Solr? >> >>> - In either case, how would one speed up this process? Is there a way to >>> run >>> parallel import processes and then merge them together at the end? >>> Possibly >>> use some sort of distributed computing? >> >> DataImportHandler now supports multiple threads. The absolute fastest >> way that I know of to index is via multiple threads sending batches of >> documents at a time (at least 100). Often, from DBs one can split up the >> table via SQL statements that can then be fetched separately. You may >> want to write your own multithreaded client to index. > > SOLR-1301 is also an option if you are familiar with Hadoop ... > > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > > I haven't worked with Hadoop before but I'm willing to try anything to cut down this full import time. I see this currently uses the embedded solr server for indexing... would I have to scrap my DIH importing then? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
As a data point, I routinely see clients index 5M items on normal > hardware in approx. 1 hour (give or take 30 minutes). Our master solr machine is running 64-bit RHEL 5.4 on a dedicated machine with 4 cores and 16G ram so I think we are good on the hardware. Our DB is MySQL version 5.0.67 (exact stats I don't know off the top of my head) When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? Our item descriptions are very similar to an ebay listing and can include HTML. We are talking about a couple of pages of text. How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? I'll have to get back to you on that one. DataImportHandler now supports multiple threads. When you say "now", what do you mean? I am running version 1.4. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100) Is there a wiki explaining how this multiple thread process works? Which batch size would work best? I am currently using a -1 batch size. You may want to write your own multithreaded client to index. This sounds like a viable option. Can you point me in the right direction on where to begin (what classes to look at, prior examples, etc)? Here is my field type I am using for the item description. Maybe it's not the best? Here is an overview of my data-config.xml. Thoughts? ... I appreciate the help. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html Sent from the Solr - User mailing list archive at Nabble.com.
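A minimal sketch of such a multithreaded client using SolrJ 1.4's StreamingUpdateSolrServer, which queues documents and sends them from several background threads (the Solr URL, JDBC connection, SQL and field names are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Queue size 100, 4 sender threads - tune both for your hardware.
    SolrServer solr = new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
    Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/items", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM item");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      doc.addField("description", rs.getString("description"));
      batch.add(doc);
      if (batch.size() >= 100) {     // send batches of at least 100 documents
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();
    rs.close(); stmt.close(); conn.close();
  }
}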
Re: Many Tomcat Processes on Server ?!?!?
okay you are right. thats all threads and no processes ... but so many ? :D hehe so when all the "processes" are threads i think its okay so ?! i can ignore this ... XD -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p865008.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrException: No such core
Solr is used to manage lists of indexes. We have a database containing documents of different types. Each document type is defined by a list of properties and we want to associate some of these properties with lists of indexes to help users during query. For example: a property that contains a text field "desc" may be associated with a Solr field "desc_en_items". "desc_en_items" is a Solr dynamic field, and so on for each property associated with a Solr field. Each Solr document contains a Solr identifier (stored and indexed) and dynamic fields (only indexed). When adding a document to our database, if needed, we dynamically generate the Solr document and add it to solr. When a document is deleted from our database we systematically delete the corresponding solr document with "deleteById" (the document may not exist in solr). There is only one core (Core0) and the server is embedded. We use a derived lucli/LuceneMethods.java to browse the index. It seems to me, without being sure, that the problem comes when no list is set (solr is started but contains no records) after a few days of operation. A database with parameterized lists has worked for several months without problem. Here are the wrappers around ...solrj.SolrServer:

[code]
public class SolrCoreServer {
  private static Logger log = LoggerFactory.getLogger(SolrCoreServer.class);
  private SolrServer server = null;

  public SolrCoreServer(CoreContainer container, String coreName) {
    server = new EmbeddedSolrServer( container, coreName );
  }

  protected SolrServer getSolrServer() {
    return server;
  }

  public void cleanup() throws SolrServerException, IOException {
    log.debug("cleanup()");
    UpdateResponse rsp = server.deleteByQuery( "*:*" );
    log.debug("cleanup():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("cleanup() failed status=" + rsp.getStatus());
  }

  public void add(SolrInputDocument doc) throws SolrServerException, IOException {
    log.debug("add(" + doc + ")");
    UpdateResponse rsp = server.add(doc);
    log.debug("add():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("add() failed status=" + rsp.getStatus());
  }

  public void add(Collection docs) throws SolrServerException, IOException {
    log.debug("add(" + docs + ")");
    UpdateResponse rsp = server.add(docs);
    log.debug("add():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("add() failed status=" + rsp.getStatus());
  }

  public void deleteById(String docId) throws SolrServerException, IOException {
    log.debug("deleteById(" + docId + ")");
    UpdateResponse rsp = server.deleteById(docId);
    log.debug("deleteById():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("deleteById() failed status=" + rsp.getStatus());
  }

  public void commit() throws SolrServerException, IOException {
    log.debug("commit()");
    UpdateResponse rsp = server.commit();
    log.debug("commit():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("commit() failed status=" + rsp.getStatus());
  }

  public void addAndCommit(Collection docs) throws SolrServerException, IOException {
    log.debug("addAndCommit(" + docs + ")");
    UpdateRequest req = new UpdateRequest();
    req.setAction( UpdateRequest.ACTION.COMMIT, false, false );
    req.add( docs );
    UpdateResponse rsp = req.process( server );
    log.debug("addAndCommit():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("addAndCommit() failed status=" + rsp.getStatus());
  }

  public QueryResponse query( SolrQuery query ) throws SolrServerException {
    log.debug("query(" + query + ")");
    QueryResponse qr = server.query( query );
    log.debug("query():" + qr.getStatus());
    return qr;
  }

  public QueryResponse query( String queryString, String sortField, SolrQuery.ORDER order, Integer maxRows ) throws SolrServerException {
    log.debug("query(" + queryString + ")");
    SolrQuery query = new SolrQuery();
    query.setQuery( queryString );
    query.addSortField( sortField, order );
    query.setRows(maxRows);
    QueryResponse qr = server.query( query );
    log.debug("query():" + qr.getStatus());
    return qr;
  }
}
[/code]

the schema [code]
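For reference, a minimal sketch of how the CoreContainer handed to this wrapper can be built with the Solr 1.4-era embedded API (the core name "Core0" is from the message above; the solr home path, field names and values are illustrative assumptions):

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class SolrCoreServerUsage {
  public static void main(String[] args) throws Exception {
    // Point at the solr home directory whose solr.xml defines "Core0".
    System.setProperty("solr.solr.home", "/path/to/solr/home");
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer container = initializer.initialize();

    SolrCoreServer server = new SolrCoreServer(container, "Core0");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");                       // the stored + indexed identifier
    doc.addField("desc_en_items", "some description"); // dynamic, indexed-only field
    server.add(doc);
    server.commit();

    container.shutdown();   // release the embedded core cleanly
  }
}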
Re: nested querries, and LocalParams syntax
Thanks Yonik. I guess the confusing thing is if the lucene query parser (for nested queries) does backslash escaping, and the LocalParams also does backslash escaping when you have a nested query with local params, with quotes at both places... the inner scope needs... double escaping? That gets really confusing fast. [ Yeah, I recognize that using parameter dereferencing can avoid this; I'm trying to see if I can make my code flexible enough to work either way]. Maybe using single vs double quotes is the answer. Let's try one out and see: [Query un-uri escaped for clarity:] _query_:"{!dismax q.alt=' \"a phrase search \" '} \"another phrase search\" " [ Heh, getting that into a ruby string to uri escape it is a pain, but we end up with: ] &q="_query_%3A%7B%21dismax+q.alt%3D%27%5C%22a+phrase+search%5C%22%27%7D+%5C%22another+phrase+search%5C%22 Which, um, I _think_ is working, although the debugQuery=true isn't telling me much, I don't entirely understand it. Have to play around with it more. But it looks like maybe a fine strategy is to use double quotes for the nested query itself, use single quotes for the LocalParam values, and then simply singly escape any single or double quotes inside the LocalParam values. Jonathan Yonik Seeley wrote: Hmmm, well, the lucene query parser does basic backslash escaping, and so does local params within quoted strings. You can also use parameter dereferencing to avoid the need to escape values too. Like you pointed out, using single quotes in some places can also help. But instead of me trying to give you tons of examples that you probably already understand, start from the assumption that things will work, and if you come across something that doesn't make sense (or doesn't work), I can help with that. Or if you give a single real example as a general pattern, perhaps we could help figure out the simplest way to avoid most of the escaping. -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 6:21 PM, Jonathan Rochkind wrote: I am just trying to figure it out mostly, the particular thing I am trying to do is a very general purpose mapper to complex dismax nested queries. I could try to explain it, and we could go back and forth for a while, and maybe I could convince you it makes sense to do what I'm trying to do. But mostly I'm just exploring at this point, so I can get a sense of what is possible. So it would be super helpful if someone can help me figure out escaping stuff and skip the other part, heh. But basically, it's a mapper from a "CQL" query (a structured language for search-engine-style queries) to Solr, where some of the "fields" searched aren't really Solr fields/indexes, but aggregated definitions of dismax query params including multiple solr fields, where exactly what solr fields and other dismax queries will not be hard-coded, but will be configurable. Thus the use of nested queries. So since it ends up so general purpose and abstract, and many of the individual parameters are configurable, thus my interest in figuring out proper escaping. Jonathan Yonik Seeley wrote: It's not clear if you're just trying to figure it all out, or get something specific to work. If you can give a specific example, we might be able to suggest easier ways to achieve it rather than going escape crazy :-) -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 5:06 PM, Jonathan Rochkind wrote: Thanks, the pointer to that documentation page (which somehow I had missed), as well as Chris's response is very helpful.
The one thing I'm still not sure about, which I might be able to figure out through trial-and-error reverse engineering, is escaping issues when you combine nested queries WITH local params. We potentially have a lot of levels of quotes: q= URIescape(_local_="{!dismax qf=" value that itself contains a \" quote mark"} "phrase query"" ) Whole bunch of quotes going on there. How do I give this to Solr so all my quotes will end up parsed appropriately? Obviously that above example isn't right. We've got the quotes around the _local_ nested query, then we've got quotes around a LocalParam value, then we've got quotes that might be IN the actual literal value of the LocalParam, or quotes that might be in the actual literal value of the nested query. Maybe using single quotes in some places but double quotes in others will help, for certain places that can take single or double quotes? Thanks very much for any advice, I get confused thinking about this. Jonathan Chris Hostetter wrote: In addition to yonik's point about the LocalParams wiki page (and please let us know if you aren't sure of the answers to any of your questions after reading it) I wanted to clear up one thing... : Let's start with that not-nested query example. Can you in fact use it as : above, to force dismax handling of the 'q' even if the qt or request handler Quick side
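One way to sidestep most of that escaping, following Yonik's parameter-dereferencing suggestion, is to keep the nested query itself free of literal quotes and pass the values through separate request parameters (a sketch; the field and parameter names are illustrative):

q=_query_:"{!dismax qf=$item_qf v=$item_q}"
item_qf=title^2 description
item_q="a phrase search"

Each parameter is then URI-escaped on its own, so no quote ever has to be escaped inside another quoted string.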
Re: Many Tomcat Processes on Server ?!?!?
Le 02-juin-10 à 16:57, stockii a écrit : all the process in in htop show, have a own PID. so thats are no threads ? No, you can't say that. In general it is sufficient for the "mother process" to be killed but it can take several attempts. i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? after bin/shutdown.sh it is very common to me that some hanging threads remain... and we crafted my little script snippet (which is kind of specific) to actually prevent this and kill... after a while only. it's not optimal. paul smime.p7s Description: S/MIME cryptographic signature
RE: Many Tomcat Processes on Server ?!?!?
Try shutting tomcat down instead of restarting. If processes remain, then I'd say further investigation is warranted. If no processes remain, then I think it's safe to disregard unless you notice any problems. -Original Message- From: stockii [mailto:st...@shopgate.com] Sent: Wednesday, June 02, 2010 10:57 AM To: solr-user@lucene.apache.org Subject: Re: Many Tomcat Processes on Server ?!?!? all the process in in htop show, have a own PID. so thats are no threads ? i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Many Tomcat Processes on Server ?!?!?
all the process in in htop show, have a own PID. so thats are no threads ? i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PHP output at a multiValued AND dynamicField
On 02.06.2010 16:42, Jörg Agatz wrote: > i don't understand what you mean! > Then you should ask more precisely.
Re: Many Tomcat Processes on Server ?!?!?
Am 02.06.2010 16:39, schrieb Paul Libbrecht: > This is impressive, I had this in any Linux I've been using: SuSE, > Ubuntu, Debian, Mandrake, ... > Maybe there's some modern JDK with a modern Linux where it doesn't happen? > It surely is not one process per thread though. I'm not a linux thread expert, but from what I know Linux doesn't know lightweight threads as other systems do. Instead it uses processes for that. But these processes aren't "top level" processes that show up in top/ps. Instead, they're grouped hierarchically (AFAIK). Otherwise you would be able to kill single user threads with their own process id, or kill the main process and let the spawned threads continue. That would be totally crazy. In my configuration, Tomcat doesn't shut down correctly if I call bin/shutdown.sh, so I have to kill the process manually. I don't know why. This might be the reason why stockii has 3 Tomcat processes running.
Re: PHP output at a multiValued AND dynamicField
i don't understand what you mean!
Re: Many Tomcat Processes on Server ?!?!?
This is impressive, I had this in any Linux I've been using: SuSE, Ubuntu, Debian, Mandrake, ... Maybe there's some modern JDK with a modern Linux where it doesn't happen? It surely is not one process per thread though. paul Le 02-juin-10 à 16:29, Michael Kuhlmann a écrit : Am 02.06.2010 16:13, schrieb Paul Libbrecht: Is your server Linux? In this case this is very normal.. any java application spawns many new processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds. smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
oha... "ps aux" shows only 3 processes from tomcat55. but why does htop show 55 ? doesn't the garbage collector close these ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864849.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Many Tomcat Processes on Server ?!?!?
Maybe he was looking at the output from top or htop? -Original Message- From: Michael Kuhlmann [mailto:michael.kuhlm...@zalando.de] Sent: Wednesday, June 02, 2010 10:29 AM To: solr-user@lucene.apache.org Subject: Re: Many Tomcat Processes on Server ?!?!? Am 02.06.2010 16:13, schrieb Paul Libbrecht: > Is your server Linux? > In this case this is very normal.. any java application spawns many new > processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds.
Re: PHP output at a multiValued AND dynamicField
On 02.06.2010 16:15, Jörg Agatz wrote: > yes i done.. but i dont know how i get the information out of the big > Array... They're simply the keys of a single response array.
Re: Many Tomcat Processes on Server ?!?!?
Am 02.06.2010 16:13, schrieb Paul Libbrecht: > Is your server Linux? > In this case this is very normal.. any java application spawns many new > processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds.
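A quick way to check on Linux whether those htop entries are threads of a single JVM or genuinely separate processes (the grep pattern is illustrative):

ps aux | grep tomcat     # one line per process
ps -eLf | grep tomcat    # one line per thread (LWP column); dozens of lines for one JVM is normal

In htop, pressing H toggles the display of userland threads, which is usually why it appears to list far more "processes" than ps aux does.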
Re: Many Tomcat Processes on Server ?!?!?
You'd need to search explanations for this at generic java forums. It's the same with any java process on Linux. In the Unix family Solaris and MacOSX do it better, fortunately and is probably due to the very old time where the Linux java was a translation of the Solaris java with the special features implemented when it was not found in Linux (e.g. green-threads). paul Le 02-juin-10 à 16:21, stockii a écrit : yes, its a Linux... Debian System. when i running a import. only 2-3 tomcat processes are running. the other doing nothing ... thats what is strange for me .. ^^ -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html Sent from the Solr - User mailing list archive at Nabble.com. smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
yes, its a Linux... Debian System. when i running a import. only 2-3 tomcat processes are running. the other doing nothing ... thats what is strange for me .. ^^ -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: RIA sample and minimal JARs required to embed Solr
Glad to hear someone looking at Solr not just as web enabled search engine, but as a simpler/more powerful interface to Lucene! When you download the source code, look at the Chapter 8 "Crawler" project, specifically "Indexer.java", it demonstrates how to index into both a traditional separate Solr process and how to fire up an embedded Solr. It is remarkably easy to interact with an embedded Solr! In terms of minimal dependencies, what you need for a standalone Solr (outside of the servlet container like Tomcat/Jetty) is what you need for an embedded Solr. Eric On May 29, 2010, at 9:32 PM, Thomas J. Buhr wrote: > Solr, > > The Solr 1.4 EES book arrived yesterday and I'm very much enjoying it. I was > glad to see that "rich clients" are one case for embedding Solr as this is > the case for my application. Multi Cores will also be important for my RIA. > > The book covers a lot and makes it clear that Solr has extensive abilities. > There is however no clean and simple sample of embedding Solr in a RIA in the > book, only a few alternate language usage samples. Is there a link to a Java > sample that simply embeds Solr for local indexing and searching using Multi > Cores? > > Also, what kind of memory footprint am I looking at for embedding Solr? What > are the minimal dependancies? > > Thom - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
Re: PHP output at a multiValued AND dynamicField
yes i done.. but i dont know how i get the information out of the big Array... All fields like P_VIP_ADR_*
Re: Many Tomcat Processes on Server ?!?!?
Is your server Linux? In this case this is very normal.. any java application spawns many new processes on linux... it's not exactly bound to threads unfortunately. And, of course, they all refer to the same invocation path. paul Le 02-juin-10 à 15:59, stockii a écrit : Hello. Our Server is a 8-Core Server with 12 GB RAM. Solr is running with 4 Cores. 55 Tomcat 5.5 processes are running. ist this normal ??? htop show me a list of these processes of the server. and tomcat have about 55. every process using: /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/ bootstrap.jar. is this normal ? smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
My guess would be that commons-daemon is somehow thinking that Tomcat has gone down and started up multiple copies... You only need one Tomcat process for your 4 core Solr instance! You may have many other WAR applications hosted in Tomcat, I know a lot of places would have 1 tomcat per deployed WAR pattern. On Jun 2, 2010, at 9:59 AM, stockii wrote: > > Hello. > > Our Server is a 8-Core Server with 12 GB RAM. > Solr is running with 4 Cores. > > 55 Tomcat 5.5 processes are running. ist this normal ??? > > htop show me a list of these processes of the server. and tomcat have about > 55. > every process using: > /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar. > > is this normal ? > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html > Sent from the Solr - User mailing list archive at Nabble.com. - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
Re: PHP output at a multiValued AND dynamicField
You probably should try the php or phps response writer - it'll likely make your PHP integration easier. Erik On Jun 2, 2010, at 9:50 AM, Jörg Agatz wrote: Hallo Users... I have a Problem... In my SolR, i have a lot of multiValued, dynamicFields and now i must print ther Fields in php.. But i dont know how... In schema.xml: stored="true"/> stored="true"/> stored="true"/> stored="true"/> stored="true"/> output from Solr: A201005311740560002.xml NO A201005311740560002 2010-05-31 17:40:56 − Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml D Leichlingen 42799 Schlo� Eicherhof ADRESS KYETG201005311740560002 I don now ha is the name of the Fields, so i dont know how i get the name to "printr" it in PHP Maby someone of you has a answer of the problem? King
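As a sketch (the host and query are illustrative), the php and phps response writers are selected with the wt parameter:

http://localhost:8983/solr/select?q=*:*&wt=phps

The phps body can be read directly with PHP's unserialize(), and the field names of each returned document - including dynamic ones like P_VIP_ADR_* - are then simply the array keys.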
Many Tomcat Processes on Server ?!?!?
Hello. Our Server is a 8-Core Server with 12 GB RAM. Solr is running with 4 Cores. 55 Tomcat 5.5 processes are running. ist this normal ??? htop show me a list of these processes of the server. and tomcat have about 55. every process using: /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar. is this normal ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Array of arguments in URL?
You CAN easily turn spellchecking on or off, or set the spellcheck dictionary, in request parameters. So there's really no need, that I can think of, to try to actually add or remove the spellcheck component in request parameters; you could just leave it turned off in your default parameters, but turn it on in request parameters when you want it. With &spellcheck=true&spellcheck.dictionary=whatever. But I suspect you weren't really asking about the spellcheck component, but in general, or perhaps for some other specific purpose? I don't think there's any "general" way to pass an array to request parameters. Request parameters that take list-like data structures tend to use whitespace to separate the elements instead, to allow you to pass them as request parameters. For instance dismax df, pf, etc fields, elements ordinarily separated by newlines when seen in a solrconfig.xml as default params, can also be separated simply by spaces in an actual URL too. (newlines in the URL might work too, never tried it, spaces more convenient for an actual URL). From: Grant Ingersoll [gsi...@gmail.com] On Behalf Of Grant Ingersoll [gsing...@apache.org] Sent: Wednesday, June 02, 2010 6:28 AM To: solr-user@lucene.apache.org Subject: Re: Array of arguments in URL? Those aren't in the default parameters. They are config for the SearchHandler itself. On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote: > In the "/spell" declaration in the example solrconfig.xml, we find > these lines among the default parameters: > > > spellcheck > > > How does one supply such an array of strings in HTTP parameters? Does > Solr have a parsing option for this? > > -- > Lance Norskog > goks...@gmail.com
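As a sketch of the spellcheck case (the dictionary name is illustrative), a request to a handler that has the spellcheck component configured might switch it on and pass a whitespace-separated dismax qf list like this:

/select?q=ipod&spellcheck=true&spellcheck.dictionary=default&defType=dismax&qf=title%5E2+description

(the space between the qf entries is URL-encoded as +, and ^ as %5E).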
PHP output at a multiValued AND dynamicField
Hallo Users... I have a Problem... In my SolR, i have a lot of multiValued, dynamicFields and now i must print ther Fields in php.. But i dont know how... In schema.xml: output from Solr: A201005311740560002.xml NO A201005311740560002 2010-05-31 17:40:56 − Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml D Leichlingen 42799 Schlo� Eicherhof ADRESS KYETG201005311740560002 I don now ha is the name of the Fields, so i dont know how i get the name to "printr" it in PHP Maby someone of you has a answer of the problem? King
Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Thanks Geert-Jan. Didn't know about this trick. On Wed, Jun 2, 2010 at 5:39 PM, Geert-Jan Brits wrote: > Hi Ninad, > > SolrQuery q = new SolrQuery(); > q.setQuery("*:*"); > q.setFacet(true); > q.set("facet.date", "pub"); > q.set("facet.date.start", "2000-01-01T00:00:00Z") > ... etc. > > basically you can completely build your entire query with the 'raw' set (and > add) methods. > The specific methods are just helpers. > > So this is the same as above: > > SolrQuery q = new SolrQuery(); > q.set("q","*:*"); > q.set("facet","true"); > q.set("facet.date", "pub"); > q.set("facet.date.start", "2000-01-01T00:00:00Z") > ... etc. > > > Geert-Jan > > 2010/6/2 Ninad Raut > > > Hi, > > > > I want to hit the query given below : > > > > > > > ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR > > > > using SolrJ. I am browsing the net but not getting any clues about how > > should I approach it. How can the SolrJ API be used to create the above mentioned > > Query. > > > > Regards, > > Ninad R > > >
Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Hi Ninad, SolrQuery q = new SolrQuery(); q.setQuery("*:*"); q.setFacet(true); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z") ... etc. basically you can completely build your entire query with the 'raw' set (and add) methods. The specific methods are just helpers. So this is the same as above: SolrQuery q = new SolrQuery(); q.set("q","*:*"); q.set("facet","true"); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z") ... etc. Geert-Jan 2010/6/2 Ninad Raut > Hi, > > I want to hit the query given below : > > > ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR > > using SolrJ. I am browsing the net but not getting any clues about how > should I approach it. How can the SolrJ API be used to create the above mentioned > Query. > > Regards, > Ninad R >
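Put into a small complete program (the server URL is an illustrative assumption; the pub field and date range are from the question), the first variant might look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateFacetSketch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery();
    q.setQuery("*:*");
    q.setFacet(true);
    q.set("facet.date", "pub");
    q.set("facet.date.start", "2000-01-01T00:00:00Z");
    q.set("facet.date.end", "2010-01-01T00:00:00Z");
    q.set("facet.date.gap", "+1YEAR");   // SolrJ URL-encodes the + for you
    QueryResponse rsp = server.query(q);
    // Dump the raw facet_counts section of the response.
    System.out.println(rsp.getResponse().get("facet_counts"));
  }
}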
Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Hi, I want to hit the query given below : ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR using SolrJ. I am browsing the net but not getting any clues about how I should approach it. How can the SolrJ API be used to create the above mentioned query? Regards, Ninad R
Re: Importing large datasets
On 2010-06-02 13:12, Grant Ingersoll wrote: > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > >> On 2010-06-02 12:42, Grant Ingersoll wrote: >>> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >>> We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. >>> >>> As a data point, I routinely see clients index 5M items on normal >>> hardware in approx. 1 hour (give or take 30 minutes). >>> >>> When you say "quite large", what do you mean? Are we talking books here or >>> maybe a couple pages of text or just a couple KB of data? >>> >>> How long does it take you to get that data out (and, from the sounds of it, >>> merge it with your item) w/o going to Solr? >>> - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? >>> >>> DataImportHandler now supports multiple threads. The absolute fastest way >>> that I know of to index is via multiple threads sending batches of >>> documents at a time (at least 100). Often, from DBs one can split up the >>> table via SQL statements that can then be fetched separately. You may want >>> to write your own multithreaded client to index. >> >> SOLR-1301 is also an option if you are familiar with Hadoop ... >> > > If the bottleneck is the DB, will that do much? > Nope. But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Importing large datasets
On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > On 2010-06-02 12:42, Grant Ingersoll wrote: >> >> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >> >>> >>> We have around 5 million items in our index and each item has a description >>> located on a separate physical database. These item descriptions vary in >>> size and for the most part are quite large. Currently we are only indexing >>> items and not their corresponding description and a full import takes around >>> 4 hours. Ideally we want to index both our items and their descriptions but >>> after some quick profiling I determined that a full import would take in >>> excess of 24 hours. >>> >>> - How would I profile the indexing process to determine if the bottleneck is >>> Solr or our Database. >> >> As a data point, I routinely see clients index 5M items on normal >> hardware in approx. 1 hour (give or take 30 minutes). >> >> When you say "quite large", what do you mean? Are we talking books here or >> maybe a couple pages of text or just a couple KB of data? >> >> How long does it take you to get that data out (and, from the sounds of it, >> merge it with your item) w/o going to Solr? >> >>> - In either case, how would one speed up this process? Is there a way to run >>> parallel import processes and then merge them together at the end? Possibly >>> use some sort of distributed computing? >> >> DataImportHandler now supports multiple threads. The absolute fastest way >> that I know of to index is via multiple threads sending batches of documents >> at a time (at least 100). Often, from DBs one can split up the table via >> SQL statements that can then be fetched separately. You may want to write >> your own multithreaded client to index. > > SOLR-1301 is also an option if you are familiar with Hadoop ... > If the bottleneck is the DB, will that do much?
Re: Importing large datasets
On 2010-06-02 12:42, Grant Ingersoll wrote: > > On Jun 1, 2010, at 9:54 PM, Blargy wrote: > >> >> We have around 5 million items in our index and each item has a description >> located on a separate physical database. These item descriptions vary in >> size and for the most part are quite large. Currently we are only indexing >> items and not their corresponding description and a full import takes around >> 4 hours. Ideally we want to index both our items and their descriptions but >> after some quick profiling I determined that a full import would take in >> excess of 24 hours. >> >> - How would I profile the indexing process to determine if the bottleneck is >> Solr or our Database. > > As a data point, I routinely see clients index 5M items on normal > hardware in approx. 1 hour (give or take 30 minutes). > > When you say "quite large", what do you mean? Are we talking books here or > maybe a couple pages of text or just a couple KB of data? > > How long does it take you to get that data out (and, from the sounds of it, > merge it with your item) w/o going to Solr? > >> - In either case, how would one speed up this process? Is there a way to run >> parallel import processes and then merge them together at the end? Possibly >> use some sort of distributed computing? > > DataImportHandler now supports multiple threads. The absolute fastest way > that I know of to index is via multiple threads sending batches of documents > at a time (at least 100). Often, from DBs one can split up the table via SQL > statements that can then be fetched separately. You may want to write your > own multithreaded client to index. SOLR-1301 is also an option if you are familiar with Hadoop ... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Importing large datasets
On Jun 1, 2010, at 9:54 PM, Blargy wrote: > > We have around 5 million items in our index and each item has a description > located on a separate physical database. These item descriptions vary in > size and for the most part are quite large. Currently we are only indexing > items and not their corresponding description and a full import takes around > 4 hours. Ideally we want to index both our items and their descriptions but > after some quick profiling I determined that a full import would take in > excess of 24 hours. > > - How would I profile the indexing process to determine if the bottleneck is > Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? > - In either case, how would one speed up this process? Is there a way to run > parallel import processes and then merge them together at the end? Possibly > use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Array of arguments in URL?
Those aren't in the default parameters. They are config for the SearchHandler itself. On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote: > In the "/spell" declaration in the example solrconfig.xml, we find > these lines among the default parameters: > > > spellcheck > > > How does one supply such an array of strings in HTTP parameters? Does > Solr have a parsing option for this? > > -- > Lance Norskog > goks...@gmail.com
Re: logic for auto-index
You need to schedule your task. Check out the schedulers available in all programming languages. http://www.findbestopensource.com/tagged/job-scheduler Regards Aditya www.findbestopensource.com On Wed, Jun 2, 2010 at 2:39 PM, Jonty Rhods wrote: > Hi Peter, > > actually I want the index process should start automatically. right now I > am > doing mannually. > same thing I want to start indexing when less load on server i.e. late > night. So setting auto will fix my > problem.. > > On Wed, Jun 2, 2010 at 2:00 PM, Peter Karich wrote: > > > Hi Jonty, > > > > what is your specific problem? > > You could use a cronjob or the Java-lib called quartz to automate this > > task. > > Or did you mean replication? > > > > Regards, > > Peter. > > > > > Hi All, > > > > > > I am very new to solr as well as java too. > > > I require to use solrj for indexing also require to index automatically > once > > > in 24 hour. > > > I wrote java code for indexing now I want to do further coding for > > automatic > > > process. > > > Could you suggest or give me sample code for automatic index process.. > > > please help.. > > > > > > with regards > > > Jonty. > > > > > >
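If the indexing code is already written with solrj, one plain-JDK alternative to cron or Quartz is the java.util.concurrent scheduler; a minimal sketch (runFullIndex() is a placeholder for the existing indexing code, and the 6-hour initial delay is just an illustration of starting in a low-traffic window):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NightlyIndexer {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    Runnable indexJob = new Runnable() {
      public void run() {
        runFullIndex();   // call the existing SolrJ indexing code here
      }
    };
    // First run after 6 hours, then once every 24 hours.
    scheduler.scheduleAtFixedRate(indexJob, 6, 24, TimeUnit.HOURS);
  }

  static void runFullIndex() {
    // existing indexing logic goes here
  }
}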
RE: DIH, Full-Import, DB and Performance.
my batchSize is -1 and the load is too big for us. why should I increase it ? what is a normal server load ? our server is a fast server. 4 cores, 3 GB RAM, but we don't want a server load of over 2 when an index run starts. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-Full-Import-DB-and-Performance-tp861068p864297.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Question
What analyzer are you using to index and search? Check out schema.xml. You are currently using an analyzer which breaks up the words. If you don't want them broken up then you need to use a non-tokenized type such as "string" (or a KeywordTokenizer-based type). Regards Aditya www.findbestopensource.com On Wed, Jun 2, 2010 at 2:41 PM, M.Rizwan wrote: > Hi, > > I have solr 1.4. In schema i have a field called "title" of type "text" > Now problem is, when I search for "Test_Title" it brings all documents with > titles like "Test-Title", "Test_Title", "Test,Title", "Test Title", > "Test.Title" > What to do to avoid this? > > "Test_Title" should only return documents having title "Test_Title" > > Any idea? > > Thanks > > - Riz >
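A sketch of the schema.xml change this implies (the field name is from the question; "string" is the non-tokenized StrField type from the example schema):

<field name="title" type="string" indexed="true" stored="true"/>

With a string field, Test_Title is indexed as a single exact token, so only exact (or wildcard/prefix) matches will find it; the word-level matching that the text type gives is lost.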
Query Question
Hi, I have solr 1.4. In the schema I have a field called "title" of type "text". Now the problem is, when I search for "Test_Title" it brings back all documents with titles like "Test-Title", "Test_Title", "Test,Title", "Test Title", "Test.Title" What to do to avoid this? "Test_Title" should only return documents having title "Test_Title" Any idea? Thanks - Riz
Re: logic for auto-index
Hi Peter, actually I want the index process to start automatically. Right now I am doing it manually. Similarly, I want to start indexing when there is less load on the server, i.e. late at night. So setting it to auto will fix my problem.. On Wed, Jun 2, 2010 at 2:00 PM, Peter Karich wrote: > Hi Jonty, > > what is your specific problem? > You could use a cronjob or the Java-lib called quartz to automate this > task. > Or did you mean replication? > > Regards, > Peter. > > > Hi All, > > > > I am very new to solr as well as java too. > > I require to use solrj for indexing also require to index automatically > once > > in 24 hour. > > I wrote java code for indexing now I want to do further coding for > automatic > > process. > > Could you suggest or give me sample code for automatic index process.. > > please help.. > > > > with regards > > Jonty. > >
Re: logic for auto-index
Hi Jonty, what is your specific problem? You could use a cronjob or the Java-lib called quartz to automate this task. Or did you mean replication? Regards, Peter. > Hi All, > > I am very new to solr as well as java too. > I require to use solrj for indexing also require to index automatically once > in 24 hour. > I wrote java code for indexing now I want to do further coding for automatic > process. > Could you suggest or give me sample code for automatic index process.. > please help.. > > with regards > Jonty. >
logic for auto-index
Hi All, I am very new to solr as well as java too. I require to use solrj for indexing also require to index automatically once in 24 hour. I wrote java code for indexing now I want to do further coding for automatic process. Could you suggest or give me sample code for automatic index process.. please help.. with regards Jonty.