Why does Solr (1.4.1) keep so many Tokenizer objects?

2012-09-08 Thread T. Kuro Kurosaka

While investigating a bug, I found that Solr keeps many Tokenizer objects.

This experimental 80-core Solr 1.4.1 system runs on Tomcat. It was 
continuously sent indexing requests in parallel and eventually died 
with an OutOfMemoryError.
A heap dump taken by the JVM shows there were 14477 Tokenizer 
objects, or about 180 Tokenizer objects per core, at the time it died.
Each core's schema.xml has only 5 fields that use this Tokenizer, so 
I'd expect at most 5 Tokenizers per indexing thread to be needed.
Tomcat in its default configuration can run up to 200 threads, so at 
most 1000 Tokenizer objects should be enough.
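For reference, the back-of-the-envelope arithmetic above works out as follows (a sketch using only the figures quoted in this post; the 5-per-thread ceiling is the poster's assumption):

```python
# Sanity check of the Tokenizer counts quoted above (all figures from the post).
cores = 80
observed_total = 14477                      # Tokenizer objects in the heap dump
observed_per_core = observed_total / cores  # roughly 181 per core

tokenizer_fields_per_core = 5               # schema.xml fields using this Tokenizer
tomcat_max_threads = 200                    # Tomcat's default maxThreads
expected_max = tokenizer_fields_per_core * tomcat_max_threads  # 1000

print(round(observed_per_core), expected_max)  # observed is ~14x the expected cap
```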


My colleague ran a similar experiment on a 10-core Solr 3.6 system and 
observed fewer Tokenizer objects there, but there were still 48 
Tokenizers per core.

Why does Solr keep this many Tokenizer objects?

Kuro



SolrCloud vs SolrReplication

2012-09-08 Thread thaihai
Hi All,

I'm a little bit confused about the new cloud functionality.

Some questions:

1) Is it possible to use old-style Solr replication in Solr 4 (that is, not
using SolrCloud and not starting with ZooKeeper parameters)?

2) In our production environment we use Solr 3.6 with Solr replication. We
have 1 index (master) server and 2 front (slave) servers. One web app uses both
front servers for searching; another application pushes index requests to the
index server. That application has queueing, so we don't need HA here.
When we make index (schema) changes or need to scratch and reindex the whole
index, we follow this scenario:
 1 remove replication for both front servers 
 2 scratch the index server
 3 reindex the index server
 4 remove front server 1 from the web app (at this point the web app uses only front 2
for searches)
 5 scratch front 1
 6 enable front 1 replication
 7 test front server 1 with searches via the Solr admin UI on front 1 
 8 if all is correct, enable front 1 for the web app
 9 repeat from step 4 for the second slave

So, my problem is: how do I achieve the same functionality with SolrCloud?

Suppose I have a cluster of 2 shards with replicas. How can I do a complete
reindex with no effect on the web app during the indexing process? I also
want to check the rebuild before I approve the new index for the web app.

Any ideas or tips?

(Sorry for the bad English.)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-vs-SolrReplication-tp4006327.html
Sent from the Solr - User mailing list archive at Nabble.com.


ConcurrentModificationException - SolrCmdDistributor

2012-09-08 Thread Balaji Gandhi
Hi,

I am trying to implement a multi-threaded version of DIH (Solr 4). It runs
successfully with a single Solr node, but with more than one node I get a
ConcurrentModificationException at SolrCmdDistributor.java:223. Has anyone
faced this issue? Please let me know.

Thanks,
Balaji

-- 
:: Daring ideas are like chessmen moved forward; they may be beaten, but
they may start a winning game. - Johann Wolfgang von Goethe ::


Re: Solr 4: Private master, public slave?

2012-09-08 Thread Erick Erickson
These are really unrelated. Presumably you have some
program that accesses your system of record, that you
want to keep private. No problem, that program (SolrJ?)
is accessing your private data and sending the SolrInputDocuments
to the cloud-based Solr program for searching.

Or perhaps I don't understand the problem at all...

Best
Erick

On Fri, Sep 7, 2012 at 11:43 AM, Alexandre Rafalovitch
 wrote:
> Hello,
>
> I have a bunch of documents that I would like to index on a local
> server behind the firewall. But then, the actual search will happen on
> a public infrastructure (Amazon, etc). The documents themselves are
> not quite public, so I want just the index content (indexed, not
> stored) being available outside the firewall.
>
> Is that something that is doable with Solr Cloud or index copying, etc?
>
> Regards,
>Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)


Re: Re: Schema model to store additional field metadata

2012-09-08 Thread Erick Erickson
You might be confusing indexing and storing. When you
specify indexed="true" in your field definition, the input
is tokenized, transformed, etc., and the results of this
(see the admin/analysis page) are what is searched.

But when you specify stored="true", a literal, verbatim
copy of the text is put in a distinct file, and when you
return data (e.g. fl=field1, field2...) then the verbatim
copy is returned.

If you specify both indexed="true" and stored="true", both
things happen, but they're entirely separate operations
even though they're on the same field.

So, let's assume you want to provide links to the images.
Having a field (multiValued?) with indexed="false" and stored="true"
would allow you to store all the img urls in a single field.
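A minimal schema.xml sketch of such a field (the field name here is made up for illustration):

```xml
<!-- Hypothetical stored-only field: returned verbatim with the doc, never searched -->
<field name="image_urls" type="string" indexed="false" stored="true" multiValued="true"/>
```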

All that said, now it's up to your application layer that
constructs the pages for presentation to the user to
"do the right thing" with the returned fields to allow
images (or whatever) to be displayed.

Best
Erick

On Fri, Sep 7, 2012 at 12:03 PM,   wrote:
>> Why would you store the actual images in SOLR?
>
> No, the images are files on the filesystem. Only the path to the image should 
> be stored in Solr.
>
>> And you are most likely looking at dynamic fields as the solution
>>
>> 1) Define *_Path, *_Size, *_Alt as a dynamic field with appropriate types
>> 2) During indexing, write those properties as Image_1_Path,
>> Image_1_Size, Image_1_Alt or some such
>> 3) Make sure that whatever search algorithm you have looks at those or
>> do a copyField to aggregate them into AllImage_Alt, etc.
>
> I was also thinking of a solution with dynamic fields, but I am very new to 
> Solr and I am not sure if it is a good solution to solve this modelling 
> issue. For example I thought about introducing two multiValued dynamic fields 
> (image_src_*, image_alt_*) and store image data like file path on disc and 
> alt-attribute like this:
>
> title: An article about Foo and Bar
> content:   This is some text about Foo and Bar.
> published: 2012.09.07T19:23
> image_src_1: 2012/09/foo.png
> image_alt_1: Foo. Waiting for the bus.
> image_src_2: 2012/04/images/bar.png
> image_src_3: 2012/02/abc.png
> image_alt_3: Foo and Bar at the beach
>
> Of course the alt attribute for some images could be missing. I don't know if 
> this is a good or better solution for this. It feels clumsy to me, like a 
> workaround. But maybe this is the way to model this data; I don't know?
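For reference, the dynamic-field approach could look roughly like this in schema.xml (field names and types are illustrative, matching the example above; AllImage_Alt is the aggregate field suggested earlier in the thread):

```xml
<!-- Illustrative dynamic fields for the image metadata sketched above -->
<dynamicField name="image_src_*" type="string" indexed="false" stored="true"/>
<dynamicField name="image_alt_*" type="text_general" indexed="true" stored="true"/>
<!-- Optional: aggregate all alt texts into one searchable field -->
<copyField source="image_alt_*" dest="AllImage_Alt"/>
```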


Re: Solr search not working after copying a new field to an existing Indexed Field

2012-09-08 Thread Erick Erickson
Solr does a complete delete and re-add; there's no way
to do a partial update.

When you add a doc with the same unique key as an old doc,
the data associated with the first version of the doc is entirely
thrown away. It's as though you'd never indexed it at all;
the second version completely replaces it.
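As a rough analogy (plain Python, not the Solr API), the overwrite behaves like a map keyed on the uniqueKey:

```python
# Toy model of Solr's overwrite-by-uniqueKey semantics: adding a doc with
# an existing id completely replaces the old doc; fields are not merged.
index = {}

def add_doc(doc):
    index[doc["id"]] = doc  # whole-document replace, like Solr's add

add_doc({"id": "1", "title": "first", "body": "old text"})
add_doc({"id": "1", "title": "second"})  # same key: old doc is discarded

print(index["1"])  # {'id': '1', 'title': 'second'} -- no 'body' field survives
```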

Does that help?

Best
Erick

On Fri, Sep 7, 2012 at 12:54 PM, Mani  wrote:
> yes..I do have this uniquekey defined properly.
>
> id
>
>
> Before the schema change...
> 
> 
>
>
> After the schema change...
> 
> 
> 
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-search-not-working-after-copying-a-new-field-to-an-existing-Indexed-Field-tp4005993p4006217.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud vs SolrReplication

2012-09-08 Thread Erick Erickson
See inline

On Sat, Sep 8, 2012 at 1:09 AM, thaihai  wrote:
> Hi All,
>
> im little bit confussed about the new cloud functinalities.
>
> some questions:
>
> 1) its possible to use the old style solrreplication in solr4 (it means not
> using solrcloud. not starting with zk params) ?
>

Yes. If you use SolrCloud (the Zookeeper options), you don't need to
set up replication. But if you don't use SolrCloud it's just like it was
in 3.x.


> 2) in our production-environment we use solr 3.6 with solrreplication. we
> have 1 index server und 2 front (slave) server. one webapp use the both
> front-server for searching. another application push index-requests to the
> index-server. the app have queueing. so we dont must have HA here.
> if we make index (schema) changes or need to scratch and reeindex the whole
> index we have do following szenario:
>  1 remove replication for both front-server
>  2 scratch index server
>  3 reeindex index server
>  4 remove front 1 server from web app (at this point webapp use only front2
> for searches)
>  5 scratch front 1
>  6 enable front 1 replication
>  7 test front 1 server with searches over lucene admin ui on front 1
>  8 if all correct, enable front 1 for web app
>  9 done all with second slave at point 4
>
> so, my problem is to do the same functionality with solr cloud ?
>
> supposed, i have a 2 shared with replicas cluster. how can i make a complete
> re-eindex with no affects for the web app during the index process ? and i
> will check the rebuild before i approve the new index to the web app. ???
>
> any ideas or tips ?
>
> sorry for the bad english
>
>

I'm not entirely sure about this, meaning I haven't done it personally. But
I think you can do this...

Let's take the simple 2-shard case, each shard with a leader and a replica.
Take one machine out of each slice (or use two other machines). Make your schema
changes and re-index to these non-user-facing machines. Together they now form
a complete new two-shard index.

Now point your user traffic to these new indexes (they are SolrCloud machines).
Then simply scratch your old machines and bring them up in the same cluster as
the two new machines, and SolrCloud will automatically
1> assign them as replicas of your two shards appropriately
2> synchronize the index (it'll automatically use old-style replication
   to do the bulk synchronization; you don't have to configure anything)
3> route searches to the new replicas as appropriate.

You really have to forget most of what you know about Solr replication when
moving to the Solr Cloud world, it's all magic ...

Best
Erick

>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-vs-SolrReplication-tp4006327.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: JRockit with SOLR3.4/3.5

2012-09-08 Thread Snehal Chennuru
I am running into a similar issue with Lucene 3.6 which I believe is used in
Solr 3.4.

Following is the exception stack trace:

2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:) Exception in thread
"Load thread" java.lang.OutOfMemoryError: classblock allocation, 2814576
loaded, 2816K footprint, in check_alloc
(src/jvm/model/classload/classalloc.c:215).

Attempting to allocate 4320M bytes

There is insufficient native memory for the Java
Runtime Environment to continue.

Possible reasons:
  The system is out of physical RAM or swap space
  In 32 bit mode, the process size limit was hit

Possible solutions:
  Reduce memory load on the system
  Increase physical memory or swap space
  Check if swap backing store is full
  Use 64 bit Java on a 64 bit OS
  Decrease Java heap size (-Xmx/-Xms)
  Decrease number of Java threads
  Decrease Java thread stack sizes (-Xss)
  Disable compressed references (-XXcompressedRefs=false)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:) 
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.lang.ClassLoader.defineClass1(Native Method)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.lang.ClassLoader.defineClass(ClassLoader.java:615)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.net.URLClassLoader.access$000(URLClassLoader.java:58)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.net.URLClassLoader$1.run(URLClassLoader.java:197)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.net.URLClassLoader.findClass(URLClassLoader.java:190)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.lang.ClassLoader.loadClass(ClassLoader.java:306)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
java.lang.ClassLoader.loadClass(ClassLoader.java:247)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.spi.ThrowableInformation.getThrowableStrRep(ThrowableInformation.java:87)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.spi.LoggingEvent.getThrowableStrRep(LoggingEvent.java:413)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:313)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:276)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.Category.callAppenders(Category.java:206)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.Category.forcedLog(Category.java:391)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
org.apache.log4j.Category.warn(Category.java:1060)
2012-09-08 18:08:56,341 WARN  [STDERR] (Load thread:)   at
com.teneo.esa.common.util.Logger.printStack(Logger.java:584)

JVM details: 
JRockit R28.2.3, -Xms4320m, -Xmx4320m. 

The code in question is trying to fetch all the documents, one at a time,
from the index. Could that cause problems? 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/JRockit-with-SOLR3-4-3-5-tp3995148p4006389.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fail to huge collection extraction

2012-09-08 Thread neosky
I am sorry, but I don't get your point. Would you explain a little more?
I am still struggling with this problem. It sometimes seems to crash for no
apparent reason. Even when I reduce to 5000 records per request it can fail,
yet sometimes it works well with 1 per page.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fail-to-huge-collection-extraction-tp4003559p4006399.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fail to huge collection extraction

2012-09-08 Thread Alexandre Rafalovitch
I think the point here is the question about your use of the data.

If you want to show it to the client, then you are unlikely to need
details of more than one screenful of records (e.g. 10). When the user goes
to another screen, you rerun the query and specify values 11-20, etc.
Solr does not have a problem rerunning complex queries and returning
a different subset of the results.

On the other hand, if you are not presenting this to the user directly
and do need all records at once, perhaps you should not be pulling all
record details from Solr, but just use it for search. That is, let
Solr return just the primary keys of the matches; you can then send
a request to a dedicated database with the list of IDs. Databases and
their drivers are specifically designed to stream results without
crashing or timing out. Solr is a search system and is not
perfect as a retrieval system or primary system of record (though it
is getting there slowly).
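As a sketch of the paging side (the base URL and key field name are placeholders; the /select parameters start, rows, and fl are standard Solr request parameters), a client would page through results asking only for the keys:

```python
from urllib.parse import urlencode

# Build paged Solr /select URLs that fetch only the primary keys,
# leaving full-record retrieval to a separate database lookup.
def solr_page_url(base, query, page, page_size=10, key_field="id"):
    params = {
        "q": query,
        "fl": key_field,          # return only the unique key
        "start": page * page_size,
        "rows": page_size,
        "wt": "json",
    }
    return base + "/select?" + urlencode(params)

url = solr_page_url("http://localhost:8983/solr/collection1", "foo", page=2)
print(url)  # contains start=20&rows=10, i.e. the third screenful of keys
```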

Hope this helps.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sat, Sep 8, 2012 at 11:20 PM, neosky  wrote:
> I am sorry that I can't get your point. Would you explain a little more?
> I am still struggling with this problem. It seems crash by no meaning
> sometimes. Even I reduce to 5000 records each time, but sometimes it works
> well with 1 per page.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Fail-to-huge-collection-extraction-tp4003559p4006399.html
> Sent from the Solr - User mailing list archive at Nabble.com.


zkcli command line util

2012-09-08 Thread JesseBuesking
I was trying to use the command-line util to update my ZooKeeper instance
with config files from a machine running Solr, but I'm getting the following
error:

Could not find or load main class org.apache.solr.cloud.ZkCLI

The command I'm trying to execute is

java -classpath
/usr/share/solr/apache-solr-4.0.0-BETA/example/solr-webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -cmd upconfig -confdir
/usr/share/solr/collection-01/conf/ -confname defaultConfig -zkhost : -solrhome /usr/share/solr/

(I actually have a host:port set)

When I dig into example/solr-webapp, I actually don't see any
subdirectories.  Did something change in version 4.0.0-BETA?  What should
the appropriate classpath be?
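One possible explanation (an assumption based on the Solr 4.x example layout, not verified against 4.0.0-BETA specifically): example/solr-webapp is only populated when the bundled Jetty unpacks solr.war on first startup, so a classpath pointing there is empty until the example has been run once. A sketch of that workaround (paths and host:port are placeholders):

```sh
# Unverified sketch: run the example once so Jetty unpacks the war,
# then point the classpath at the unpacked WEB-INF/lib directory.
cd /usr/share/solr/apache-solr-4.0.0-BETA/example
java -jar start.jar                # first run unpacks solr.war; stop it afterwards
java -classpath "solr-webapp/webapp/WEB-INF/lib/*" \
  org.apache.solr.cloud.ZkCLI -cmd upconfig \
  -confdir /usr/share/solr/collection-01/conf/ \
  -confname defaultConfig -zkhost host:port
```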

Any help would be much appreciated!

- Jesse



--
View this message in context: 
http://lucene.472066.n3.nabble.com/zkcli-command-line-util-tp4006403.html
Sent from the Solr - User mailing list archive at Nabble.com.


Cloud terminology clarification

2012-09-08 Thread JesseBuesking
It's been a while since the terminology at
http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm
wondering how these terms apply to solr cloud setups.

My take on what the terms mean:

Collection: Basically the highest level container that bundles together the
other pieces for servicing a particular search setup
Core: An individual solr instance (represents entire indexes)
Shard: A portion of a core (represents a subset of an index)

Therefore:
- increasing the number of shards allows for indexing more documents (aka
scaling the amount of data that can be indexed)
- increasing the number of cores increases the potential throughput of
requests (aka cores mirror each other allowing you to distribute requests to
multiple servers)

Does this sound right?

If so, then my follow up question would be does the following directory
structure look right/standard?

.../solr # = solr home
.../solr/collection-01
.../solr/collection-01/core-01
.../solr/collection-01/core-02

And if this is right, I'm on a roll :D

My next question would then be:
Given we're using zookeeper (separate machine), do we need 1 conf folder at
collection-01's level?  Or do we need 1 conf folder per core?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Cloud-terminology-clarification-tp4006407.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: SolrCloud vs SolrReplication

2012-09-08 Thread Zhang, Lisheng
Hi Erick,

You mentioned that "it'll automatically use old-style
replication to do the bulk synchronization" in solr
cloud, so it uses HTTP for replication as in 3.6, does
this mean the synchronization in solrCloud is not real 
time (has to have some delays)?

Thanks very much for helps, Lisheng
