Re: ReplicationFactor for solrcloud

2013-09-11 Thread Aloke Ghoshal
Hi Aditya,

You need to start another 6 instances (9 instances in total) to
achieve this. The first 3 instances, as you mention, are already
assigned to the 3 shards. The next 3 will become their replicas,
followed by the final 3 as the second set of replicas.

You could create two more copies of the example folder and start each
one on a different Jetty port. See:
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster

Regards,
Aloke
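
For illustration, a minimal SolrJ sketch (ZooKeeper address, collection and
config names are assumptions) of asking the Collections API for 3 shards x 3
replicas explicitly, rather than relying on the numShards/replicationFactor
passed when the first node starts:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateThreeByThree {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost:2181");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "mycollection");          // assumed collection name
        params.set("numShards", 3);
        params.set("replicationFactor", 3);
        params.set("maxShardsPerNode", 3);           // 9 cores spread over 3 nodes
        params.set("collection.configName", "myconf"); // assumed config name
        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");       // Collections API endpoint
        System.out.println(server.request(request));
        server.shutdown();
    }
}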

On 9/12/13, Aditya Sakhuja  wrote:
> Hi -
>
> I am trying to set up 3 shards and 3 replicas for my solrcloud deployment
> with 3 servers, specifying replicationFactor=3 and numShards=3 when
> starting the first node. I see each of the servers allocated to one shard
> each; however, I do not see 3 replicas allocated on each node.
>
> I specifically need to have 3 replicas across 3 servers with 3 shards. Do
> you see any reason not to have this configuration?
>
> --
> Regards,
> -Aditya Sakhuja
>


Re: charset encoding

2013-09-11 Thread Andreas Owen
No, Jetty; and yes, for Tomcat I've seen a couple of answers.

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

> Using tomcat by any chance? The ML archive has the solution. May be on
> Wiki, too.
> 
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Sep 11, 2013 8:56 AM, "Andreas Owen"  wrote:
> 
>> I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are
>> ISO-8859-1 (ANSI) encoded, and the meta tag "content-encoding" says so as
>> well. The server HTTP header says it's UTF-8, and the Firefox Web Developer
>> toolbar agrees.
>>
>> When I index a page with special chars like ä, ö, ü, Solr outputs them as
>> completely foreign signs, not the usual wrong chars with 1/4 or the flag
>> symbol in them. So it seems it's not simply the normal UTF-8/ISO-8859-1
>> discrepancy. Has anyone got an idea what's wrong?
>> 
>> 



create a core with explicit node_name

2013-09-11 Thread YouPeng Yang
Hi Solr users,

   I want to create a core with an explicit node_name through the API
CloudSolrServer.query(SolrParams params).
  For example:

  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set("qt", "/admin/cores");
  params.set("action", "CREATE");
  params.set("name", newcore.getName());
  params.set("shard", newcore.getShardname());
  params.set("collection.configName", newcore.getCollectionconfigname());
  params.set("schema", newcore.getSchemaXMLFilename());
  params.set("config", newcore.getSolrConfigFilename());
  params.set("coreNodeName", newcore.getCorenodename());
  params.set("node_name", "10.7.23.124:8080_solr");
  params.set("collection", newcore.getCollectionname());

 The newcore object encapsulates the creation properties for the core.


  It does not seem to work: as a result, the core gets created on a different node.

  Do I need to send the params directly to that specific server (10.7.23.124
here) instead of going through CloudSolrServer.query(SolrParams params)?
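
For what it's worth, a minimal SolrJ sketch of that direct approach, with the
core, collection and shard names as assumptions, using CoreAdminRequest against
an HttpSolrServer pointed at the specific node:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCoreOnSpecificNode {
    public static void main(String[] args) throws Exception {
        // Point at the node that should host the core, not at the cluster as a whole.
        HttpSolrServer node = new HttpSolrServer("http://10.7.23.124:8080/solr");
        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("newcore");          // assumed core name
        create.setInstanceDir("newcore");       // assumed instance dir
        create.setCollection("mycollection");   // assumed collection name
        create.setShardId("shard1");            // assumed shard
        System.out.println(create.process(node));
        node.shutdown();
    }
}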




regards


Storing/indexing speed drops quickly

2013-09-11 Thread Per Steffensen

Hi

SolrCloud 4.0: 6 machines, quad-core, 8GB RAM, 1T disk, one Solr node on 
each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread one 
doc at a time, full speed (they always have a new doc to store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection, things are fine wrt storing/indexing 
speed for the first two to three hours (100M docs per hour); then speed 
drops dramatically to a level that is unacceptable for us (max 10M per 
hour). At the same time as the speed drops, we see that I/O wait 
increases dramatically. I am not 100% sure, but a quick investigation has 
shown that this is due to almost constant merging.


What to do about this problem?
I know that you can play around with mergeFactor and commit rate, but 
earlier tests show that this does not really do the job - it might 
postpone the point where the problem occurs, but basically it is just a 
matter of time before merging exhausts the system.
Is there a way to avoid merging entirely and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data amounts become big? (I guess that without merging you end 
up with lots and lots of "small" files, and that this is not good for 
search response time.)


Regards, Per Steffensen


DataImportHandler oddity

2013-09-11 Thread Raymond Wiker
I'm trying to index a view in an Oracle database, and have come across some
strange behaviour: all the VARCHAR2 fields are being returned as empty
strings; this also applies to a datetime field converted to a string via
TO_CHAR, and the url field built by concatenating two constant strings and
a numeric field converted via TO_CHAR.

If I cast the columns to CHAR(N), I get values back, but this is not
an acceptable workaround (the maximum length of CHAR(N) is less than
VARCHAR2(N), and the result is padded to the specified length).

Note that this query works as it should in sqldeveloper, and also in some
code that uses the .NET sqlclient api.

The query I'm using is

select 'APPLICATION' as sourceid,
  'http://app.company.com' || '/app/report.aspx?trsid=' ||
to_char(incident_no) as "URL",
  incident_no, trans_date, location,
  responsible_unit, process_eng, product_eng,
  case_title, case_description,
  index_lob,
  investigated, investigated_eng,
  to_char(modified_date, 'YYYY-MM-DD"T"HH24:MI:SS"Z"') as modified_date
  from synx.dw_fast
  where (investigated <> 3)

while the view is

INCIDENT_NO         NUMBER(38)
TRANS_DATE          VARCHAR2(8)
LOCATION            VARCHAR2(4000)
RESPONSIBLE_UNIT    VARCHAR2(4000)
PROCESS_ENG         VARCHAR2(4000)
PROCESS_NO          VARCHAR2(4000)
PRODUCT_ENG         VARCHAR2(4000)
PRODUCT_NO          VARCHAR2(4000)
CASE_TITLE          VARCHAR2(4000)
CASE_DESCRIPTION    VARCHAR2(4000)
INDEX_LOB           CLOB
INVESTIGATED        NUMBER(38)
INVESTIGATED_ENG    VARCHAR2(254)
INVESTIGATED_NO     VARCHAR2(254)
MODIFIED_DATE       DATE


Re: SolrCloud 4.x hangs under high update volume

2013-09-11 Thread Tim Vaillancourt

Thanks Erick!

Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 
patch. I think that is a very, very useful patch by the way. SOLR-5232 
seems promising as well.


I see your point on the more-shards idea, this is obviously a 
global/instance-level lock. If I really had to, I suppose I could run 
more Solr instances to reduce locking then? Currently I have 2 cores per 
instance and I could go 1-to-1 to simplify things.


The good news is we seem to be more stable since changing to a bigger 
client->solr batch-size and fewer client threads updating.
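
For reference, a minimal SolrJ sketch of the batching approach described here
(ZooKeeper hosts, collection name, field names and the batch size of 200 are
assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedUpdates {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "document " + i);
            batch.add(doc);
            if (batch.size() == 200) {   // one HTTP request per 200 docs instead of per doc
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
        server.shutdown();
    }
}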


Cheers,

Tim

On 11/09/13 04:19 AM, Erick Erickson wrote:

If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
copy of the 4x branch. By "recent", I mean like today, it looks like Mark
applied this early this morning. But several reports indicate that this will
solve your problem.

I would expect that increasing the number of shards would make the problem
worse, not better.

There's also SOLR-5232...

Best
Erick


On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt wrote:


Hey guys,

Based on my understanding of the problem we are encountering, I feel we've
been able to reduce the likelihood of this issue by making the following
changes to our app's usage of SolrCloud:

1) We increased our document batch size to 200 from 10 - our app batches
updates to reduce HTTP requests/overhead. The theory is increasing the
batch size reduces the likelihood of this issue happening.
2) We reduced to 1 application node sending updates to SolrCloud - we write
Solr updates to Redis, and have previously had 4 application nodes pushing
the updates to Solr (popping off the Redis queue). Reducing the number of
nodes pushing to Solr reduces the concurrency on SolrCloud.
3) Fewer threads pushing to SolrCloud - due to the increase in batch size,
we were able to go down to 5 update threads on the update-pushing-app (from
10 threads).

To be clear, the above only reduces the likelihood of the issue happening,
and DOES NOT actually resolve the issue at hand.

If we happen to encounter issues with the above 3 changes, the next steps
(I could use some advice on) are:

1) Increase the number of shards (2x) - the theory here is this reduces the
locking on shards because there are more shards. Am I onto something here,
or will this not help at all?
2) Use CloudSolrServer - currently we have a plain-old least-connection
HTTP VIP. If we go "direct" to what we need to update, this will reduce
concurrency in SolrCloud a bit. Thoughts?

Thanks all!

Cheers,

Tim


On 6 September 2013 14:47, Tim Vaillancourt  wrote:


Enjoy your trip, Mark! Thanks again for the help!

Tim


On 6 September 2013 14:18, Mark Miller  wrote:


Okay, thanks, useful info. Getting on a plane, but I'll look more at this
soon. That 10k thread spike is good to know - that's no good and could
easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt wrote:


Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours
without this patch, but that isn't consistent. Sometimes we would only get
to 6 hours or less.

During the crash I can see an amazing spike in threads to 10k, which is
essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
cannot open native thread" errors that always follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some
shard recovery/replication though, so that CPU may be a symptom of the
replication, or is possibly the root cause. The CPU spikes from about
20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole
index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash
(about 1300kbps written to 3500kbps), but this may have been the
replication, or ERROR logging (we generally log nothing due to
WARN-severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea
what it is (may be useful or not):

"java.lang.IllegalStateException :
  at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
  at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)

Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
Thanks, guys. Now I know a little more about DocValues and realize that 
they will do the job wrt FieldCache.


Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:

Per,  check zee Wiki, there is a page describing docvalues. We used them
successfully in a solr for analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 9:15 AM, "Michael Sokolov" wrote:


On 09/11/2013 08:40 AM, Per Steffensen wrote:


The reason I mention sort is that we in my project, half a year ago, have
dealt with the FieldCache->OOM-problem when doing sort-requests. We
basically just reject sort-requests unless they hit below X documents - in
case they do we just find them without sorting and sort them ourselves
afterwards.

Currently our problem is that we have to do a group/distinct (in
SQL language) query, and we have found that we can do what we want to do
using grouping (http://wiki.apache.org/solr/FieldCollapsing)
or faceting - either will work for us. The problem is that they both use
FieldCache, and we "know" that using FieldCache will lead to OOM exceptions
with the amount of data each of our Solr nodes administrates. This time we
really have no option of just "limiting" usage as we did with sort. Therefore
we need a group/distinct functionality that works even on huge data amounts
(and an algorithm using FieldCache will not).

I believe setting facet.method=enum will actually make facet not use the
FieldCache. Is that true? Is it a bad idea?

I do not know much about DocValues, but I do not believe that you will
avoid FieldCache by using DocValues? Please elaborate, or point to
documentation where I will be able to read that I am wrong. Thanks!


There is Simon Willnauer's presentation
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

and this blog post
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

and this one that shows some performance comparisons:
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
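
On the facet.method=enum question raised above, a minimal SolrJ sketch (field
name, query and URL are assumptions) of requesting enum-based faceting, which
enumerates terms and uses the filter cache rather than building a FieldCache
entry:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");          // assumed field name
        q.set("facet.method", "enum");        // term enumeration instead of FieldCache
        q.set("facet.limit", -1);             // all distinct values, no cut-off
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("category").getValues());
        server.shutdown();
    }
}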








Re: number of replicas in Cloud

2013-09-11 Thread Prasi S
Hi Anshum,
I'm using Solr 4.4. Is there a problem with using a replicationFactor of 2?




On Thu, Sep 12, 2013 at 11:20 AM, Anshum Gupta wrote:

> Prasi, a replicationFactor of 2 is what you want. However, as of the
> current releases, this is not persisted.
>
>
>
> On Thu, Sep 12, 2013 at 11:17 AM, Prasi S  wrote:
>
> > Hi,
> > I want to setup solrcloud with 2 shards and 1 replica for each shard.
> >
> > MyCollection
> >
> > shard1 , shard2
> > shard1-replica , shard2-replica
> >
> > In this case, I would set numShards=2. For replicationFactor, should I give
> > replicationFactor=1 or replicationFactor=2?
> >
> > Please advise.
> > Pls suggest me.
> >
> > thanks,
> > Prasi
> >
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>


Re: number of replicas in Cloud

2013-09-11 Thread Anshum Gupta
Prasi, a replicationFactor of 2 is what you want. However, as of the
current releases, this is not persisted.



On Thu, Sep 12, 2013 at 11:17 AM, Prasi S  wrote:

> Hi,
> I want to setup solrcloud with 2 shards and 1 replica for each shard.
>
> MyCollection
>
> shard1 , shard2
> shard1-replica , shard2-replica
>
> In this case, I would set numShards=2. For replicationFactor, should I give
> replicationFactor=1 or replicationFactor=2?
>
>
> Please advise.
>
> thanks,
> Prasi
>



-- 

Anshum Gupta
http://www.anshumgupta.net


number of replicas in Cloud

2013-09-11 Thread Prasi S
Hi,
I want to setup solrcloud with 2 shards and 1 replica for each shard.

MyCollection

shard1 , shard2
shard1-replica , shard2-replica

In this case, I would set numShards=2. For replicationFactor, should I give
replicationFactor=1 or replicationFactor=2?


Please advise.

thanks,
Prasi


Wrapper for SOLR for Compression

2013-09-11 Thread William Bell
I asked this before... but can we add a parameter in solrconfig.xml to expose
the compression modes?



> https://issues.apache.org/jira/browse/LUCENE-4226
> It mentions that we can set the compression mode:
> FAST, HIGH_COMPRESSION, FAST_DECOMPRESSION.
>

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Can we used CloudSolrServer for searching data

2013-09-11 Thread Dharmendra Jaiswal
Thanks for your reply.

I am using SolrCloud with a ZooKeeper setup,
and CloudSolrServer for both indexing and searching.
As per my understanding, CloudSolrServer uses LBHttpSolrServer by default:
it connects to ZooKeeper and passes all the live server nodes
to LBHttpSolrServer.

Thanks once again to you guys for your replies.
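
For completeness, a minimal SolrJ sketch of that usage (ZooKeeper hosts,
collection name, field names and query are assumptions): CloudSolrServer
discovers live nodes from ZooKeeper and load-balances across them via
LBHttpSolrServer internally.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class CloudSearchExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection");

        // Index one document; the cluster forwards it to the right shard.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "hello solrcloud");
        server.add(doc);
        server.commit();

        // Search; the request is load-balanced across live nodes.
        QueryResponse rsp = server.query(new SolrQuery("title:hello"));
        System.out.println(rsp.getResults().getNumFound());
        server.shutdown();
    }
}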






Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Very helpful link. Thanks for sharing that.


-Deepak



On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey  wrote:

> On 9/11/2013 4:16 PM, Deepak Konidena wrote:
>
>> As far as RAM usage goes, I believe we set the heap size to about 40% of
>> the RAM and less than 10% is available for OS caching ( since replica
>> takes
>> another 40%). Why does unallocated RAM help? How does it impact
>> performance
>> under load?
>>
>
> Because once the data is in the OS disk cache, reading it becomes
> instantaneous, it doesn't need to go out to the disk.  Disks are glacial
> compared to RAM.  Even SSD has a far slower response time.  Any recent
> operating system does this automatically, including the one from Redmond
> that we all love to hate.
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Thanks,
> Shawn
>
>


Re: charset encoding

2013-09-11 Thread Otis Gospodnetic
Using tomcat by any chance? The ML archive has the solution. May be on
Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 8:56 AM, "Andreas Owen"  wrote:

> I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are
> ISO-8859-1 (ANSI) encoded, and the meta tag "content-encoding" says so as
> well. The server HTTP header says it's UTF-8, and the Firefox Web Developer
> toolbar agrees.
>
> When I index a page with special chars like ä, ö, ü, Solr outputs them as
> completely foreign signs, not the usual wrong chars with 1/4 or the flag
> symbol in them. So it seems it's not simply the normal UTF-8/ISO-8859-1
> discrepancy. Has anyone got an idea what's wrong?
>
>


Re: No or limited use of FieldCache

2013-09-11 Thread Otis Gospodnetic
Per,  check zee Wiki, there is a page describing docvalues. We used them
successfully in a solr for analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 9:15 AM, "Michael Sokolov" wrote:

> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>
>> The reason I mention sort is that we in my project, half a year ago, have
>> dealt with the FieldCache->OOM-problem when doing sort-requests. We
>> basically just reject sort-requests unless they hit below X documents - in
>> case they do we just find them without sorting and sort them ourselves
>> afterwards.
>>
>> Currently our problem is that we have to do a group/distinct (in
>> SQL language) query, and we have found that we can do what we want to do
>> using grouping (http://wiki.apache.org/solr/FieldCollapsing)
>> or faceting - either will work for us. The problem is that they both use
>> FieldCache, and we "know" that using FieldCache will lead to OOM exceptions
>> with the amount of data each of our Solr nodes administrates. This time we
>> really have no option of just "limiting" usage as we did with sort. Therefore
>> we need a group/distinct functionality that works even on huge data amounts
>> (and an algorithm using FieldCache will not).
>>
>> I believe setting facet.method=enum will actually make facet not use the
>> FieldCache. Is that true? Is it a bad idea?
>>
>> I do not know much about DocValues, but I do not believe that you will
>> avoid FieldCache by using DocValues? Please elaborate, or point to
>> documentation where I will be able to read that I am wrong. Thanks!
>>
> There is Simon Willnauer's presentation
> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>
> and this blog post
> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>
> and this one that shows some performance comparisons:
> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>
>
>
>


Re: solr performance against oracle

2013-09-11 Thread Chris Hostetter

Setting aside the excellent responses that have already been made in this 
thread, there are fundamental discrepancies in what you are comparing in 
your respective timing tests.

first off: a micro benchmark like this is virtually useless -- unless you 
really plan on only ever executing a single query in a single run of a 
java application that then terminates, trying to time a single query is 
silly -- you should do lots and lots of iterations using a large set of 
sample inputs.

Second: what you are timing is vastly different between the two cases.

In your Solr timing, no communication happens over the wire to the Solr 
server until the call to server.query() inside your time stamps -- if you 
were doing multiple requests using the same SolrServer object, the HTTP 
connection would get re-used, but as things stand your timing includes all 
of the network overhead of connecting to the server, sending the request, 
and reading the response.

In your Oracle method, however, the timestamps you record are only around 
the call to executeQuery(), rs.next(), and rs.getString() ... you are 
ignoring the time necessary for the getConnection() and 
prepareStatement() methods, which may be significant as they both involve 
over-the-wire communication with the remote server (and it's not like 
these are one-time execute-and-forget methods ... in a real 
long-lived application you'd need to manage your connections, re-open them 
if they get closed, recreate the prepared statement if your connection has 
to be re-opened, etc.).

Your comparison is definitely apples and oranges.
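
To make that point concrete, a minimal SolrJ sketch of a fairer micro-benchmark
(URL, id values and iteration counts are assumptions): one reused
HttpSolrServer, a warm-up pass, and many timed iterations.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SolrLookupBench {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Warm-up: JIT compilation, Solr caches and the HTTP connection
        // should not be part of the measurement.
        for (int i = 0; i < 500; i++) {
            server.query(new SolrQuery("id:" + i));
        }
        int iterations = 4000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            server.query(new SolrQuery("id:" + i));
        }
        double msPerRequest = (System.nanoTime() - start) / 1000000.0 / iterations;
        System.out.println(msPerRequest + " ms/request");
        server.shutdown();
    }
}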


Lastly, as others have mentioned: 150-200ms to request a single document 
by uniqueKey from an index containing 800K docs seems ridiculously slow, 
and suggests that something is poorly configured about your solr instance 
(another apples to oranges comparison: you've got an ad-hoc solr 
installation setup on your laptop and you're benchmarking it against a 
remote oracle server running on dedicated remote hardware that has 
probably been heavily tuned/optimized for queries).  

You haven't provided us any details however about how your index is setup, 
or how you have configured solr, or what JVM options you are using to run 
solr, or what physical resources are available to your solr process (disk, 
jvm heap ram, os file system cache ram) so there isn't much we can offer 
in the way of advice on how to speed things up.


FWIW: On my laptop, using Solr 4.4 w/ the example configs and built-in 
jetty (ie: "java -jar start.jar") I got a 3.4 GB max heap, and a 1.5 GB 
default heap, with plenty of physical ram left over for the os file system 
cache of an index I created containing 1,000,000 documents with 6 small 
fields containing small amounts of random terms.  I then used curl to 
execute ~4150 requests for documents by id (using simple search, not the 
/get RTG handler) and return the results using JSON.

This completed in under 4.5 seconds, or ~1.0ms/request.

Using the more verbose XML response format (after restarting solr to 
ensure nothing was in the query result caches) only took 0.3 seconds longer 
on the total time (~1.1ms/request).

$ time curl -sS 
'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=json&indent=true'
 > /dev/null

real0m4.471s
user0m0.412s
sys 0m0.116s
$ time curl -sS 
'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=xml&indent=true'
 > /dev/null

real0m4.868s
user0m0.376s
sys 0m0.136s
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ uname -a
Linux frisbee 3.2.0-52-generic #78-Ubuntu SMP Fri Jul 26 16:21:44 UTC 2013 
x86_64 x86_64 x86_64 GNU/Linux






-Hoss


Re: Grouping by field substring?

2013-09-11 Thread Jack Krupansky
Do a copyField to another field, with a limit of 8 characters, and then use 
that other field.
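
A minimal SolrJ sketch of the second half of that advice, assuming a
hypothetical field "prefix8" populated by such a copyField (e.g. via its
maxChars="8" attribute) onto a plain string field:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupByPrefixExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", "prefix8");   // group on the truncated copy, not the original field
        QueryResponse rsp = server.query(q);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            System.out.println(cmd.getName() + ": " + cmd.getValues().size() + " groups returned");
        }
        server.shutdown();
    }
}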


-- Jack Krupansky

-Original Message- 
From: Ken Krugler

Sent: Wednesday, September 11, 2013 8:24 PM
To: solr-user@lucene.apache.org
Subject: Grouping by field substring?

Hi all,

Assuming I want to use the first N characters of a specific field for 
grouping results, is such a thing possible out-of-the-box?


If not, then what would the next best option be? E.g. a custom function 
query?


Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Grouping by field substring?

2013-09-11 Thread Ken Krugler
Hi all,

Assuming I want to use the first N characters of a specific field for grouping 
results, is such a thing possible out-of-the-box?

If not, then what would the next best option be? E.g. a custom function query?

Thanks,

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Do I need to delete my index?

2013-09-11 Thread Brian Robinson

Thanks Erick

On 9/11/2013 6:46 PM, Erick Erickson wrote:

Typically I'll just delete the entire data dir recursively after shutting
down Solr, the default location is /solr/collectionblah/data


On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson
wrote:


Thanks Shawn. I had actually tried changing &load= to &amp;load=, but
still got the error. It sounds like addDocuments is worth a try, though.


On 9/11/2013 4:37 PM, Shawn Heisey wrote:


On 9/11/2013 2:17 PM, Brian Robinson wrote:


I'm in the process of creating my index using a series of
SolrClient::request commands in PHP. I ran into a problem when some of
the fields that I had as "text_general" fieldType contained "&load=" in
a URL, triggering an error because the HTML entity "load" wasn't
recognized. I realized that I should have made my URL fields of type
"string" instead, so that they would be taken as is (they're not being
indexed, just stored), so I removed all docs from my index, updated
schema.xml, and restarted Solr, but I'm still getting the same error. Do
I need to delete the index itself and then restart to get this to work?
Am I correct that changing those fields to "string" type should fix the
issue?


Changing the field type is not going to affect this issue. Because you
are not indexing the field, the choice of string or text_general is not
really going to matter, but string will probably be more efficient.

What is happening here is an XML issue with the update request itself.
The PHP client is sending an XML update request to Solr, and the request
includes the URL text as-is in the XML request. It is not properly XML
encoded.  For an XML update request, that snippet of your text would need
to be encoded as "&amp;load=" to work properly.

XML has a much smaller list of valid entities than HTML, but "load" is
not a valid entity in either XML or HTML.

I was going to call this a bug in the PHP client library, but then I got
a look at what SolrClient::request actually does:

http://php.net/manual/en/solrclient.request.php

It expects you to create the XML yourself, which means you have to do all
the encoding of characters which have special meaning to XML.

If you have no desire to figure out proper XML encoding, you should
probably be using SolrClient::addDocument or SolrClient::addDocuments
instead.

Thanks,
Shawn







Re: Do I need to delete my index?

2013-09-11 Thread Erick Erickson
Typically I'll just delete the entire data dir recursively after shutting
down Solr, the default location is /solr/collectionblah/data


On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson
wrote:

> Thanks Shawn. I had actually tried changing &load= to &amp;load=, but
> still got the error. It sounds like addDocuments is worth a try, though.
>
>
> On 9/11/2013 4:37 PM, Shawn Heisey wrote:
>
>> On 9/11/2013 2:17 PM, Brian Robinson wrote:
>>
>>> I'm in the process of creating my index using a series of
>>> SolrClient::request commands in PHP. I ran into a problem when some of
>>> the fields that I had as "text_general" fieldType contained "&load=" in
>>> a URL, triggering an error because the HTML entity "load" wasn't
>>> recognized. I realized that I should have made my URL fields of type
>>> "string" instead, so that they would be taken as is (they're not being
>>> indexed, just stored), so I removed all docs from my index, updated
>>> schema.xml, and restarted Solr, but I'm still getting the same error. Do
>>> I need to delete the index itself and then restart to get this to work?
>>> Am I correct that changing those fields to "string" type should fix the
>>> issue?
>>>
>>
>> Changing the field type is not going to affect this issue. Because you
>> are not indexing the field, the choice of string or text_general is not
>> really going to matter, but string will probably be more efficient.
>>
>> What is happening here is an XML issue with the update request itself.
>> The PHP client is sending an XML update request to Solr, and the request
>> includes the URL text as-is in the XML request. It is not properly XML
>> encoded.  For an XML update request, that snippet of your text would need
>> to be encoded as "&amp;load=" to work properly.
>>
>> XML has a much smaller list of valid entities than HTML, but "load" is
>> not a valid entity in either XML or HTML.
>>
>> I was going to call this a bug in the PHP client library, but then I got
>> a look at what SolrClient::request actually does:
>>
>> http://php.net/manual/en/solrclient.request.php
>>
>> It expects you to create the XML yourself, which means you have to do all
>> the encoding of characters which have special meaning to XML.
>>
>> If you have no desire to figure out proper XML encoding, you should
>> probably be using SolrClient::addDocument or SolrClient::addDocuments
>> instead.
>>
>> Thanks,
>> Shawn
>>
>>
>>
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Shawn Heisey

On 9/11/2013 4:16 PM, Deepak Konidena wrote:

As far as RAM usage goes, I believe we set the heap size to about 40% of
the RAM and less than 10% is available for OS caching ( since replica takes
another 40%). Why does unallocated RAM help? How does it impact performance
under load?


Because once the data is in the OS disk cache, reading it becomes 
instantaneous, it doesn't need to go out to the disk.  Disks are glacial 
compared to RAM.  Even SSD has a far slower response time.  Any recent 
operating system does this automatically, including the one from Redmond 
that we all love to hate.


http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Thanks,
Shawn



Re: Do I need to delete my index?

2013-09-11 Thread Brian Robinson
Thanks Shawn. I had actually tried changing &load= to &amp;load=, but 
still got the error. It sounds like addDocuments is worth a try, though.


On 9/11/2013 4:37 PM, Shawn Heisey wrote:

On 9/11/2013 2:17 PM, Brian Robinson wrote:

I'm in the process of creating my index using a series of
SolrClient::request commands in PHP. I ran into a problem when some of
the fields that I had as "text_general" fieldType contained "&load=" in
a URL, triggering an error because the HTML entity "load" wasn't
recognized. I realized that I should have made my URL fields of type
"string" instead, so that they would be taken as is (they're not being
indexed, just stored), so I removed all docs from my index, updated
schema.xml, and restarted Solr, but I'm still getting the same error. Do
I need to delete the index itself and then restart to get this to work?
Am I correct that changing those fields to "string" type should fix the
issue?


Changing the field type is not going to affect this issue. Because you 
are not indexing the field, the choice of string or text_general is 
not really going to matter, but string will probably be more efficient.


What is happening here is an XML issue with the update request itself. 
The PHP client is sending an XML update request to Solr, and the 
request includes the URL text as-is in the XML request. It is not 
properly XML encoded.  For an XML update request, that snippet of your 
text would need to be encoded as "&amp;load=" to work properly.


XML has a much smaller list of valid entities than HTML, but "load" is 
not a valid entity in either XML or HTML.


I was going to call this a bug in the PHP client library, but then I 
got a look at what SolrClient::request actually does:


http://php.net/manual/en/solrclient.request.php

It expects you to create the XML yourself, which means you have to do 
all the encoding of characters which have special meaning to XML.


If you have no desire to figure out proper XML encoding, you should 
probably be using SolrClient::addDocument or SolrClient::addDocuments 
instead.


Thanks,
Shawn






Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Thanks for the suggestion. Will pass it along to my folks.

@Shawn - That's the link I was looking for 'non-SolrCloud approach to
distributed search'. Thanks for passing that along. Will give it a try.

As far as RAM usage goes, I believe we set the heap size to about 40% of
the RAM and less than 10% is available for OS caching ( since replica takes
another 40%). Why does unallocated RAM help? How does it impact performance
under load?


-Deepak



On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey  wrote:

> On 9/11/2013 2:57 PM, Deepak Konidena wrote:
>
>> I guess at this point in the discussion, I should probably give some more
>> background on why I am doing what I am doing. Having a single Solr shard
>> (multiple segments) on the same disk is posing severe performance problems
>> under load,in that, calls to Solr cause a lot of connection timeouts. When
>> we looked at the ganglia stats for the Solr box, we saw that while memory,
>> cpu and network usage were quite normal, the i/o wait spiked. We are
>> unsure
>> on what caused the i/o wait and why there were no spikes in the cpu/memory
>> usage. Since the Solr box is a beefy box (multi-core setup, huge ram,
>> SSD),
>> we'd like to distribute the segments to multiple locations (disks) and see
>> whether this improves performance under load.
>>
>> @Greg - Thanks for clarifying that.  I just learnt that I can't set them
>> up
>> using RAID as some of them are SSDs and some others are SATA (spinning
>> disks).
>>
>> @Shawn Heisey - Could you elaborate more about the "broker" core and
>> delegating the requests to other cores?
>>
>
> On the broker core - I have a core on my servers that has no index of its
> own.  In the /select handler (and others) I have placed a shards parameter,
> and many of them also have a shards.qt parameter.  The shards parameter is
> how a non-cloud distributed search is done.
>
> http://wiki.apache.org/solr/DistributedSearch
>
> Addressing your first paragraph: You say that you have lots of RAM ... but
> is there a lot of unallocated RAM that the OS can use for caching, or is it
> mostly allocated to processes, such as the java heap for Solr?
>
> Depending on exactly how your indexes are composed, you need up to 100% of
> the total index size available as unallocated RAM.  With SSD, the
> requirement is less, but cannot be ignored.  I personally wouldn't go below
> about 25-50% even with SSD, and I'd plan on 50-100% for regular disks.
>
> There is some evidence to suggest that you only need unallocated RAM equal
> to 10% of your index size for caching with SSD, but that is only likely to
> work if you have a lot of stored (as opposed to indexed) data.  If most of
> your index is unstored, then more would be required.
>
> Thanks,
> Shawn
>
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Shawn Heisey

On 9/11/2013 2:57 PM, Deepak Konidena wrote:

I guess at this point in the discussion, I should probably give some more
background on why I am doing what I am doing. Having a single Solr shard
(multiple segments) on the same disk is posing severe performance problems
under load,in that, calls to Solr cause a lot of connection timeouts. When
we looked at the ganglia stats for the Solr box, we saw that while memory,
cpu and network usage were quite normal, the i/o wait spiked. We are unsure
on what caused the i/o wait and why there were no spikes in the cpu/memory
usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD),
we'd like to distribute the segments to multiple locations (disks) and see
whether this improves performance under load.

@Greg - Thanks for clarifying that.  I just learnt that I can't set them up
using RAID as some of them are SSDs and some others are SATA (spinning
disks).

@Shawn Heisey - Could you elaborate more about the "broker" core and
delegating the requests to other cores?


On the broker core - I have a core on my servers that has no index of 
its own.  In the /select handler (and others) I have placed a shards 
parameter, and many of them also have a shards.qt parameter.  The shards 
parameter is how a non-cloud distributed search is done.


http://wiki.apache.org/solr/DistributedSearch

Addressing your first paragraph: You say that you have lots of RAM ... 
but is there a lot of unallocated RAM that the OS can use for caching, 
or is it mostly allocated to processes, such as the java heap for Solr?


Depending on exactly how your indexes are composed, you need up to 100% 
of the total index size available as unallocated RAM.  With SSD, the 
requirement is less, but cannot be ignored.  I personally wouldn't go 
below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks.


There is some evidence to suggest that you only need unallocated RAM 
equal to 10% of your index size for caching with SSD, but that is only 
likely to work if you have a lot of stored (as opposed to indexed) data. 
 If most of your index is unstored, then more would be required.


Thanks,
Shawn



ReplicationFactor for solrcloud

2013-09-11 Thread Aditya Sakhuja
Hi -

I am trying to set up 3 shards and 3 replicas for my solrcloud deployment
with 3 servers, specifying replicationFactor=3 and numShards=3 when
starting the first node. I see each of the servers allocated to one shard
each; however, I do not see 3 replicas allocated on each node.

I specifically need to have 3 replicas across 3 servers with 3 shards. Do
you see any reason not to have this configuration?

-- 
Regards,
-Aditya Sakhuja


Re: Do I need to delete my index?

2013-09-11 Thread Shawn Heisey

On 9/11/2013 3:17 PM, Brian Robinson wrote:

In addition, if I do need to delete my index, how do I go about that?
I've been looking through the documentation and can't find anything
specific. I know where the index is, I'm just not sure which files to
delete.


Generally you'll find it in a path that ends with data/index ... but if 
you have messed with dataDir, it might just end in /index instead.  Here 
is an example of index directory contents, from a system that *is* 
changing dataDir:


ncindex@bigindy5 /index/solr4/data/s2_0 $ echo `ls -1 index`
_m5o.fdt _m5o.fdx _m5o.fnm _m5o_Lucene41_0.doc _m5o_Lucene41_0.pos 
_m5o_Lucene41_0.tim _m5o_Lucene41_0.tip _m5o_Lucene45_0.dvd 
_m5o_Lucene45_0.dvm _m5o.nvd _m5o.nvm _m5o.si _m5o.tvd _m5o.tvx 
_m5o_z.del _m5p.fdt _m5p.fdx _m5p.fnm _m5p_Lucene41_0.doc 
_m5p_Lucene41_0.pos _m5p_Lucene41_0.tim _m5p_Lucene41_0.tip 
_m5p_Lucene45_0.dvd _m5p_Lucene45_0.dvm _m5p.nvd _m5p.nvm _m5p.si 
_m5p.tvd _m5p.tvx _m5v.fdt _m5v.fdx _m5v.fnm _m5v_Lucene41_0.doc 
_m5v_Lucene41_0.pos _m5v_Lucene41_0.tim _m5v_Lucene41_0.tip 
_m5v_Lucene45_0.dvd _m5v_Lucene45_0.dvm _m5v.nvd _m5v.nvm _m5v.si 
_m5v.tvd _m5v.tvx _m5w.fdt _m5w.fdx _m5w.fnm _m5w_Lucene41_0.doc 
_m5w_Lucene41_0.pos _m5w_Lucene41_0.tim _m5w_Lucene41_0.tip 
_m5w_Lucene45_0.dvd _m5w_Lucene45_0.dvm _m5w.nvd _m5w.nvm _m5w.si 
_m5w.tvd _m5w.tvx _m5x.fdt _m5x.fdx _m5x.fnm _m5x_Lucene41_0.doc 
_m5x_Lucene41_0.pos _m5x_Lucene41_0.tim _m5x_Lucene41_0.tip 
_m5x_Lucene45_0.dvd _m5x_Lucene45_0.dvm _m5x.nvd _m5x.nvm _m5x.si 
_m5x.tvd _m5x.tvx _m5y.fdt _m5y.fdx _m5y.fnm _m5y_Lucene41_0.doc 
_m5y_Lucene41_0.pos _m5y_Lucene41_0.tim _m5y_Lucene41_0.tip 
_m5y_Lucene45_0.dvd _m5y_Lucene45_0.dvm _m5y.nvd _m5y.nvm _m5y.si 
_m5y.tvd _m5y.tvx _m5z.fdt _m5z.fdx _m5z.fnm _m5z_Lucene41_0.doc 
_m5z_Lucene41_0.pos _m5z_Lucene41_0.tim _m5z_Lucene41_0.tip 
_m5z_Lucene45_0.dvd _m5z_Lucene45_0.dvm _m5z.nvd _m5z.nvm _m5z.si 
_m5z.tvd _m5z.tvx segments_554 segments.gen write.lock


Thanks,
Shawn



RE: Distributing lucene segments across multiple disks.

2013-09-11 Thread Greg Walters
Deepak,

It might be a bit outside what you're willing to consider but you can make a 
raid out of your spinning disks then use your SSD(s) as a dm-cache device to 
accelerate reads and writes to the raid device. If you're putting lucene 
indexes on a mixed bag of disks and ssd's without any type of control for what 
goes where you'd want to use the ssd to accelerate the spinning disks anyway. 
Check out http://lwn.net/Articles/540996/ for more information on the dm-cache 
device.

Thanks,
Greg

-Original Message-
From: Deepak Konidena [mailto:deepakk...@gmail.com] 
Sent: Wednesday, September 11, 2013 3:57 PM
To: solr-user@lucene.apache.org
Subject: Re: Distributing lucene segments across multiple disks.

I guess at this point in the discussion, I should probably give some more 
background on why I am doing what I am doing. Having a single Solr shard 
(multiple segments) on the same disk is posing severe performance problems 
under load,in that, calls to Solr cause a lot of connection timeouts. When we 
looked at the ganglia stats for the Solr box, we saw that while memory, cpu and 
network usage were quite normal, the i/o wait spiked. We are unsure on what 
caused the i/o wait and why there were no spikes in the cpu/memory usage. Since 
the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to 
distribute the segments to multiple locations (disks) and see whether this 
improves performance under load.

@Greg - Thanks for clarifying that.  I just learnt that I can't set them up 
using RAID as some of them are SSDs and some others are SATA (spinning disks).

@Shawn Heisey - Could you elaborate more about the "broker" core and delegating 
the requests to other cores?


-Deepak



On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey  wrote:

> On 9/11/2013 1:07 PM, Deepak Konidena wrote:
>
>> Are you suggesting a multi-core setup, where all the cores share the 
>> same schema, and the cores lie on different disks?
>>
>> Basically, I'd like to know if I can distribute shards/segments on a 
>> single machine (with multiple disks) without the use of zookeeper.
>>
>
> Sure, you can do it all manually.  At that point you would not be using
> SolrCloud at all, because the way to enable SolrCloud is to tell Solr where
> zookeeper lives.
>
> Without SolrCloud, there is no cluster automation at all.  There is no
> "collection" paradigm, you just have cores.  You have to send updates to
> the correct core; they will not be redirected for you.  Similarly, queries will
> not be load balanced automatically.  For Java clients, the CloudSolrServer
> object can work seamlessly when servers go down.  If you're not using
> SolrCloud, you can't use CloudSolrServer.
>
> You would be in charge of creating the shards parameter yourself.  The way
> that I do this on my index is that I have a "broker" core that has no index
> of its own, but its solrconfig.xml has the shards and shards.qt parameters
> in all the request handler definitions.  You can also include the parameter
> with the query.
>
> You would also have to handle redundancy yourself, either with replication
> or with independently updated indexes.  I use the latter method, because it
> offers a lot more flexibility than replication.
>
> As mentioned in another reply, setting up RAID with a lot of disks may be
> better than trying to split your index up on different filesystems that
> each reside on different disks.  I would recommend RAID10 for Solr, and it
> works best if it's hardware RAID and the controller has battery-backed (or
> NVRAM) cache.
>
> Thanks,
> Shawn
>
>


Re: Do I need to delete my index?

2013-09-11 Thread Shawn Heisey

On 9/11/2013 2:17 PM, Brian Robinson wrote:

I'm in the process of creating my index using a series of
SolrClient::request commands in PHP. I ran into a problem when some of
the fields that I had as "text_general" fieldType contained "&load=" in
a URL, triggering an error because the HTML entity "load" wasn't
recognized. I realized that I should have made my URL fields of type
"string" instead, so that they would be taken as is (they're not being
indexed, just stored), so I removed all docs from my index, updated
schema.xml, and restarted Solr, but I'm still getting the same error. Do
I need to delete the index itself and then restart to get this to work?
Am I correct that changing those fields to "string" type should fix the
issue?


Changing the field type is not going to affect this issue.  Because you 
are not indexing the field, the choice of string or text_general is not 
really going to matter, but string will probably be more efficient.


What is happening here is an XML issue with the update request itself. 
The PHP client is sending an XML update request to Solr, and the request 
includes the URL text as-is in the XML request.  It is not properly XML 
encoded.  For an XML update request, that snippet of your text would 
need to be encoded as "&amp;load=" to work properly.


XML has a much smaller list of valid entities than HTML, but "load" is 
not a valid entity in either XML or HTML.


I was going to call this a bug in the PHP client library, but then I got 
a look at what SolrClient::request actually does:


http://php.net/manual/en/solrclient.request.php

It expects you to create the XML yourself, which means you have to do 
all the encoding of characters which have special meaning to XML.


If you have no desire to figure out proper XML encoding, you should 
probably be using SolrClient::addDocument or SolrClient::addDocuments 
instead.


Thanks,
Shawn



Re: Do I need to delete my index?

2013-09-11 Thread Brian Robinson
In addition, if I do need to delete my index, how do I go about that? 
I've been looking through the documentation and can't find anything 
specific. I know where the index is, I'm just not sure which files to 
delete.



Hello,
I'm in the process of creating my index using a series of 
SolrClient::request commands in PHP. I ran into a problem when some of 
the fields that I had as "text_general" fieldType contained "&load=" 
in a URL, triggering an error because the HTML entity "load" wasn't 
recognized. I realized that I should have made my URL fields of type 
"string" instead, so that they would be taken as is (they're not being 
indexed, just stored), so I removed all docs from my index, updated 
schema.xml, and restarted Solr, but I'm still getting the same error. 
Do I need to delete the index itself and then restart to get this to 
work? Am I correct that changing those fields to "string" type should 
fix the issue?

Thanks,
Brian






Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
I guess at this point in the discussion, I should probably give some more
background on why I am doing what I am doing. Having a single Solr shard
(multiple segments) on the same disk is posing severe performance problems
under load,in that, calls to Solr cause a lot of connection timeouts. When
we looked at the ganglia stats for the Solr box, we saw that while memory,
cpu and network usage were quite normal, the i/o wait spiked. We are unsure
on what caused the i/o wait and why there were no spikes in the cpu/memory
usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD),
we'd like to distribute the segments to multiple locations (disks) and see
whether this improves performance under load.

@Greg - Thanks for clarifying that.  I just learnt that I can't set them up
using RAID as some of them are SSDs and some others are SATA (spinning
disks).

@Shawn Heisey - Could you elaborate more about the "broker" core and
delegating the requests to other cores?


-Deepak



On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey  wrote:

> On 9/11/2013 1:07 PM, Deepak Konidena wrote:
>
>> Are you suggesting a multi-core setup, where all the cores share the same
>> schema, and the cores lie on different disks?
>>
>> Basically, I'd like to know if I can distribute shards/segments on a
>> single
>> machine (with multiple disks) without the use of zookeeper.
>>
>
> Sure, you can do it all manually.  At that point you would not be using
> SolrCloud at all, because the way to enable SolrCloud is to tell Solr where
> zookeeper lives.
>
> Without SolrCloud, there is no cluster automation at all.  There is no
> "collection" paradigm, you just have cores.  You have to send updates to
> the correct core; they will not be redirected for you. Similarly, queries will
> not be load balanced automatically.  For Java clients, the CloudSolrServer
> object can work seamlessly when servers go down.  If you're not using
> SolrCloud, you can't use CloudSolrServer.
>
> You would be in charge of creating the shards parameter yourself.  The way
> that I do this on my index is that I have a "broker" core that has no index
> of its own, but its solrconfig.xml has the shards and shards.qt parameters
> in all the request handler definitions.  You can also include the parameter
> with the query.
>
> You would also have to handle redundancy yourself, either with replication
> or with independently updated indexes.  I use the latter method, because it
> offers a lot more flexibility than replication.
>
> As mentioned in another reply, setting up RAID with a lot of disks may be
> better than trying to split your index up on different filesystems that
> each reside on different disks.  I would recommend RAID10 for Solr, and it
> works best if it's hardware RAID and the controller has battery-backed (or
> NVRAM) cache.
>
> Thanks,
> Shawn
>
>


Re: Error while importing HBase data to Solr using the DataImportHandler

2013-09-11 Thread ppatel
Hi,

Can you provide me an example of data-config.xml? Because with my HBase
configuration, I am getting:
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodError:
org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;

AND

Exception while processing: item document :
SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute SCANNER: [tableName=Item, startRow=null, stopRow=null,
columns=[{Item|r}, {Item|m}, {Item|u}]] Processing Document # 1

My data-config.xml:

[the XML configuration was stripped from the archived message]

Please respond ASAP.

Thanks in advance!!






Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Shawn Heisey

On 9/11/2013 1:07 PM, Deepak Konidena wrote:

Are you suggesting a multi-core setup, where all the cores share the same
schema, and the cores lie on different disks?

Basically, I'd like to know if I can distribute shards/segments on a single
machine (with multiple disks) without the use of zookeeper.


Sure, you can do it all manually.  At that point you would not be using 
SolrCloud at all, because the way to enable SolrCloud is to tell Solr 
where zookeeper lives.


Without SolrCloud, there is no cluster automation at all.  There is no 
"collection" paradigm, you just have cores.  You have to send updates to 
the correct core; they will not be redirected for you.  Similarly, queries 
will not be load balanced automatically.  For Java clients, the 
CloudSolrServer object can work seamlessly when servers go down.  If 
you're not using SolrCloud, you can't use CloudSolrServer.


You would be in charge of creating the shards parameter yourself.  The 
way that I do this on my index is that I have a "broker" core that has 
no index of its own, but its solrconfig.xml has the shards and shards.qt 
parameters in all the request handler definitions.  You can also include 
the parameter with the query.
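
A minimal SolrJ sketch of that second option, passing the shards parameter
with the query (host names and core names are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NonCloudDistributedSearch {
    public static void main(String[] args) throws Exception {
        // The "broker" core receives the request and fans it out to the listed shards.
        HttpSolrServer broker = new HttpSolrServer("http://host1:8983/solr/broker");
        SolrQuery q = new SolrQuery("*:*");
        q.set("shards", "host1:8983/solr/core_a,host2:8983/solr/core_b");
        QueryResponse rsp = broker.query(q);
        System.out.println(rsp.getResults().getNumFound());
        broker.shutdown();
    }
}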


You would also have to handle redundancy yourself, either with 
replication or with independently updated indexes.  I use the latter 
method, because it offers a lot more flexibility than replication.


As mentioned in another reply, setting up RAID with a lot of disks may 
be better than trying to split your index up on different filesystems 
that each reside on different disks.  I would recommend RAID10 for Solr, 
and it works best if it's hardware RAID and the controller has 
battery-backed (or NVRAM) cache.


Thanks,
Shawn



Do I need to delete my index?

2013-09-11 Thread Brian Robinson

Hello,
I'm in the process of creating my index using a series of 
SolrClient::request commands in PHP. I ran into a problem when some of 
the fields that I had as "text_general" fieldType contained "&load=" in 
a URL, triggering an error because the HTML entity "load" wasn't 
recognized. I realized that I should have made my URL fields of type 
"string" instead, so that they would be taken as is (they're not being 
indexed, just stored), so I removed all docs from my index, updated 
schema.xml, and restarted Solr, but I'm still getting the same error. Do 
I need to delete the index itself and then restart to get this to work? 
Am I correct that changing those fields to "string" type should fix the 
issue?

Thanks,
Brian



RE: Distributing lucene segments across multiple disks.

2013-09-11 Thread Greg Walters
Deepak,

Sorry for not being more verbose in my previous suggestion. As I take your 
question, you'd like to spread your index files across multiple disks (for 
performance or space reasons I assume). If you used even a basic md-raid setup 
you could then format the raid device and thus your entire set of disks with 
your favorite filesystem, mount it in one directory in the directory tree then 
configure it as the data directory that solr uses.

This setup would accomplish your goal of having the lucene indexes spread 
across multiple disks without the complexity of using multiple solr 
cores/collections.

Thanks,
Greg


-Original Message-
From: Deepak Konidena [mailto:deepakk...@gmail.com] 
Sent: Wednesday, September 11, 2013 2:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Distributing lucene segments across multiple disks.

@Greg - Are you suggesting RAID as a replacement for Solr or making Solr work 
with RAID? Could you elaborate more on the latter, if that's what you meant?
We make use of solr's advanced text processing features which would be hard to 
replicate just using RAID.


-Deepak



On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters  wrote:

> Why not use some form of RAID for your index store? You'd get the 
> performance benefit of multiple disks without the complexity of 
> managing them via solr.
>
> Thanks,
> Greg
>
>
>
> -Original Message-
> From: Deepak Konidena [mailto:deepakk...@gmail.com]
> Sent: Wednesday, September 11, 2013 2:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Distributing lucene segments across multiple disks.
>
> Are you suggesting a multi-core setup, where all the cores share the 
> same schema, and the cores lie on different disks?
>
> Basically, I'd like to know if I can distribute shards/segments on a 
> single machine (with multiple disks) without the use of zookeeper.
>
>
>
>
>
> -Deepak
>
>
>
> On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:
>
> > I think you'll find it hard to distribute different segments between 
> > disks, as they are typically stored in the same directory.
> >
> > However, instantiating separate cores on different disks should be 
> > straight-forward enough, and would give you a performance benefit.
> >
> > I've certainly heard of that done at Amazon, with a separate EBS 
> > volume per core giving some performance improvement.
> >
> > Upayavira
> >
> > On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > > Hi,
> > >
> > > I know that SolrCloud allows you to have multiple shards on 
> > > different machines (or a single machine). But it requires a 
> > > zookeeper installation for doing things like leader election, 
> > > leader availability, etc
> > >
> > > While SolrCloud may be the ideal solution for my usecase 
> > > eventually, I'd like to know if there's a way I can point my Solr 
> > > instance to read lucene segments distributed across different 
> > > disks attached to
> the same machine.
> > >
> > > Thanks!
> > >
> > > -Deepak
> >
>


RE: Distributing lucene segments across multiple disks.

2013-09-11 Thread Greg Walters
Why not use some form of RAID for your index store? You'd get the performance 
benefit of multiple disks without the complexity of managing them via solr.

Thanks,
Greg



-Original Message-
From: Deepak Konidena [mailto:deepakk...@gmail.com] 
Sent: Wednesday, September 11, 2013 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Distributing lucene segments across multiple disks.

Are you suggesting a multi-core setup, where all the cores share the same 
schema, and the cores lie on different disks?

Basically, I'd like to know if I can distribute shards/segments on a single 
machine (with multiple disks) without the use of zookeeper.





-Deepak



On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:

> I think you'll find it hard to distribute different segments between 
> disks, as they are typically stored in the same directory.
>
> However, instantiating separate cores on different disks should be 
> straight-forward enough, and would give you a performance benefit.
>
> I've certainly heard of that done at Amazon, with a separate EBS 
> volume per core giving some performance improvement.
>
> Upayavira
>
> On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > Hi,
> >
> > I know that SolrCloud allows you to have multiple shards on 
> > different machines (or a single machine). But it requires a 
> > zookeeper installation for doing things like leader election, leader 
> > availability, etc
> >
> > While SolrCloud may be the ideal solution for my usecase eventually, 
> > I'd like to know if there's a way I can point my Solr instance to 
> > read lucene segments distributed across different disks attached to the 
> > same machine.
> >
> > Thanks!
> >
> > -Deepak
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Are you suggesting a multi-core setup, where all the cores share the same
schema, and the cores lie on different disks?

Basically, I'd like to know if I can distribute shards/segments on a single
machine (with multiple disks) without the use of zookeeper.





-Deepak



On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:

> I think you'll find it hard to distribute different segments between
> disks, as they are typically stored in the same directory.
>
> However, instantiating separate cores on different disks should be
> straight-forward enough, and would give you a performance benefit.
>
> I've certainly heard of that done at Amazon, with a separate EBS volume
> per core giving some performance improvement.
>
> Upayavira
>
> On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > Hi,
> >
> > I know that SolrCloud allows you to have multiple shards on different
> > machines (or a single machine). But it requires a zookeeper installation
> > for doing things like leader election, leader availability, etc
> >
> > While SolrCloud may be the ideal solution for my usecase eventually, I'd
> > like to know if there's a way I can point my Solr instance to read lucene
> > segments distributed across different disks attached to the same machine.
> >
> > Thanks!
> >
> > -Deepak
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Are you suggesting RAID as a replacement for Solr or making Solr
work with RAID? Could you elaborate more on the latter, if that's what you
meant?
We make use of solr's advanced text processing features which would be hard
to replicate just using RAID.


-Deepak



On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters  wrote:

> Why not use some form of RAID for your index store? You'd get the
> performance benefit of multiple disks without the complexity of managing
> them via solr.
>
> Thanks,
> Greg
>
>
>
> -Original Message-
> From: Deepak Konidena [mailto:deepakk...@gmail.com]
> Sent: Wednesday, September 11, 2013 2:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Distributing lucene segments across multiple disks.
>
> Are you suggesting a multi-core setup, where all the cores share the same
> schema, and the cores lie on different disks?
>
> Basically, I'd like to know if I can distribute shards/segments on a
> single machine (with multiple disks) without the use of zookeeper.
>
>
>
>
>
> -Deepak
>
>
>
> On Wed, Sep 11, 2013 at 11:55 AM, Upayavira  wrote:
>
> > I think you'll find it hard to distribute different segments between
> > disks, as they are typically stored in the same directory.
> >
> > However, instantiating separate cores on different disks should be
> > straight-forward enough, and would give you a performance benefit.
> >
> > I've certainly heard of that done at Amazon, with a separate EBS
> > volume per core giving some performance improvement.
> >
> > Upayavira
> >
> > On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> > > Hi,
> > >
> > > I know that SolrCloud allows you to have multiple shards on
> > > different machines (or a single machine). But it requires a
> > > zookeeper installation for doing things like leader election, leader
> > > availability, etc
> > >
> > > While SolrCloud may be the ideal solution for my usecase eventually,
> > > I'd like to know if there's a way I can point my Solr instance to
> > > read lucene segments distributed across different disks attached to
> the same machine.
> > >
> > > Thanks!
> > >
> > > -Deepak
> >
>


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Upayavira
I think you'll find it hard to distribute different segments between
disks, as they are typically stored in the same directory.

However, instantiating separate cores on different disks should be
straight-forward enough, and would give you a performance benefit.

I've certainly heard of that done at Amazon, with a separate EBS volume
per core giving some performance improvement.
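As a rough sketch in the legacy solr.xml format (core names and paths are
assumptions), each core can be given its own dataDir on a separate mount:

  <cores adminPath="/admin/cores">
    <core name="core1" instanceDir="core1" dataDir="/mnt/disk1/solr/core1/data"/>
    <core name="core2" instanceDir="core2" dataDir="/mnt/disk2/solr/core2/data"/>
  </cores>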

Upayavira

On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
> Hi,
> 
> I know that SolrCloud allows you to have multiple shards on different
> machines (or a single machine). But it requires a zookeeper installation
> for doing things like leader election, leader availability, etc
> 
> While SolrCloud may be the ideal solution for my usecase eventually, I'd
> like to know if there's a way I can point my Solr instance to read lucene
> segments distributed across different disks attached to the same machine.
> 
> Thanks!
> 
> -Deepak


Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Hi,

I know that SolrCloud allows you to have multiple shards on different
machines (or a single machine). But it requires a zookeeper installation
for doing things like leader election, leader availability, etc

While SolrCloud may be the ideal solution for my usecase eventually, I'd
like to know if there's a way I can point my Solr instance to read lucene
segments distributed across different disks attached to the same machine.

Thanks!

-Deepak


Re: Higher Memory Usage with solr 4.4

2013-09-11 Thread Shawn Heisey

On 9/11/2013 8:54 AM, Kuchekar wrote:

  We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the
Solr with 40GB but we noticed that the QTime is way high compared to
similar on 3.5 solr.
Both the 3.5 and 4.4 solr's configurations and schema are similarly
constructed. Also during the triage we found the physical memory to be
utilized at 95..%.


A 40GB heap is *huge*.  Unless you are dealing with millions of 
super-large documents or many many millions of smaller documents, there 
should be no need for a heap that large.  Additionally, if you are 
allocating most of your system memory to Java, then you will have little 
or no RAM available for OS disk caching, which will cause major 
performance issues.


For most indexes, memory usage should be less after an upgrade, but 
there are exceptions.


I see that you had an earlier question about stored field compression, 
and that you talked about exporting data from your 3.5 install to index 
into 4.4, in which you had stored every field, including copyFields.


If you have a lot of stored data, memory usage for decompression can 
become a problem.  It's usually a lot better to store minimal 
information, just enough to display a result grid/list, and some ID 
information so that when someone clicks on an individual result, you can 
retrieve the entire record from another data source, like a database or 
a filesystem.
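A hedged schema.xml sketch of that idea (field names are made up for
illustration): keep the id and a short display field stored, and leave the big
searchable text unstored:

  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="contents" type="text_general" indexed="true" stored="false"/>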


Here's a more exhaustive list of potential performance and memory 
problems with Solr:


http://wiki.apache.org/solr/SolrPerformanceProblems

OpenJDK may be problematic, especially if it's version 6.  With Java 7, 
OpenJDK is actually the reference implementation, so if you are using 
OpenJDK 7, I would be less concerned.  With either version, Oracle Java 
tends to produce better results.


Thanks,
Shawn



Re: synonyms not working

2013-09-11 Thread cheops
thanx for your help. i was able to solve the problem in the meantime!
i used 

...which is wrong, it must be






--
View this message in context: 
http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318p4089345.html
Sent from the Solr - User mailing list archive at Nabble.com.


synonyms not working

2013-09-11 Thread cheops
Hi,
I'm using solr4.4 and try to use different synonyms based on different
fieldtypes:


  



  
  




  



...I have the same fieldtype for english (name="text_general_en" and
synonyms="synonyms_en.txt").
The first fieldtype works fine, my synonyms are processed and the result is
as expected. But the "en"-version doesn't seem to work. I'm able to find the
original english words but the synonyms are not processed.
ps: yes, i know using synonyms at query time is not a good idea :-) ... but
can't change it here

Any help would be appreciated!

Thank you.

Best regards
Marcus



--
View this message in context: 
http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: synonyms not working

2013-09-11 Thread Erick Erickson
Attach &debug=query to your URL and inspect the parsed
query, you should be seeing the substitutions if you're
configured correctly. Multi-word synonyms at query time
have the "getting through the query parser" problem.

Best
Erick


On Wed, Sep 11, 2013 at 11:04 AM, cheops  wrote:

> Hi,
> I'm using solr4.4 and try to use different synonyms based on different
> fieldtypes:
>
>  positionIncrementGap="100">
>   
> 
>  words="stopwords.txt" />
> 
>   
>   
> 
>  words="stopwords.txt" />
>  ignoreCase="true" expand="true"/>
> 
>   
> 
>
>
> ...I have the same fieldtype for english (name="text_general_en" and
> synonyms="synonyms_en.txt").
> The first fieldtype works fine, my synonyms are processed and the result is
> as expected. But the "en"-version doesn't seem to work. I'm able to find
> the
> original english words but the synonyms are not processed.
> ps: yes, i know using synonyms at query time is not a good idea :-) ... but
> can't change it here
>
> Any help would be appreciated!
>
> Thank you.
>
> Best regards
> Marcus
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Higher Memory Usage with solr 4.4

2013-09-11 Thread Erick Erickson
There are some defaults (sorry, don't have them listed) that are
somewhat different. If you took your 3.5 and just used it for
4.x, it's probably worth going back over it and start with the 4.x
example and add in any customizations you did for 3.5...

But in general, the memory usage for 4.x should be much smaller
than for 3.5, there were some _major_ improvements in that area.
So I'm guessing you've moved over some innocent-seeming config..

FWIW,
Erick


On Wed, Sep 11, 2013 at 10:54 AM, Kuchekar wrote:

> Hi,
>
>  We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the
> Solr with 40GB but we noticed that the QTime is way high compared to
> similar on 3.5 solr.
> Both the 3.5 and 4.4 solr's configurations and schema are similarly
> constructed. Also during the triage we found the physical memory to be
> utilized at 95..%.
>
> Is there any configuration we might be missing.
>
> Looking forward for your reply.
>
> Thanks.
> Kuchekar, Nilesh
>


Re: Error with Solr 4.4.0, Glassfish, and CentOS 6.2

2013-09-11 Thread Shawn Heisey

On 9/10/2013 9:18 PM, vhoangvu wrote:

Yesterday, I just install latest version of Solr 4.4.0 on Glassfish and
CentOS 6.2 and got an error when try to access the administration page. I
have checked this version on Mac OS one month ago, it works well. So, please
help me clarify what problem.





[#|2013-09-10T18:31:36.896+|INFO|oracle-glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=1;_ThreadName=Thread-2;|2907
[main] ERROR org.apache.solr.core.SolrCore  ?
null:org.apache.solr.common.SolrException: Error instantiating
shardHandlerFactory class [HttpShardHandlerFactory]: Failure initializing
default system SSL context


This is a container problem.  It can't initialize SSL.  The most common 
reason is that the java keystore has a password and it hasn't been 
provided.  If that's the problem, here's one solution:


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E
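If the keystore password does turn out to be the issue, one thing to try is
handing it to the JVM explicitly via the standard JSSE system properties, e.g.
(the path and password values here are assumptions about your setup):

  -Djavax.net.ssl.keyStore=/path/to/keystore.jks
  -Djavax.net.ssl.keyStorePassword=changeit

These go into the Glassfish JVM options rather than into Solr itself.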

Another solution, especially if you aren't going to be hosting SSL in 
Java containers at all on that machine, is to get rid of the keystore 
entirely.  If that doesn't do it, you'll need to get help from a 
Glassfish support avenue.


Thanks,
Shawn



Higher Memory Usage with solr 4.4

2013-09-11 Thread Kuchekar
Hi,

 We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started
Solr with a 40GB heap, but we noticed that the QTime is much higher compared to
a similar setup on Solr 3.5.
Both the 3.5 and 4.4 Solr configurations and schemas are constructed similarly.
Also, during the triage we found physical memory to be utilized at 95%.

Is there any configuration we might be missing?

Looking forward for your reply.

Thanks.
Kuchekar, Nilesh


Re: Dynamic analizer settings change

2013-09-11 Thread Erick Erickson
You're still in danger of overly-broad hits. When you
try stemming differently into the _same_ underlying
field you get things that make sense in one language
but are totally bogus in another language matching
the query.

As far as lots and lots of fields is concerned, if you
want to restrict your searches to only one language
you have a couple of choices here

Consider a different core per language. Solr easily
handles many cores/server. Now you have no
'wasted' space, it just happens that the stemmer for
the core uses the DE-specific stemmers. Which
you can extend to German de-compounding etc.

Alternatively, you can form your queries with some
care. There's nothing that requires, say, edismax to
be specified in solrconfig.xml. Anything you would
put in the defaults section of the config you can
override on the command line. So, for instance,
if you knew you were querying in French, you could
form something like (going from memory)
defType=edismax&qf=title_fr,text_fr
or
&qf=title_de,text_de

and so completely avoid cross-languge searching.

Or you could simply include a field that has the
language and tack on an fq clause like fq=de.
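For example (the field name is an assumption), with a stored language code per
document:

  <field name="language" type="string" indexed="true" stored="true"/>

a German-only search then just adds:

  &fq=language:de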

But you haven't told us how big your problem is. I wouldn't
worry at all about efficiency at this stage if you have, say,
10M documents, I'd just try the simplest thing first and
measure.

500M documents is probably another story.

FWIW
Erick


On Wed, Sep 11, 2013 at 9:50 AM, maephisto  wrote:

> Thanks Jack! Indeed, very nice examples in your book.
>
> Inspired from there, here's a crazy idea: would it be possible to build a
> custom processor chain that would detect the language and use it to apply
> filters, like the aforementioned SnowballPorterFilter.
> That would leave at the end a document having as fields: text(with filtered
> content) and language(the one determined by the processor).
> And at search time, always append the language=.
>
> Does this make sense? If so, would it affect the performance at index time?
> Thanks!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Facet values for spacial field

2013-09-11 Thread Erick Erickson
It seems like the right thing to do here is store something
more intelligible than an encoded lat/lon pair and facet on
that instead. Lat/lon values, even bare, are not all that useful
without some effort anyway...

FWIW,
Erick


On Wed, Sep 11, 2013 at 9:24 AM, Köhler Christian  wrote:

> Hi Eric (and others),
>
> thanx for the the explanation. This helps.
>
> For the usecase: I am cataloging findings of field expeditions. The
> collectors usualy store a single location for the field trip, so the numer
> of locations is limited.
>
> Regards
> Chris
> 
> Von: Erick Erickson [erickerick...@gmail.com]
> Gesendet: Dienstag, 10. September 2013 19:14
> Bis: solr-user@lucene.apache.org
> Betreff: Re: Facet values for spacial field
>
> You might be able to facet by query, but faceting by
> location fields doesn't make a huge amount of sense,
> you'll have lots of facets on individual lat/lon points.
>
> What is the use-case you are trying to support here?
>
> Best,
> Erick
>
>
> On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK
> wrote:
>
> > Hi,
> >
> > I use the new SpatialRecursivePrefixTreeFieldType field to store geo
> > coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates
> > just find so I am sure they are indexed correctly.
> >
> > However when I try to create facets from this field, solr returns
> > something which looks like a hash of the coordinates:
> >
> > Schema:
> >
> > 
> > 
> >   
> >...
>  >class="solr.SpatialRecursivePrefixTreeFieldType"
> >units="degrees" />
> > ...
> >>  stored="true"  />
> > 
> >
> > Result:
> > http://localhost/solr/browse?facet=true&facet.field=geo_locality ->
> > ...
> > 
> >  
> >   660
> >   290
> >   214
> >   179
> >   165
> >   143
> >...
> >  
> > 
> >
> > Filtering by this hashes fails:
> > http://localhost/solr/browse?&q=&fq=geo_locality:"t4m70cmvej9"
> > java.lang.IllegalArgumentException: missing parens: t4m70cmvej9
> >
> > How do I get the results of a single location using faceting?
> > Any thoughts?
> >
> > Regards
> > Chris
> >
> > --
> > Christian Köhler
> >
> > Zoologisches Forschungsmuseum Alexander Koenig
> > Leibniz-Institut für Biodiversität der Tiere
> > Adenauerallee 160, 53113 Bonn, Germany
> > www.zfmk.de
> >
> > Stiftung des öffentlichen Rechts
> > Direktor: Prof. J. Wolfgang Wägele
> > Sitz: Bonn
> > --
> > Zoologisches Forschungsmuseum Alexander Koenig
> > - Leibniz-Institut für Biodiversität der Tiere -
> > Adenauerallee 160, 53113 Bonn, Germany
> > www.zfmk.de
> >
> > Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele
> > Sitz: Bonn
> >
> --
> Zoologisches Forschungsmuseum Alexander Koenig
> - Leibniz-Institut für Biodiversität der Tiere -
> Adenauerallee 160, 53113 Bonn, Germany
> www.zfmk.de
>
> Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele
> Sitz: Bonn
>


Re: solrj-httpclient-slow

2013-09-11 Thread Erick Erickson
First, I would be wary of mixing the solrj version
with a different solr version. They are pretty compatible
but what are you expecting to gain for the risk?
Regardless, though, that shouldn't be your problem.

You'll have to give us a lot more detail about what
you're trying to do, what you mean by slow (300ms?
300 secnds?) and what you expect.

Best
Erick


On Wed, Sep 11, 2013 at 7:44 AM, xiaoqi  wrote:

>
> hi,everyone
>
> when i track my solr client  timing cost , i find one problem :
>
> some time the whole execute time is very long ,when i go to detail ,i find
> the solr server execute short time  , then the main costs inside httpclient
> (make a connection ,send request or recived  response ,blablabla.
>
> i am not familar httpclient inside code . does anyone met the same problem
> ?
>
> although , i update solrj 's new version ,the problem still.
>
> by the way : my solrj version is : 4.2 ,solr is 3.*
>
>
> Thanks a lot
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamic analizer settings change

2013-09-11 Thread maephisto
Thanks Jack! Indeed, very nice examples in your book.

Inspired from there, here's a crazy idea: would it be possible to build a
custom processor chain that would detect the language and use it to apply
filters, like the aforementioned SnowballPorterFilter.
That would leave at the end a document having as fields: text(with filtered
content) and language(the one determined by the processor).
And at search time, always append the language=.

Does this make sense? If so, would it affect the performance at index time?
Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html
Sent from the Solr - User mailing list archive at Nabble.com.


AW: Facet values for spacial field

2013-09-11 Thread Köhler Christian
Hi Eric (and others),

thanx for the explanation. This helps.

For the use case: I am cataloging findings of field expeditions. The collectors 
usually store a single location for the field trip, so the number of locations is 
limited.

Regards
Chris

Von: Erick Erickson [erickerick...@gmail.com]
Gesendet: Dienstag, 10. September 2013 19:14
Bis: solr-user@lucene.apache.org
Betreff: Re: Facet values for spacial field

You might be able to facet by query, but faceting by
location fields doesn't make a huge amount of sense,
you'll have lots of facets on individual lat/lon points.

What is the use-case you are trying to support here?

Best,
Erick


On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK
wrote:

> Hi,
>
> I use the new SpatialRecursivePrefixTreeFieldType field to store geo
> coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates
> just find so I am sure they are indexed correctly.
>
> However when I try to create facets from this field, solr returns
> something which looks like a hash of the coordinates:
>
> Schema:
>
> 
> 
>   
>...
> class="solr.SpatialRecursivePrefixTreeFieldType"
>units="degrees" />
> ...
> stored="true"  />
> 
>
> Result:
> http://localhost/solr/browse?facet=true&facet.field=geo_locality ->
> ...
> 
>  
>   660
>   290
>   214
>   179
>   165
>   143
>...
>  
> 
>
> Filtering by this hashes fails:
> http://localhost/solr/browse?&q=&fq=geo_locality:"t4m70cmvej9"
> java.lang.IllegalArgumentException: missing parens: t4m70cmvej9
>
> How do I get the results of a single location using faceting?
> Any thoughts?
>
> Regards
> Chris
>
> --
> Christian Köhler
>
> Zoologisches Forschungsmuseum Alexander Koenig
> Leibniz-Institut für Biodiversität der Tiere
> Adenauerallee 160, 53113 Bonn, Germany
> www.zfmk.de
>
> Stiftung des öffentlichen Rechts
> Direktor: Prof. J. Wolfgang Wägele
> Sitz: Bonn
> --
> Zoologisches Forschungsmuseum Alexander Koenig
> - Leibniz-Institut für Biodiversität der Tiere -
> Adenauerallee 160, 53113 Bonn, Germany
> www.zfmk.de
>
> Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele
> Sitz: Bonn
>
--
Zoologisches Forschungsmuseum Alexander Koenig
- Leibniz-Institut für Biodiversität der Tiere -
Adenauerallee 160, 53113 Bonn, Germany
www.zfmk.de

Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele
Sitz: Bonn


Re: No or limited use of FieldCache

2013-09-11 Thread Michael Sokolov

On 09/11/2013 08:40 AM, Per Steffensen wrote:
The reason I mention sort is that we in my project, half a year ago, 
have dealt with the FieldCache->OOM-problem when doing sort-requests. 
We basically just reject sort-requests unless they hit below X 
documents - in case they do we just find them without sorting and sort 
them ourselves afterwards.


Currently our problem is, that we have to do a group/distinct (in 
SQL-language) query and we have found that we can do what we want to 
do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet 
- either will work for us. Problem is that they both use FieldCache 
and we "know" that using FieldCache will lead to OOM-execptions with 
the amount of data each of our Solr-nodes administrate. This time we 
have really no option of just "limit" usage as we did with sort. 
Therefore we need a group/distinct-functionality that works even on 
huge data-amounts (and a algorithm using FieldCache will not)


I believe setting facet.method=enum will actually make facet not use 
the FieldCache. Is that true? Is it a bad idea?


I do not know much about DocValues, but I do not believe that you will 
avoid FieldCache by using DocValues? Please elaborate, or point to 
documentation where I will be able to read that I am wrong. Thanks!
There is Simon Willnauer's presentation 
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene


and this blog post 
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/


and this one that shows some performance comparisons: 
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/






charset encoding

2013-09-11 Thread Andreas Owen
i'm using solr 4.3.1 with tika to index html-pages. the html files are 
iso-8859-1 (ansi) encoded and the meta tag "content-encoding" as well. the 
server-http-header says it's utf8 and firefox-webdeveloper agrees. 

when i index a page with special chars like ä,ö,ü solr outputs it completly 
foreign signs, not the normal wrong chars with 1/4 or the Flag in it. so it 
seams that its not simply the normal utf8/iso-8859-1 discrepancy. has anyone 
got a idea whats wrong?



Re: Dynamic analizer settings change

2013-09-11 Thread Jack Krupansky
Yes, supporting multiple languages will be a performance hit, but maybe it 
won't be so bad since all but one of these language-specific fields will be 
empty for each document and Lucene text search should handle empty field 
values just fine. If you can't accept that performance hit, don't support 
multiple languages! It is completely your choice.


There are index-time update processors that can do language detection and 
then automatically direct the text to the proper text_xx field.


See:
https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing
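As a rough sketch (the field names and language whitelist are assumptions), a
chain using the bundled language identifier looks like this in solrconfig.xml:

  <updateRequestProcessorChain name="langid">
    <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">text</str>
      <str name="langid.langField">language</str>
      <bool name="langid.map">true</bool>
      <str name="langid.whitelist">en,fr,de</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

With langid.map set to true the text is copied into text_en, text_fr, etc.
based on the detected language; the chain still has to be referenced from your
update handler via the update.chain parameter.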

Although my e-book has a lot better examples, especially for the field 
redirection aspect.


-- Jack Krupansky

-Original Message- 
From: maephisto

Sent: Wednesday, September 11, 2013 8:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Dynamic analizer settings change

Thanks, Erik!

I might have missed mentioning something relevant. When querying Solr, I
wouldn't actually need to query all fields, but only the one corresponding
to the language picked by the user on the website. If he's using DE, then
the search should only apply to the text_de field.

What if I need to work with 50 different languages?
Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
text_de, ...): won't this affect the performance ? bigger documents ->
slower queries.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
The reason I mention sort is that we in my project, half a year ago, 
have dealt with the FieldCache->OOM-problem when doing sort-requests. We 
basically just reject sort-requests unless they hit below X documents - 
in case they do we just find them without sorting and sort them 
ourselves afterwards.


Currently our problem is, that we have to do a group/distinct (in 
SQL-language) query and we have found that we can do what we want to do 
using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - 
either will work for us. Problem is that they both use FieldCache and we 
"know" that using FieldCache will lead to OOM-execptions with the amount 
of data each of our Solr-nodes administrate. This time we have really no 
option of just "limit" usage as we did with sort. Therefore we need a 
group/distinct-functionality that works even on huge data-amounts (and a 
algorithm using FieldCache will not)


I believe setting facet.method=enum will actually make facet not use the 
FieldCache. Is that true? Is it a bad idea?
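For reference, the request form would be something like this (the field name is
an assumption):

  &facet=true&facet.field=category&facet.method=enum

or per field as f.category.facet.method=enum. As far as I understand, the enum
method iterates the field's terms and intersects with the filterCache instead
of building a FieldCache entry, so it tends to trade query speed for memory on
high-cardinality fields.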


I do not know much about DocValues, but I do not believe that you will 
avoid FieldCache by using DocValues? Please elaborate, or point to 
documentation where I will be able to read that I am wrong. Thanks!


Regards, Per Steffensen

On 9/11/13 1:38 PM, Erick Erickson wrote:

I don't know any more than Michael, but I'd _love_ some reports from the
field.

There are some restriction on DocValues though, I believe one of them
is that they don't really work on analyzed data

FWIW,
Erick




RE: Dynamic analizer settings change

2013-09-11 Thread Markus Jelsma


 
 
-Original message-
> From:maephisto 
> Sent: Wednesday 11th September 2013 14:34
> To: solr-user@lucene.apache.org
> Subject: Re: Dynamic analizer settings change
> 
> Thanks, Erik!
> 
> I might have missed mentioning something relevant. When querying Solr, I
> wouldn't actually need to query all fields, but only the one corresponding
> to the language picked by the user on the website. If he's using DE, then
> the search should only apply to the text_de field.
> 
> What if I need to work with 50 different languages?
> Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
> text_de, ...): won't this affect the performance ? bigger documents ->
> slower queries.

Yes, that will affect performance greatly! The problem is not searching 50 
languages but when using (e)dismax, the problem is creating the entire query.  
You will see good performance in the `process` part of a search but poor 
performance in the `prepare` part of the search when debugging.

> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Profiling Solr Lucene for query

2013-09-11 Thread Manuel Le Normand
Dmitry - currently we don't have such a front end; creating one sounds like a good
idea. And yes, we do query all 36 shards on every query.

Mikhail - I do think 1 minute is enough data, as during this exact minute I
had a single query running (that took a qtime of 1 minute). I wanted to
isolate these hard queries. I repeated this profiling a few times.

I think I will take the termInterval from 128 to 32 and check the results.
I'm currently using NRTCachingDirectoryFactory




On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan  wrote:

> Hi Manuel,
>
> The frontend solr instance is the one that does not have its own index and
> is doing merging of the results. Is this the case? If yes, are all 36
> shards always queried?
>
> Dmitry
>
>
> On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand <
> manuel.lenorm...@gmail.com> wrote:
>
> > Hi Dmitry,
> >
> > I have solr 4.3 and every query is distributed and merged back for
> ranking
> > purpose.
> >
> > What do you mean by frontend solr?
> >
> >
> > On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan  wrote:
> >
> > > are you querying your shards via a frontend solr? We have noticed, that
> > > querying becomes much faster if results merging can be avoided.
> > >
> > > Dmitry
> > >
> > >
> > > On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand <
> > > manuel.lenorm...@gmail.com> wrote:
> > >
> > > > Hello all
> > > > Looking on the 10% slowest queries, I get very bad performances (~60
> > sec
> > > > per query).
> > > > These queries have lots of conditions on my main field (more than a
> > > > hundred), including phrase queries and rows=1000. I do return only
> id's
> > > > though.
> > > > I can quite firmly say that this bad performance is due to slow
> storage
> > > > issue (that are beyond my control for now). Despite this I want to
> > > improve
> > > > my performances.
> > > >
> > > > As tought in school, I started profiling these queries and the data
> of
> > ~1
> > > > minute profile is located here:
> > > > http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg
> > > >
> > > > Main observation: most of the time I do wait for readVInt, who's
> > > stacktrace
> > > > (2 out of 2 thread dumps) is:
> > > >
> > > > catalina-exec-3870 - Thread t@6615
> > > >  java.lang.Thread.State: RUNNABLE
> > > >  at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java:
> > > > 2357)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745)
> > > >  at org.apache.lucene.index.TermContext.build(TermContext.java:95)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.search.PhraseQuery$PhraseWeight.<init>(PhraseQuery.java:221)
> > > >  at
> > > org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
> > > >  at
> > > >
> > org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
> > > >  at
> > > >
> org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:183)
> > > >  at
> > > >
> > org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384)
> > > >  at
> > > >
> > > >
> > >
> >
> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675)
> > > >  at
> > org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
> > > >
> > > >
> > > > So I do actually wait for IO as expected, but I might be too many
> time
> > > page
> > > > faulting while looking for the TermBlocks (tim file), ie locating the
> > > term.
> > > > As I reindex now, would it be useful lowering down the termInterval
> > > > (default to 128)? As the FST (tip files) are that small (few 10-100
> MB)
> > > so
> > > > there are no memory contentions, could I lower down this param to 8
> for
> > > > example? The benefit from lowering down the term interval would be to
> > > > obligate the FST to get on memory (JVM - thanks to the
> > > NRTCachingDirectory)
> > > > as I do not control the term dictionary file (OS caching, loads an
> > > average
> > > > of 6% of it).
> > > >
> > > >
> > > > General configs:
> > > > solr 4.3
> > > > 36 shards, each has few million docs
> > > > These 36 servers (each server has 2 replicas) are running virtual,
> 16GB
> > > > memory each (4GB for JVM, 12GB remain for the OS caching),  consuming
> > > 260GB
> > > > of disk mounted for the index files.
> > > >
> > >
> >
>


Re: Dynamic analizer settings change

2013-09-11 Thread maephisto
Thanks, Erik!

I might have missed mentioning something relevant. When querying Solr, I
wouldn't actually need to query all fields, but only the one corresponding
to the language picked by the user on the website. If he's using DE, then
the search should only apply to the text_de field.

What if I need to work with 50 different languages?
Then I would get a schema with 50 types and 50 fields (text_en, text_fr,
text_de, ...): won't this affect the performance ? bigger documents ->
slower queries.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html
Sent from the Solr - User mailing list archive at Nabble.com.


solrj-httpclient-slow

2013-09-11 Thread xiaoqi

hi everyone,

when i track my solr client timing cost, i find one problem:

sometimes the whole execution time is very long, but when i look at the details i find
the solr server itself executes in a short time, so the main cost is inside httpclient
(making the connection, sending the request or receiving the response, etc.).

i am not familiar with the httpclient internals. has anyone met the same problem?

although i updated to the newer solrj version, the problem is still there.

by the way: my solrj version is 4.2, solr is 3.*


Thanks a lot 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Stemming and protwords configuration

2013-09-11 Thread Erick Erickson
Did you try putting them _all_ in protwords.txt? i.e.
frais, fraise, fraises?

Don't forget to re-index.

An alternative is to index in a second field that doesn't have the
stemmer and when you want exact matches, search against that
field.
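A minimal sketch of that second-field approach (field and type names are
assumptions):

  <field name="text" type="text_fr" indexed="true" stored="true"/>
  <field name="text_exact" type="text_ws" indexed="true" stored="false"/>
  <copyField source="text" dest="text_exact"/>

where text_ws is a type without the SnowballPorterFilterFactory; exact-match
queries then go against text_exact.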

Best
Erick


On Mon, Sep 9, 2013 at 10:29 AM,  wrote:

> Hi,
>
> We have a Solr server using stemming:
>
>  protected="protwords.txt" />
>
> I would like to query the French words "frais" and "fraise" separately. I
> put the word "fraise" in protwords.txt file.
>
> - When I query the word "fraise", no document indexed with the word
> "frais" are found.
> - When I query the word "frais", I've got documents indexed with the word
> "fraise".
>
> Is there a way to do not match "fraises" documents in the second situation
> ?
>
> I hope this is clear. Thanks for your reply.
>
> Christophe
>
>
>
> _
>
> Ce message et ses pieces jointes peuvent contenir des informations
> confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez
> recu ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages
> electroniques etant susceptibles d'alteration,
> Orange decline toute responsabilite si ce message a ete altere, deforme ou
> falsifie. Merci.
>
> This message and its attachments may contain confidential or privileged
> information that may be protected by law;
> they should not be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and
> delete this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have been
> modified, changed or falsified.
> Thank you.
>
>


Re: No or limited use of FieldCache

2013-09-11 Thread Erick Erickson
I don't know any more than Michael, but I'd _love_ some reports from the
field.

There are some restrictions on DocValues though; I believe one of them
is that they don't really work on analyzed data.

FWIW,
Erick


On Wed, Sep 11, 2013 at 7:00 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> On 9/11/13 3:11 AM, Per Steffensen wrote:
>
>> Hi
>>
>> We have a SolrCloud setup handling huge amounts of data. When we do
>> group, facet or sort searches Solr will use its FieldCache, and add data in
>> it for every single document we have. For us it is not realistic that this
>> will ever fit in memory and we get OOM exceptions. Are there some way of
>> disabling the FieldCache (taking the performance penalty of course) or make
>> it behave in a nicer way where it only uses up to e.g. 80% of the memory
>> available to the JVM? Or other suggestions?
>>
>> Regards, Per Steffensen
>>
> I think you might want to look into using DocValues fields, which are
> column-stride fields stored as compressed arrays - one value per document
> -- for the fields on which you are sorting and faceting. My understanding
> (which is limited) is that these avoid the use of the field cache, and I
> believe you have the option to control whether they are held in memory or
> on disk.  I hope someone who knows more will elaborate...
>
> -Mike
>


Re: Dynamic analizer settings change

2013-09-11 Thread Erick Erickson
I wouldn't :). Here's the problem. Say you do this successfully at
index time. How do you then search reasonably? There's often
not nearly enough information to know what the search language is,
there's little or no context.

If the number of languages is limited, people often index into separate
language-specific fields, say title_fr and title_en and use edismax
to automatically distribute queries against all the fields.

Others index "families" of languages in separate fields using things
like the folding filters for Western languages, another field for, say,
CJK languages and another for Middle Eastern languages etc.

FWIW,
Erick


On Wed, Sep 11, 2013 at 6:55 AM, maephisto  wrote:

> Let's take the following type definition and schema (borrowed from Rafal
> Kuc's Solr 4 cookbook) :
> 
> 
> 
> 
> 
> 
> 
>
> and schema:
>
>  required="true" />
> 
>
> The above analizer will apply SnowballPorterFilter english language filter.
> But would it be possible to change the language to french during indexing
> for some documents. is this possible? If not, what would be the best
> solution for having the same analizer but with different languages, which
> languange being determined at index time ?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Regarding improving performance of the solr

2013-09-11 Thread Erick Erickson
Be a little careful when extrapolating from disk to memory.
Any fields where you've set stored="true" will put data in
segment files with extensions .fdt and .fdx (see the link below).
These are the compressed verbatim copies of the data
for stored fields and have very little impact on
memory required for searching. I've seen indexes where
75% of the data is stored and indexes where 5% of the
data is stored.

"Summary of File Extensions" here:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html

Best,
Erick


On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy wrote:

> @Shawn: Correctly I am trying to reduce the index size. I am working on
> reindex the solr with some of the features as indexed and not stored
>
> @Jean: I tried with  different caches. It did not show much improvement.
>
>
> On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey  wrote:
>
> > On 9/6/2013 2:54 AM, prabu palanisamy wrote:
> > > I am currently using solr -3.5.0,  indexed  wikipedia dump (50 gb) with
> > > java 1.6.
> > > I am searching the solr with text (which is actually twitter tweets) .
> > > Currently it takes average time of 210 millisecond for each post, out
> of
> > > which 200 millisecond is consumed by solr server (QTime).  I used the
> > > jconsole monitor tool.
> >
> > If the size of all your Solr indexes on disk is in the 50GB range of
> > your wikipedia dump, then for ideal performance, you'll want to have
> > 50GB of free memory so the OS can cache your index.  You might be able
> > to get by with 25-30GB of free memory, depending on your index
> composition.
> >
> > Note that this is memory over and above what you allocate to the Solr
> > JVM, and memory used by other processes on the machine.  If you do have
> > other services on the same machine, note that those programs might ALSO
> > require OS disk cache RAM.
> >
> > http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: SolrCloud 4.x hangs under high update volume

2013-09-11 Thread Erick Erickson
If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
copy of the 4x branch. By "recent", I mean like today, it looks like Mark
applied this early this morning. But several reports indicate that this will
solve your problem.
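For reference, a minimal SolrJ sketch of that (the ZooKeeper host string and
collection name are assumptions):

  // Talks to the cluster via the state in ZooKeeper rather than a load-balancer VIP
  CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
  server.setDefaultCollection("collection1");
  server.add(docs);   // docs is a Collection<SolrInputDocument>
  server.commit();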

I would expect that increasing the number of shards would make the problem
worse, not
better.

There's also SOLR-5232...

Best
Erick


On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt wrote:

> Hey guys,
>
> Based on my understanding of the problem we are encountering, I feel we've
> been able to reduce the likelihood of this issue by making the following
> changes to our app's usage of SolrCloud:
>
> 1) We increased our document batch size to 200 from 10 - our app batches
> updates to reduce HTTP requests/overhead. The theory is increasing the
> batch size reduces the likelihood of this issue happening.
> 2) We reduced to 1 application node sending updates to SolrCloud - we write
> Solr updates to Redis, and have previously had 4 application nodes pushing
> the updates to Solr (popping off the Redis queue). Reducing the number of
> nodes pushing to Solr reduces the concurrency on SolrCloud.
> 3) Less threads pushing to SolrCloud - due to the increase in batch size,
> we were able to go down to 5 update threads on the update-pushing-app (from
> 10 threads).
>
> To be clear the above only reduces the likelihood of the issue happening,
> and DOES NOT actually resolve the issue at hand.
>
> If we happen to encounter issues with the above 3 changes, the next steps
> (I could use some advice on) are:
>
> 1) Increase the number of shards (2x) - the theory here is this reduces the
> locking on shards because there are more shards. Am I onto something here,
> or will this not help at all?
> 2) Use CloudSolrServer - currently we have a plain-old least-connection
> HTTP VIP. If we go "direct" to what we need to update, this will reduce
> concurrency in SolrCloud a bit. Thoughts?
>
> Thanks all!
>
> Cheers,
>
> Tim
>
>
> On 6 September 2013 14:47, Tim Vaillancourt  wrote:
>
> > Enjoy your trip, Mark! Thanks again for the help!
> >
> > Tim
> >
> >
> > On 6 September 2013 14:18, Mark Miller  wrote:
> >
> >> Okay, thanks, useful info. Getting on a plane, but ill look more at this
> >> soon. That 10k thread spike is good to know - that's no good and could
> >> easily be part of the problem. We want to keep that from happening.
> >>
> >> Mark
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt 
> >> wrote:
> >>
> >> > Hey Mark,
> >> >
> >> > The farthest we've made it at the same batch size/volume was 12 hours
> >> > without this patch, but that isn't consistent. Sometimes we would only
> >> get
> >> > to 6 hours or less.
> >> >
> >> > During the crash I can see an amazing spike in threads to 10k which is
> >> > essentially our ulimit for the JVM, but I strangely see no
> "OutOfMemory:
> >> > cannot open native thread errors" that always follow this. Weird!
> >> >
> >> > We also notice a spike in CPU around the crash. The instability caused
> >> some
> >> > shard recovery/replication though, so that CPU may be a symptom of the
> >> > replication, or is possibly the root cause. The CPU spikes from about
> >> > 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
> >> while
> >> > spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
> >> whole
> >> > index is in 128GB RAM, 6xRAID10 15k).
> >> >
> >> > More on resources: our disk I/O seemed to spike about 2x during the
> >> crash
> >> > (about 1300kbps written to 3500kbps), but this may have been the
> >> > replication, or ERROR logging (we generally log nothing due to
> >> > WARN-severity unless something breaks).
> >> >
> >> > Lastly, I found this stack trace occurring frequently, and have no
> idea
> >> > what it is (may be useful or not):
> >> >
> >> > "java.lang.IllegalStateException :
> >> >  at
> org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >> >  at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >> >  at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >> >  at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >> >  at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >> >  at
> >> >
> >>
> org.eclipse.jetty.server.handler.Con

Re: Solr doesnt return answer when searching numbers

2013-09-11 Thread Erick Erickson
Mail guy.

You've been around long enough to know to try adding &debug=query to your
URL and looking at the results, what does that show?

Best
Erick


On Tue, Sep 10, 2013 at 9:25 AM, Mysurf Mail  wrote:

> I am querying using
>
> http://...:8983/solr/vault/select?q="design test"&fl=PackageName
>
> I get 3 result:
>
>- design test
>- design test 2013
>- design test for jobs
>
> Now when I query using q="test for jobs"
> -> I get only "design test for jobs"
>
> But when I query using q = 2013
>
> http://...:8983/solr/vault/select?q=2013&fl=PackageName
>
> I get no result. Why doesnt it return an answer when I query with numbers?
>
> In schema xml
>
>   required="true"/>
>


Re: No or limited use of FieldCache

2013-09-11 Thread Michael Sokolov

On 9/11/13 3:11 AM, Per Steffensen wrote:

Hi

We have a SolrCloud setup handling huge amounts of data. When we do 
group, facet or sort searches Solr will use its FieldCache, and add 
data in it for every single document we have. For us it is not 
realistic that this will ever fit in memory and we get OOM exceptions. 
Are there some way of disabling the FieldCache (taking the performance 
penalty of course) or make it behave in a nicer way where it only uses 
up to e.g. 80% of the memory available to the JVM? Or other suggestions?


Regards, Per Steffensen
I think you might want to look into using DocValues fields, which are 
column-stride fields stored as compressed arrays - one value per 
document -- for the fields on which you are sorting and faceting. My 
understanding (which is limited) is that these avoid the use of the 
field cache, and I believe you have the option to control whether they 
are held in memory or on disk.  I hope someone who knows more will 
elaborate...
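A hedged schema.xml sketch (the field name is an assumption; this needs Solr
4.2+ and a reindex):

  <field name="category" type="string" indexed="true" stored="false"
         docValues="true"/>

and I believe the field type can also be given docValuesFormat="Disk" if you
want the values kept on disk rather than in the heap.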


-Mike


Dynamic analizer settings change

2013-09-11 Thread maephisto
Let's take the following type definition and schema (borrowed from Rafal
Kuc's Solr 4 cookbook) :








and schema:




The above analizer will apply SnowballPorterFilter english language filter. 
But would it be possible to change the language to french during indexing
for some documents. is this possible? If not, what would be the best
solution for having the same analizer but with different languages, which
languange being determined at index time ?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: charfilter doesn't do anything

2013-09-11 Thread Andreas Owen
perfect, i tried it before but always at the tail of the expression with no 
effect. thanks a lot. one last question: do you know how to keep the html 
comments from being filtered out before the transformer has done its work?


On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

> Okay, I can repro the problem. Yes, in appears that the pattern replace char 
> filter does not default to multiline mode for pattern matching, so  on 
> one line and  on another line cannot be matched.
> 
> Now, whether that is by design or a bug or an option for enhancement is a 
> matter for some committer to comment on.
> 
> But, the good news is that you can in fact set multiline mode in your pattern 
> my starting it with "(?s)", which means that dot accepts line break 
> characters as well.
> 
> So, here are my revised field types:
> 
>  positionIncrementGap="100" >
> 
>pattern="(?s)^.*(.*).*$" replacement="$1" />
>   
>   
> 
> 
> 
>  positionIncrementGap="100" >
> 
>pattern="(?s)^.*(.*).*$" replacement="$1" />
>   
>   
>   
> 
> 
> 
> The first type accepts everything within , including nested HTML 
> formatting, while the latter strips nested HTML formatting as well.
> 
> The tokenizer will in fact strip out white space, but that happens after all 
> character filters have completed.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Tuesday, September 10, 2013 7:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> ok i am getting there now but if there are newlines involved the regex stops 
> as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex. I have 
> to get rid of the newlines. why isn't whitespaceTokenizerFactory the right 
> element for this?
> 
> 
> On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:
> 
>> Use XML then. Although you will need to escape the XML special characters as 
>> I did in the pattern.
>> 
>> The point is simply: Quickly and simply try to find the simple test scenario 
>> that illustrates the problem.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Andreas Owen
>> Sent: Monday, September 09, 2013 7:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> i tried but that isn't working either, it want a data-stream, i'll have to 
>> check how to post json instead of xml
>> 
>> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
>> 
>>> Did you at least try the pattern I gave you?
>>> 
>>> The point of the curl was the data, not how you send the data. You can just 
>>> use the standard Solr simple post tool.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 6:40 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> i've downloaded curl and tried it in the comman prompt and power shell on 
>>> my win 2008r2 server, thats why i used my dataimporter with a single line 
>>> html file and copy/pastet the lines into schema.xml
>>> 
>>> 
>>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>>> 
 Did you in fact try my suggested example? If not, please do so.
 
 -- Jack Krupansky
 
 -Original Message- From: Andreas Owen
 Sent: Monday, September 09, 2013 4:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: charfilter doesn't do anything
 
 i index html pages with a lot of lines and not just a string with the 
 body-tag.
 it doesn't work with proper html files, even though i took all the new 
 lines out.
 
 html-file:
 nav-content nur das will ich sehenfooter-content
 
 solr update debug output:
 "text_html": ["\r\n\r\n>>> content=\"ISO-8859-1\">\r\n>>> content=\"text/html; 
 charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das 
 will ich sehenfooter-content"]
 
 
 
 On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
 
> I tried this and it seems to work when added to the standard Solr example 
> in 4.4:
> 
> 
> 
>  positionIncrementGap="100" >
> 
>  pattern="^.*(.*).*$" replacement="$1" />
> 
> 
> 
> 
> 
> That char filter retains only text between  and . Is that 
> what you wanted?
> 
> Indexing this data:
> 
> curl 'localhost:8983/solr/update?commit=true' -H 
> 'Content-type:application/json' -d '
> [{"id":"doc-1","body":"abc A test. def"}]'
> 
> And querying with these commands:
> 
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
> Shows all data
> 
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
> shows the body text
> 
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
> shows nothing (outside of body)
> 
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";

Re: How to facet data from a multivalued field?

2013-09-11 Thread Raheel Hasan
oh got it.. Thanks a lot...


On Tue, Sep 10, 2013 at 10:10 PM, Erick Erickson wrote:

> You can't facet on fields where indexed="false". When you look at
> output docs, you're seeing _stored_ not indexed data. Set
> indexed="true" and re-index...
>
> Best,
> Erick
>
>
> On Tue, Sep 10, 2013 at 5:51 AM, Rah1x  wrote:
>
> > Hi buddy,
> >
> > I am having this problem that I cant even reach to what you did at first
> > step..
> >
> > all I get is:
> > 
> >
> > This is the schema:
> >  > required="false" omitTermFreqAndPositions="true" multiValued="true" />
> >
> > Note: the data is correctly placed in the field as the query results
> shows.
> > However, the facet is not working.
> >
> > Could you please share the schema of what you did to achieve it?
> >
> > Thanks a lot.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p4089045.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
Regards,
Raheel Hasan
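For reference, a minimal sketch of a facetable multiValued field along the lines Erick describes; the field and type names are illustrative, and the important parts are indexed="true" plus a full re-index afterwards:

<field name="tags" type="string" indexed="true" stored="true"
       required="false" omitTermFreqAndPositions="true" multiValued="true"/>

A facet request against it would then look something like:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=tags&facet.mincount=1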


No or limited use of FieldCache

2013-09-11 Thread Per Steffensen

Hi

We have a SolrCloud setup handling huge amounts of data. When we do
group, facet or sort searches, Solr uses its FieldCache and adds data to
it for every single document we have. For us it is not realistic that
this will ever fit in memory, and we get OOM exceptions. Is there some
way of disabling the FieldCache (taking the performance penalty, of
course), or of making it behave in a nicer way where it only uses up to
e.g. 80% of the memory available to the JVM? Or other suggestions?


Regards, Per Steffensen
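One direction that is often suggested for this kind of problem, assuming Solr 4.2 or later and that a re-index is acceptable, is to declare docValues on the fields used for sorting and faceting (and, depending on version, grouping). Those per-field structures then live in the index files and are memory-mapped rather than being un-inverted onto the Java heap via the FieldCache. A minimal sketch, with illustrative field names:

<field name="category" type="string" indexed="true" stored="false" docValues="true"/>
<field name="created"  type="tdate"  indexed="true" stored="false" docValues="true"/>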


Re: Some highlighted snippets aren't being returned

2013-09-11 Thread Eric O'Hanlon
Thank you, Aloke and Bryan!  I'll give this a try and I'll report back on what 
happens!

- Eric
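Putting the two suggestions quoted below together, a trimmed-down version of the request might look something like this (host, port and the hl.maxAnalyzedChars value are illustrative; the default limit is 51,200 characters):

http://localhost:8080/solr-4.2/select?q=Unangan&defType=edismax&qf=text&hl=true&hl.fl=contents,title,original_url&hl.fragsize=600&hl.maxAnalyzedChars=1000000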

On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal  wrote:

> Hi Eric,
> 
> As Bryan suggests, you should look at setting hl.fragsize and
> hl.maxAnalyzedChars appropriately for long documents.
> 
> One issue I find with your search request is that in trying to
> highlight across three separate fields, you have added each of them as
> a separate request param:
> hl.fl=contents&hl.fl=title&hl.fl=original_url
> 
> The way to do it (http://wiki.apache.org/solr/HighlightingParameters#hl.fl)
> is to pass them as a single comma- (or space-) separated value:
> hl.fl=contents,title,original_url
> 
> Regards,
> Aloke
> 
> On 9/9/13, Bryan Loofbourrow  wrote:
>> Eric,
>> 
>> Your example document is quite long. Are you setting hl.maxAnalyzedChars?
>> If you don't, the highlighter you appear to be using will not look past
>> the first 51,200 characters of the document for snippet candidates.
>> 
>> http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars
>> 
>> -- Bryan
>> 
>> 
>>> -Original Message-
>>> From: Eric O'Hanlon [mailto:elo2...@columbia.edu]
>>> Sent: Sunday, September 08, 2013 2:01 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Some highlighted snippets aren't being returned
>>> 
>>> Hi again Everyone,
>>> 
>>> I didn't get any replies to this, so I thought I'd re-send in case
>> anyone
>>> missed it and has any thoughts.
>>> 
>>> Thanks,
>>> Eric
>>> 
>>> On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon  wrote:
>>> 
 Hi Everyone,
 
 I'm facing an issue in which my solr query is returning highlighted
>>> snippets for some, but not all results.  For reference, I'm searching
>>> through an index that contains web crawls of human-rights-related
>>> websites.  I'm running solr as a webapp under Tomcat and I've included
>> the
>>> query's solr params from the Tomcat log:
 
 ...
 webapp=/solr-4.2
 path=/select
 
 params={facet=true&sort=score+desc&group.limit=10&spellcheck.q=Unangan&f.mimetype_code.facet.limit=7&hl.simple.pre=&q.alt=*:*&f.organization_type__facet.facet.limit=6&f.language__facet.facet.limit=6&hl=true&f.date_of_capture_.facet.limit=6&group.field=original_url&hl.simple.post=&facet.field=domain&facet.field=date_of_capture_&facet.field=mimetype_code&facet.field=geographic_focus__facet&facet.field=organization_based_in__facet&facet.field=organization_type__facet&facet.field=language__facet&facet.field=creator_name__facet&hl.fragsize=600&f.creator_name__facet.facet.limit=6&facet.mincount=1&qf=text^1&hl.fl=contents&hl.fl=title&hl.fl=original_url&wt=ruby&f.geographic_focus__facet.facet.limit=6&defType=edismax&rows=10&f.domain.facet.limit=6&q=Unangan&f.organization_based_in__facet.facet.limit=6&q.op=AND&group=true&hl.usePhraseHighlighter=true} hits=8 status=0 QTime=108
 ...
 
 For the query above (which can be simplified to say: find all
>> documents
>>> that contain the word "unangan" and return facets, highlights, etc.), I
>>> get five search results.  Only three of these are returning highlighted
>>> snippets.  Here's the "highlighting" portion of the solr response (note:
>>> printed in ruby notation because I'm receiving this response in a Rails
>>> app):
 
 
 "highlighting"=>
 
 {"20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
   {},
  "20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
   {},
  "20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%202002%20tentang%20Perlindungan%20Anak.pdf"=>
   {},
  "20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
   {"contents"=>
     ["...actual snippet is returned here..."]},
  "20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf"=>
   {"contents"=>
     ["...actual snippet is returned here..."]},
  "20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2-uu-no-39-tahun-1999"=>
   {"contents"=>
     ["...actual snippet is returned here..."]},
  "20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no-39-tahun-1999?tmpl=component&format=raw"=>
   {"contents"=>
     ["...actual snippet is returned here..."]},
  "20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.pdf"=>
   {}}
 
 
 I have eight (as opposed to five) results above because I'm also doing
>> a
>>> grouped query, grouping by a field called "original_url", and this leads
>>> to five grouped results.
 
 I've confirmed that my highlight-lacking results DO contain the word
>>> "unangan", as expe