Facet at zappos.com

2013-07-25 Thread Ifnu bima
Hi,

I'm currently looking at the Zappos Solr implementation on their website.
One thing that makes me curious is how their facet filtering works.

If you look at the Zappos facet filters, some facets allow filtering on
multiple values, for example size and brand. The behaviour lets the user
select multiple values within a facet without removing the other values
already selected in the same facet. If you compare this behaviour with,
for example, the sample /browse handler in the Solr distribution, it is
quite different, since that only allows selection of a single facet
value per facet filter.

Can Zappos-style multiple facet values be achieved using only
configuration in solrconfig.xml, or does it need custom code when writing
the Solr client?
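
I have seen mentions of the tag/exclude local params, so maybe the query side
looks roughly like this (the "brand" field name is just my guess); is
something like that all that is needed, or is more custom code required?

  q=*:*
  &fq={!tag=brandTag}brand:("Nike" OR "Adidas")
  &facet=true
  &facet.field={!ex=brandTag}brand

As far as I understand, the {!ex=brandTag} part is what keeps the counts for
the unselected brands from disappearing.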

thanks and regards

-- 
http://ifnubima.org/indo-java-podcast/
http://project-template.googlecode.com/
@ifnubima

regards


Re: Duplicate documents based on attribute

2013-07-25 Thread Aditya
One option is to store the color as a multi-valued stored field, but then you
have to do pagination manually. If that worries you, use a database: have a
table with Product Name and Color, and you can retrieve the data with
pagination.

If you still want to achieve it via Solr, have a separate record for every
product and color combination, with fields like ProductName, Color and
RecordType. Since Solr is a NoSQL store, records can have different fields and
not every record needs to have all the fields. You can store different types
of documents and filter the records by their type.
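
As a rough sketch of what I mean (field names are just an example), index one
document per product/color combination and filter or group at query time:

  Documents:
    id=prodA-red    productId=prodA  name="Product A"  color=red    recordType=variation
    id=prodA-blue   productId=prodA  name="Product A"  color=blue   recordType=variation
    id=prodA-green  productId=prodA  name="Product A"  color=green  recordType=variation

  Return all three variations:
    /select?q=name:"Product A"&fq=recordType:variation

  Or collapse them back to one entry per product:
    /select?q=name:"Product A"&fq=recordType:variation&group=true&group.field=productId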

Regards
Aditya
www.findbestopensource.com






On Thu, Jul 25, 2013 at 11:01 PM, Alexandre Rafalovitch
wrote:

> Look for the presentations online. You are not the first store to use Solr,
> there are some explanations around. Try one from Gilt, but I think there
> were more.
>
> You will want to store data at the lowest meaningful level of search
> granularity. So, in your case, it might be ProductVariation (shoes+color).
> Some examples I have seen even store it down to the availability or
> price-difference level. Then you do some post-search normalization, either
> by grouping or by filtering.
>
> Solr is not a database, store what you want to find.
>
> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>
>
> On Thu, Jul 25, 2013 at 12:42 PM, Mark  wrote:
>
> > How would I go about doing something like this? I'm not sure if this is
> > something that can be accomplished on the index side or if it's something
> > that should be done in our application.
> >
> > Say we are an online store for shoes and we are selling Product A in red,
> > blue and green. Is there a way that when we search for Product A all three
> > results can be returned, even though they are logically the same item
> > (same product in our database)?
> >
> > Thoughts on how this can be accomplished?
> >
> > Thanks
> >
> > - M
>


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Tim Vaillancourt

Thanks Shawn and Yonik!

Yonik: I noticed this error appears to be fairly trivial, but it is not 
appearing after a previous crash. Every time I run this high-volume test 
that produced my stack trace, I zero out the logs, Solr data and 
Zookeeper data and start over from scratch with a brand new collection 
and zero'd out logs.


The test is mostly high volume (2000-4000 updates/sec) and at the start
the SolrCloud runs decently for a good 20-60 minutes with no errors in the
logs at all. Then that stack trace occurs on all 3 nodes (staggered), I
immediately get some replica-down messages and then some "cannot
connect" errors to all the other cluster nodes, which have all crashed the
same way. The tlog error could perhaps be a symptom of running
out of threads.


Shawn: thanks so much for sharing those details! Yes, they seem to be 
nice servers, for sure - I don't get to touch/see them but they're fast! 
I'll look into the firmware for sure and will try again after updating
it. These Solr instances are not bare metal; they are actually KVM VMs,
so that's another layer to look into, although it is consistent between
the two clusters.


I am not currently increasing the 'nofiles' ulimit to above default like 
you are, but does Solr use 10,000+ file handles? It won't hurt to try it 
I guess :). To rule out Java 7, I'll probably also try Jetty 8 and Java 
1.6 as an experiment as well.


Thanks!

Tim

On 25/07/13 05:55 PM, Yonik Seeley wrote:

On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt  wrote:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)


That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated from a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Chris Hostetter

: Thanks for your help.   I found a workaround for this use case, which is to
: avoid using a shards query and just asking each shard for a dump of the

that would be (step #1 in) the method I would recommend for your use case of
"check what's in the entire index", because it drastically reduces the
amount of work needed in each query -- you're just talking to one
node at a time, not doing multiplexing and merging of results from all
the nodes.

: do any ranking or sorting.   What I am now seeing is that qtimes have gone
: up from about 5 seconds per request to nearly a minute as the start
: parameter gets higher.  I don't know if this is actually because of the
: start parameter or if something is happening with memory use and/or caching

it's because in order to give you results 3600-3700 it has to
collect all the results from 1-3700 and then pull out the
last 100 (or to put it another way: the request for start=3600
doesn't know what the 3600 results it already gave you were; it has to figure
them out again).

step #2 in the method I would use to deal with your situation would be to
not use "start" at all -- sort the docs on your uniqueKey field, make rows
as big as you are willing to handle in a single request, and then instead
of incrementing "start" on each request, add an fq to each subsequent
query after the first one that filters the results to docs with a
uniqueKey field value greater than the last one seen in the previous response.
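
a rough sketch of that pattern (assuming the uniqueKey field is "id" and the
page size is arbitrary):

  first request:
    /select?q=*:*&fl=id&sort=id+asc&rows=100000

  every request after that, where LAST_ID is the highest id seen so far:
    /select?q=*:*&fl=id&sort=id+asc&rows=100000&fq=id:{LAST_ID TO *]

the exclusive curly brace on the lower bound keeps the last doc from the
previous page from being returned twice.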

this is similar to what a lot of REST APIs seem to do (Twitter comes to
mind) to avoid the problem of dealing with deep paging efficiently or
trying to keep track of "cursor" reservations on the server side --
they just don't offer either, and instead they let the client keep
track of the state (ie: "max_id") between requests.


-Hoss


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Shawn Heisey
On 7/25/2013 6:53 PM, Tim Vaillancourt wrote:
> Thanks for the reply Shawn, I can always count on you :).
> 
> We are using 10GB heaps and have over 100GB of OS cache free to answer the
> JVM question, Young has about 50% of the heap, all CMS. Our max number of
> processes for the JVM user is 10k, which is where Solr dies when it blows
> up with 'cannot create native thread'.
> 
> I also want to say this is system related, but I am seeing this occur on
> all 3 servers, which are brand-new Dell R720s. I'm not saying this is
> impossible, but I don't see much to suggest that, and it would need to be
> one hell of a coincidence.

Nice hardware.  I have some R720xd servers for another project unrelated
to Solr, love them.

I know a little about Dell servers.  If you haven't done so already, I
would install the OpenManage repo and get the firmware fully updated -
BIOS, RAID, and LAN in particular.  Instructions that are pretty easy to
follow:

http://linux.dell.com/repo/hardware/latest/

For process/file limits, I have the following in
/etc/security/limits.conf on systems that aren't using Cloud:

ncindex hard    nproc   6144
ncindex soft    nproc   4096

ncindex hard    nofile  65535
ncindex soft    nofile  49151

> To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
> on the same Solr, Jetty and JVM versions that does not exhibit this issue,
> although it uses a completely different schema, servers and access patterns
> (it is also high-TPS). That is some evidence that the current
> software stack is OK, or maybe this only occurs under an extreme load that the
> 2nd cluster does not see, or lastly only with a certain schema.

This is a big reason why I think you should make sure you're fully up to
date on your firmware, as the hardware seems to be one strong
difference.  As much as I love Dell server hardware, firmware issues are
relatively common, especially on early versions of the latest
generation, which includes the R720.

> Lastly, to add a bit more detail to my original description, so far I have
> tried:
> 
> - Entirely rebuilding my cluster from scratch, reinstalling all deps,
> configs, reindexing the data (in case I screwed up somewhere). The EXACT
> same issue occurs under load about 20-45 minutes in.
> - Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
> occurs after some load.
> - Restarting SolrCloud / forcing rebuilds or cores. Same issue occurs after
> some load.

The only other thing I can think of is increasing your zkClientTimeout
to 30 seconds or so and trying Solr 4.4 so you have SOLR-4899 and
SOLR-4805.  That's very definitely a shot in the dark.
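
If you do try the timeout change, it's an attribute on the <cores> element in
the old-style solr.xml, or a system property; something like this (30000 is
just an example value):

<cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}"
       zkClientTimeout="${zkClientTimeout:30000}">

or on the command line: -DzkClientTimeout=30000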

Thanks,
Shawn



Re: problems about solr replication in 4.3

2013-07-25 Thread xiaoqi
Thank you very much for replying.

In fact, we have made some progress on this problem: we found that when the
master builds its index, it cleans its own index first. So the slave, which
syncs the index every minute, ends up destroying its own index folder.

By the way, we build the index using
dataimport0?command=full-import&clean=false,
dataimport1?command=full-import&clean=false and
dataimport2?command=full-import&clean=false.

When we used this with Solr 3.6 there was no problem; the index was never
deleted first.

Does Solr 4 need any special configuration for this?

Thanks a lot.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/problems-about-solr-replication-in-4-3-tp4079665p4080480.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Yonik Seeley
On Thu, Jul 25, 2013 at 7:44 PM, Tim Vaillancourt  wrote:
> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
> Failure to open existing log file (non fatal)
>

That itself isn't necessarily a problem (and why it says "non fatal")
- it just means that most likely a transaction log file was
truncated from a previous crash.  It may be unrelated to the other
issues you are seeing.

-Yonik
http://lucidworks.com


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Tim Vaillancourt
Thanks for the reply Shawn, I can always count on you :).

We are using 10GB heaps and have over 100GB of OS cache free to answer the
JVM question, Young has about 50% of the heap, all CMS. Our max number of
processes for the JVM user is 10k, which is where Solr dies when it blows
up with 'cannot create native thread'.

I also want to say this is system related, but I am seeing this occur on
all 3 servers, which are brand-new Dell R720s. I'm not saying this is
impossible, but I don't see much to suggest that, and it would need to be
one hell of a coincidence.

To add more confusion to the mix, we actually run a 2nd SolrCloud cluster
on the same Solr, Jetty and JVM versions that does not exhibit this issue,
although it uses a completely different schema, servers and access patterns
(it is also high-TPS). That is some evidence that the current
software stack is OK, or maybe this only occurs under an extreme load that the
2nd cluster does not see, or lastly only with a certain schema.

Lastly, to add a bit more detail to my original description, so far I have
tried:

- Entirely rebuilding my cluster from scratch, reinstalling all deps,
configs, reindexing the data (in case I screwed up somewhere). The EXACT
same issue occurs under load about 20-45 minutes in.
- Moving to Java 1.7.0_21 from _25 due to some known bugs. Same issue
occurs after some load.
- Restarting SolrCloud / forcing rebuilds or cores. Same issue occurs after
some load.

Cheers,

Tim


On 25 July 2013 17:13, Shawn Heisey  wrote:

> On 7/25/2013 5:44 PM, Tim Vaillancourt wrote:
>
>> The transaction log error I receive after about 10-30 minutes of load
>> testing is:
>>
>> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.**SolrException]
>> Failure to open existing log file (non fatal)
>> /opt/easw/easw_apps/easo_solr_**cloud/solr/xmshd_shard3_**
>> replica2/data/tlog/tlog.**078:org.**apache.solr.common.**
>> SolrException:
>> java.io.EOFException
>>
>
> 
>
>
>  Caused by: java.io.EOFException
>>  at
>> org.apache.solr.common.util.**FastInputStream.**readUnsignedByte(**
>> FastInputStream.java:73)
>>  at
>> org.apache.solr.common.util.**FastInputStream.readInt(**
>> FastInputStream.java:216)
>>  at
>> org.apache.solr.update.**TransactionLog.readHeader(**
>> TransactionLog.java:266)
>>  at
>> org.apache.solr.update.**TransactionLog.(**TransactionLog.java:160)
>>  ... 25 more
>> "
>>
>
> This looks to me like a system problem.  RHEL should be pretty solid, I
> use CentOS without any trouble.  My initial guesses are a corrupt
> filesystem, failing hardware, or possibly a kernel problem with your
> specific hardware.
>
> I'm running Jetty 8, which is the version that the example uses.  Could
> Jetty 9 be a problem here?  I couldn't really say, though my initial guess
> is that it's not a problem.
>
> I'm running Oracle Java 1.7.0_13.  Normally later releases are better, but
> Java bugs do exist and do get introduced in later releases.  Because you're
> on the absolute latest, I'm guessing that you had the problem with an
> earlier release and upgraded to see if it went away.  If that's what
> happened, it is less likely that it's Java.
>
> My first instinct would be to do a 'yum distro-sync' followed by 'touch
> /forcefsck' and reboot with console access to the server, so that you can
> deal with any fsck problems.  Perhaps you've already tried that. I'm aware
> that this could be very very hard to get pushed through strict change
> management procedures.
>
> I did some searching.  SOLR-4519 is a different problem, but it looks like
> it has a similar underlying exception, with no resolution.  It was filed
> when Solr 4.1.0 was current.
>
> Could there be a resource problem - heap too small, not enough OS disk
> cache, etc?
>
> Thanks,
> Shawn
>
>


Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Shawn Heisey

On 7/25/2013 5:44 PM, Tim Vaillancourt wrote:

The transaction log error I receive after about 10-30 minutes of load
testing is:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)
/opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException:
java.io.EOFException





Caused by: java.io.EOFException
 at
org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
 at
org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
 at
org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
 at
org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
 ... 25 more
"


This looks to me like a system problem.  RHEL should be pretty solid, I 
use CentOS without any trouble.  My initial guesses are a corrupt 
filesystem, failing hardware, or possibly a kernel problem with your 
specific hardware.


I'm running Jetty 8, which is the version that the example uses.  Could 
Jetty 9 be a problem here?  I couldn't really say, though my initial 
guess is that it's not a problem.


I'm running Oracle Java 1.7.0_13.  Normally later releases are better, 
but Java bugs do exist and do get introduced in later releases.  Because 
you're on the absolute latest, I'm guessing that you had the problem 
with an earlier release and upgraded to see if it went away.  If that's 
what happened, it is less likely that it's Java.


My first instinct would be to do a 'yum distro-sync' followed by 'touch 
/forcefsck' and reboot with console access to the server, so that you 
can deal with any fsck problems.  Perhaps you've already tried that. 
I'm aware that this could be very very hard to get pushed through strict 
change management procedures.


I did some searching.  SOLR-4519 is a different problem, but it looks 
like it has a similar underlying exception, with no resolution.  It was 
filed when Solr 4.1.0 was current.


Could there be a resource problem - heap too small, not enough OS disk 
cache, etc?


Thanks,
Shawn



Re: SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Tim Vaillancourt
Stack trace:

http://timvaillancourt.com.s3.amazonaws.com/tmp/solrcloud.nodeC.2013-07-25-16.jstack.gz

Cheers!

Tim


On 25 July 2013 16:44, Tim Vaillancourt  wrote:

> Hey guys,
>
> I am reaching out to the Solr list with a very vague issue: under high
> load against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas
> (2 cores per instance), I eventually see failure messages related to
> transaction logs, and shortly after these stacktraces occur the cluster
> starts to fall apart.
>
> To explain my setup:
> - SolrCloud 4.3.1.
> - Jetty 9.x.
> - Oracle/Sun JDK 1.7.25 w/CMS.
> - RHEL 6.x 64-bit.
> - 3 instances, 1 per server.
> - 3 shards.
> - 2 replicas per shard.
>
> The transaction log error I receive after about 10-30 minutes of load
> testing is:
>
> "ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
> Failure to open existing log file (non fatal)
> /opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException:
> java.io.EOFException
> at
> org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
> at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
> at
> org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
> at
> org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
> at
> org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
> at
> org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
> at
> org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
> at
> org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
> at
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
> at
> org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
> at
> org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.EOFException
> at
> org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
> at
> org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
> at
> org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
> at
> org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
> ... 25 more
> "
>
> Eventually after a few of these stack traces, the cluster starts to lose
> shards and replicas fail. Jetty then creates hung threads until hitting
> OutOfMemory on native threads due to the maximum process ulimit.
>
> I know this is quite a vague issue, so I'm not expecting a silver-bullet
> answer, but I was wondering if anyone has suggestions on where to look
> next? Does this sound Solr-related at all, or possibly system? Has anyone
> seen this issue before, or has any hypothesis how to find out more?
>
> I will reply shortly with a thread dump, taken from 1 locked-up node.
>
> Thanks for any suggestions!
>
> Tim
>


SolrCloud 4.3.1 - "Failure to open existing log file (non fatal)" errors under high load

2013-07-25 Thread Tim Vaillancourt
Hey guys,

I am reaching out to the Solr list with a very vague issue: under high load
against a SolrCloud 4.3.1 cluster of 3 instances, 3 shards, 2 replicas (2
cores per instance), I eventually see failure messages related to
transaction logs, and shortly after these stacktraces occur the cluster
starts to fall apart.

To explain my setup:
- SolrCloud 4.3.1.
- Jetty 9.x.
- Oracle/Sun JDK 1.7.25 w/CMS.
- RHEL 6.x 64-bit.
- 3 instances, 1 per server.
- 3 shards.
- 2 replicas per shard.

The transaction log error I receive after about 10-30 minutes of load
testing is:

"ERROR [2013-07-25 19:34:24.264] [org.apache.solr.common.SolrException]
Failure to open existing log file (non fatal)
/opt/easw/easw_apps/easo_solr_cloud/solr/xmshd_shard3_replica2/data/tlog/tlog.078:org.apache.solr.common.SolrException:
java.io.EOFException
at
org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:182)
at org.apache.solr.update.UpdateLog.init(UpdateLog.java:233)
at
org.apache.solr.update.UpdateHandler.initLog(UpdateHandler.java:83)
at
org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:138)
at
org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:125)
at
org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:95)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:525)
at
org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:596)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:805)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
at
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:894)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:982)
at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:597)
at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:592)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
at
org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:73)
at
org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:216)
at
org.apache.solr.update.TransactionLog.readHeader(TransactionLog.java:266)
at
org.apache.solr.update.TransactionLog.<init>(TransactionLog.java:160)
... 25 more
"

Eventually after a few of these stack traces, the cluster starts to lose
shards and replicas fail. Jetty then creates hung threads until hitting
OutOfMemory on native threads due to the maximum process ulimit.

I know this is quite a vague issue, so I'm not expecting a silver-bullet
answer, but I was wondering if anyone has suggestions on where to look
next? Does this sound Solr-related at all, or possibly system? Has anyone
seen this issue before, or has any hypothesis how to find out more?

I will reply shortly with a thread dump, taken from 1 locked-up node.

Thanks for any suggestions!

Tim


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Shawn Heisey

On 7/25/2013 4:45 PM, Tom Burton-West wrote:

Thanks for your help.   I found a workaround for this use case, which is to
avoid using a shards query and just asking each shard for a dump of the
unique ids. i.e. run an *:* query and ask for 1 million rows at a time.
This should be a no scoring query, so I would think that it doesn't have to
do any ranking or sorting.   What I am now seeing is that qtimes have gone
up from about 5 seconds per request to nearly a minute as the start
parameter gets higher.  I don't know if this is actually because of the
start parameter or if something is happening with memory use and/or caching
that is just causing things to take longer.  I'm at around 35 out of 119
million for this shard and queries have gone from taking 5 seconds to
taking almost a minute.

INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3600&q=*:*&rows=100}
hits=119220943 status=0 QTime=52952


Sounds like your servers are handling deep paging far better than I 
would have guessed.  I've seen people talk about exponential query time 
growth from deep paging after only a few pages.  Your times are going 
up, but the increase is *relatively* slow, and you've made it 36 pages in.


Getting the information as you're doing it now will be slow, but 
probably reliable.  Moving to non-distributed requests against the 
individual shards was a good idea.


From my own testing: By bumping my max heap on my dev server from 7GB 
to 9GB, I was able to get a million row result (distributed) in only 
four minutes, whereas it had reached 45 minutes before with no end in 
sight.  It was having huge GC pauses from extremely frequent full GCs. 
That problem persisted after the heap increase, but it wasn't as bad, 
and I was also dealing with the fact that my OS disk cache on the dev 
server is way too small.


Thanks,
Shawn



Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hi Shawn,

Thanks for your help.   I found a workaround for this use case, which is to
avoid using a shards query and just asking each shard for a dump of the
unique ids. i.e. run an *:* query and ask for 1 million rows at a time.
This should be a no scoring query, so I would think that it doesn't have to
do any ranking or sorting.   What I am now seeing is that qtimes have gone
up from about 5 seconds per request to nearly a minute as the start
parameter gets higher.  I don't know if this is actually because of the
start parameter or if something is happening with memory use and/or caching
that is just causing things to take longer.  I'm at around 35 out of 119
million for this shard and queries have gone from taking 5 seconds to
taking almost a minute.

INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3600&q=*:*&rows=100}
hits=119220943 status=0 QTime=52952


Tom


INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=700&q=*:*&rows=100}
hits=119220943 status=0 QTime=9772
Jul 25, 2013 5:39:43 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=800&q=*:*&rows=100}
hits=119220943 status=0 QTime=11274
Jul 25, 2013 5:41:44 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=900&q=*:*&rows=100}
hits=119220943 status=0 QTime=13104
Jul 25, 2013 5:43:39 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=1000&q=*:*&rows=100}
hits=119220943 status=0 QTime=13568
...
...
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=1300&q=*:*&rows=100}
hits=119220943 status=0 QTime=26703

Jul 25, 2013 5:58:20 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=1700&q=*:*&rows=100}
hits=119220943 status=0 QTime=22607
Jul 25, 2013 6:00:31 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=1800&q=*:*&rows=100}
hits=119220943 status=0 QTime=24109
...
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3000&q=*:*&rows=100}
hits=119220943 status=0 QTime=41034
Jul 25, 2013 6:31:36 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3100&q=*:*&rows=100}
hits=119220943 status=0 QTime=42844
Jul 25, 2013 6:34:16 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3200&q=*:*&rows=100}
hits=119220943 status=0 QTime=45046
Jul 25, 2013 6:36:57 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3300&q=*:*&rows=100}
hits=119220943 status=0 QTime=49792
Jul 25, 2013 6:39:43 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3400&q=*:*&rows=100}
hits=119220943 status=0 QTime=58699




On Thu, Jul 25, 2013 at 6:18 PM, Shawn Heisey  wrote:

> On 7/25/2013 3:09 PM, Tom Burton-West wrote:
>
>> Thanks Shawn,
>>
>> I was confused by the error message: "Invalid version (expected 2, but 60)
>> or the data in not in 'javabin' format"
>>
>> Your explanation makes sense.  I didn't think about what the shards have
>> to
>> send back to the head shard.
>> Now that I look in my logs, I can see the posts that  the shards are
>> sending to the head shard and actually get a good measure of how many
>> bytes
>> are being sent around.
>>
>> I'll poke around and look at multipartUploadLimitInKB, and also see if
>> there is some servlet container limit config I might need to mess with.
>>
>
> I think I figured it out, after a peek at the source code.  I upgraded to
> Solr 4.4 first, my 100,000 row query still didn't work.  By setting
> formdataUploadLimitInKB (in addition to multipartUploadLimitInKB, not sure
> if both are required), I was able to get a 100,000 row query to work.
>
> A query for one million rows did finally respond to my browser query, but
> it took a REALLY REALLY long time (82 million docs in several shards, only
> 16GB RAM on the dev server) and it crashed firefox due to the size of the
> response.  It also seemed to error out on some of the shard responses.  My
> handler has shards.tolerant=true, so that didn't seem to kill the whole
> query ... but because the response crashed firefox, I couldn't tell.
>
> I repeated the query using curl so I could save the response.  It's been
> running for several minutes without any server-side errors, but I still
> don't have any results.
>
> Your servers are much more robust than my little dev server, so this might
> work for you - if you aren't using the start parameter in addition to the
> rows parameter.  You might need to sort ascending by your 

Re: Error opening Reader and new searcher on solr 4.4 with DocValues for fields

2013-07-25 Thread Marcin Rzewucki
http://wiki.apache.org/solr/DocValues#Specifying_a_different_Codec_implementation

OK, it seems there's no back-compat for the disk-based docValues
implementation. I have to reindex the documents to get rid of this issue.
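
For reference, the on-disk implementation is picked per field type in
schema.xml (the names here are just an example) and needs the schema codec
factory enabled in solrconfig.xml:

  <!-- solrconfig.xml -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- schema.xml -->
  <fieldType name="string_dvd" class="solr.StrField" docValuesFormat="Disk"/>
  <field name="category" type="string_dvd" indexed="true" stored="true" docValues="true"/>

So after changing the format (or upgrading across a format change) the
affected fields have to be reindexed.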


On 25 July 2013 22:17, Marcin Rzewucki  wrote:

> Hi,
>
> After upgrading from solr 4.3.1 to solr 4.4 I have the following issue:
>
> ERROR - 2013-07-25 20:00:15.433; org.apache.solr.core.CoreContainer;
> Unable to create core: awslocal_shard5
> org.apache.solr.common.SolrException: Error opening new searcher
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:835)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:629)
> at
> org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:270)
> at
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:655)
> at
> org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
> at
> org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
> at
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1522)
> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1634)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:810)
> ... 13 more
> Caused by: org.apache.solr.common.SolrException: Error opening Reader
> at
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:177)
> at
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:188)
> at
> org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:184)
> at
> org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1497)
> ... 15 more
> Caused by: org.apache.lucene.index.CorruptIndexException: invalid type:
> -92,
> resource=NIOFSIndexInput(path="/mnt/tmp1/test/shard5/data/index/_5q_Disk_0.dvdm")
> at
> org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.readFields(DiskDocValuesProducer.java:159)
> at
> org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.<init>(DiskDocValuesProducer.java:72)
> at
> org.apache.lucene.codecs.diskdv.DiskDocValuesFormat.fieldsProducer(DiskDocValuesFormat.java:49)
> at
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:213)
> at
> org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:282)
> at
> org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:134)
> at
> org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
> at
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
> at
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
> at
> org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
> at
> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
> at
> org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:174)
> ... 18 more
> DEBUG - 2013-07-25 20:00:15.442;
> org.eclipse.jetty.webapp.WebAppClassLoader; loaded class
> org.apache.log4j.spi.LoggingEvent from startJarLoader@665e2517
> INFO  - 2013-07-25 20:00:15.440;
> org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
> ZooKeeper...
>
> I'm using DocValues with "on disk" option. Could it be a problem ? Index
> does not seem to be corrupted. It works fine on solr 4.3.1. Is there some
> change in files format ? Is it possible to upgrade solr to 4.4 without
> reloading all documents ? Or maybe some additional settings are required
> for DocValues fields ?
>
> Thanks in advance.
> Regards.
>
>


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Shawn Heisey

On 7/25/2013 3:09 PM, Tom Burton-West wrote:

Thanks Shawn,

I was confused by the error message: "Invalid version (expected 2, but 60)
or the data in not in 'javabin' format"

Your explanation makes sense.  I didn't think about what the shards have to
send back to the head shard.
Now that I look in my logs, I can see the posts that  the shards are
sending to the head shard and actually get a good measure of how many bytes
are being sent around.

I'll poke around and look at multipartUploadLimitInKB, and also see if
there is some servlet container limit config I might need to mess with.


I think I figured it out, after a peek at the source code.  I upgraded 
to Solr 4.4 first, my 100,000 row query still didn't work.  By setting 
formdataUploadLimitInKB (in addition to multipartUploadLimitInKB, not 
sure if both are required), I was able to get a 100,000 row query to work.
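
Both limits live on the requestParsers element inside requestDispatcher in
solrconfig.xml; roughly like this (the sizes are just examples):

<requestDispatcher handleSelect="false">
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="2048000"
                  formdataUploadLimitInKB="2048000"/>
</requestDispatcher>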


A query for one million rows did finally respond to my browser query, 
but it took a REALLY REALLY long time (82 million docs in several 
shards, only 16GB RAM on the dev server) and it crashed firefox due to 
the size of the response.  It also seemed to error out on some of the 
shard responses.  My handler has shards.tolerant=true, so that didn't 
seem to kill the whole query ... but because the response crashed 
firefox, I couldn't tell.


I repeated the query using curl so I could save the response.  It's been 
running for several minutes without any server-side errors, but I still 
don't have any results.


Your servers are much more robust than my little dev server, so this 
might work for you - if you aren't using the start parameter in addition 
to the rows parameter.  You might need to sort ascending by your unique 
key field and use a range query ([* TO *] for the first one), find the 
highest value in the response, and then send a targeted range query (the 
value {max_from_last_run TO *] would work) asking for the next million 
records.


Thanks,
Shawn



Re: returning only certain fields from the docs - parsing on the server side

2013-07-25 Thread Walter Underwood
Yes, your assumption is wrong. It does what it says, "only the fields in this 
list will be included" in the response.

wunder

On Jul 25, 2013, at 2:44 PM, Matt Lieber wrote:

> Hi,
> 
> I only want to return one field in the documents being returned from my query.
> I know there is the 'fl' parameter, which is described in the documentation 
> http://wiki.apache.org/solr/CommonQueryParameters as:
> 
> "This parameter can be used to specify a set of fields to return, limiting 
> the amount of information in the response. When returning the results to the 
> client, only fields in this list will be included."
> 
> But seems like 'fl' works on the client side, after the results have been 
> constructed on the server side, passing the whole docs back on the wire. Is 
> my assumption wrong ?
> Is there a way to filter things out directly on the Solr side, and return 
> only the field that I desire to the client?
> 
> Thanks,
> Matt
> 
> 






Re: returning only certain fields from the docs - parsing on the server side

2013-07-25 Thread Upayavira
fl is on the server side. Try it in a browser and you'll see that.
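
For example, something like this (adjust host and core to your setup) comes
back with only the id field for each matching document:

  http://localhost:8983/solr/collection1/select?q=*:*&fl=id&wt=json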

Upayavira

On Thu, Jul 25, 2013, at 10:44 PM, Matt Lieber wrote:
> Hi,
> 
> I only want to return one field in the documents being returned from my
> query.
> I know there is the 'fl' parameter, which is described in the
> documentation http://wiki.apache.org/solr/CommonQueryParameters as:
> 
> "This parameter can be used to specify a set of fields to return,
> limiting the amount of information in the response. When returning the
> results to the client, only fields in this list will be included."
> 
> But seems like 'fl' works on the client side, after the results have been
> constructed on the server side, passing the whole docs back on the wire.
> Is my assumption wrong ?
> Is there a way to filter things out directly on the Solr side, and return
> only the field that I desire to the client?
> 
> Thanks,
> Matt
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


returning only certain fields from the docs - parsing on the server side

2013-07-25 Thread Matt Lieber
Hi,

I only want to return one field in the documents being returned from my query.
I know there is the 'fl' parameter, which is described in the documentation 
http://wiki.apache.org/solr/CommonQueryParameters as:

"This parameter can be used to specify a set of fields to return, limiting the 
amount of information in the response. When returning the results to the 
client, only fields in this list will be included."

But it seems like 'fl' works on the client side, after the results have been
constructed on the server side, passing the whole docs back over the wire. Is my
assumption wrong?
Is there a way to filter things out directly on the Solr side, and return only
the field that I desire to the client?

Thanks,
Matt












Re: Can we use replication to union the data of master and slave?

2013-07-25 Thread Upayavira
I'm not entirely clear about your question. However, with replication you
should never commit docs directly to your slave; it will mess up the
synchronisation of your indexes, and hence mess up your replication. If
that's what you are proposing, don't do it!

Upayavira

On Thu, Jul 25, 2013, at 08:29 PM, SolrLover wrote:
> We are using SOLR 4.3.1 but not using solrcloud now.
> 
> We currently support both push and pull indexing and we use softcommit
> for
> push indexing purpose. Now whenever we perform pull indexing (using
> indexer
> program) the changes made by the push indexing process (during indexing
> time) might get lost hence trying to figure out a way to merge the
> modified
> documents..
> 
> I can implement a master and slave setup. I can initiate pull indexing in
> master, slave will be accepting the documents pushed via queue. Now once
> the
> indexing in master is completed, I can replicate the index in slave. I
> just
> want to confirm, if the additional documents in slave will get deleted
> during replication or the new data will get appended in slave? If not, is
> there any other way to resolve this issue?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-we-use-replication-to-union-the-data-of-master-and-slave-tp4080425.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Thanks Shawn,

I was confused by the error message: "Invalid version (expected 2, but 60)
or the data in not in 'javabin' format"

Your explanation makes sense.  I didn't think about what the shards have to
send back to the head shard.
Now that I look in my logs, I can see the posts that  the shards are
sending to the head shard and actually get a good measure of how many bytes
are being sent around.

I'll poke around and look at multipartUploadLimitInKB, and also see if
there is some servlet container limit config I might need to mess with.

Tom




On Thu, Jul 25, 2013 at 2:46 PM, Shawn Heisey  wrote:

> On 7/25/2013 12:26 PM, Shawn Heisey wrote:
>
>> Either multipartUploadLimitInKB doesn't work properly, or there may be
>> some hard limits built into the servlet container, because I set
>> multipartUploadLimitInKB in the requestDispatcher config to 32768 and it
>> still didn't work.  I wonder, perhaps there is a client-side POST buffer
>> limit as well as the servlet container limit, which comes in to play
>> because the Solr server is acting as a client for the distributed
>> requests?
>>
>
> Followup:
>
> I should probably add that I used a different version (and got some
> different errors) because what I've got on my dev server is an old
> branch_4x version:
>
> 4.4-SNAPSHOT 1497605 - ncindex - 2013-06-27 17:12:30
>
> My online production system is 4.2.1, but I am not going to run this query
> on that system because of the potential to break things.  I did try it
> against my backup production system running 3.5.0 with a 1MB server-side
> POST buffer and got an error that seems to at least partially confirm my
> suspicions.  Here's an excerpt:
>
> HTTP ERROR 500
>
> Problem accessing /solr/ncmain/select. Reason:
>
> Form too large 18425104>1048576
>
> java.lang.IllegalStateException: Form too large 18425104>1048576
> at org.mortbay.jetty.Request.extractParameters(Request.java:1561)
> at org.mortbay.jetty.Request.getParameterMap(Request.java:870)
> at org.apache.solr.request.ServletSolrParams.<init>(ServletSolrParams.java:29)
> at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:394)
> at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223)
> at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
> Thanks,
> Shawn
>
>


Re: Solr Index Files in a Directories

2013-07-25 Thread Jack Krupansky
Use LucidWorks Search, define a file system data source and set the schedule 
to crawl the directory every minute, 5 minutes, 30 seconds, or whatever 
interval you want.


http://docs.lucidworks.com/display/lweug/Simple+Filesystem+Data+Sources
http://docs.lucidworks.com/display/help/Schedules

-- Jack Krupansky

-Original Message- 
From: Rajesh Jain

Sent: Thursday, July 25, 2013 3:57 PM
To: solr-user@lucene.apache.org
Subject: Solr Index Files in a Directories

I have flume sink directory where new files are being written periodically.

How can I instruct solr to index the files in the directory every time a
new file gets written.

Any ideas?

Thanks,
Rajesh 



Re: Wildcard matching of dynamic fields

2013-07-25 Thread Jack Krupansky
Yeah, those are the rules. They are more of a heuristic that manages to work 
most of the time reasonably well, but like most heuristics, it is not 
perfect.


In this particular case, your best bet would be to use an update processor 
to discard the "ignored" field values before Solr actually sees them at the 
dynamic field pattern match level.


See:
http://lucene.apache.org/solr/4_4_0/solr-core/org/apache/solr/update/processor/IgnoreFieldUpdateProcessorFactory.html


<processor class="solr.IgnoreFieldUpdateProcessorFactory">
  <str name="fieldRegex">nosolr_.*</str>
</processor>


You can use full regex patterns or lists of field names. (I have more 
examples in my book.)


-- Jack Krupansky

-Original Message- 
From: Artem Karpenko

Sent: Thursday, July 25, 2013 11:05 AM
To: solr-user@lucene.apache.org
Subject: Wildcard matching of dynamic fields

Hi,

given a dynamic field

<dynamicField name="*_boolean" type="boolean" indexed="true" stored="true"/>

There are some other suffix-based fields as well. And some of the fields
in a document should be ignored; they have a "nosolr_" prefix. But defining

<dynamicField name="nosolr_*" type="ignored"/>

even at the start of the schema does not work: the field
"nosolr_inv_dunning_boolean" is recognized as boolean anyway and shown
in search results. The documentation says that "longer patterns will be
matched first, if equal size patterns both match, the first appearing in
the schema will be used".

What can be done here (apart from changing the input document)? And,
generally: why would such a "longer wins" strategy be used here? What use
case does it have?

Best,
Artem. 



Error opening Reader and new searcher on solr 4.4 with DocValues for fields

2013-07-25 Thread Marcin Rzewucki
Hi,

After upgrading from solr 4.3.1 to solr 4.4 I have the following issue:

ERROR - 2013-07-25 20:00:15.433; org.apache.solr.core.CoreContainer; Unable
to create core: awslocal_shard5
org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:835)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:629)
at
org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:270)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:655)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1522)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1634)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:810)
... 13 more
Caused by: org.apache.solr.common.SolrException: Error opening Reader
at
org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:177)
at
org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:188)
at
org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:184)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1497)
... 15 more
Caused by: org.apache.lucene.index.CorruptIndexException: invalid type:
-92,
resource=NIOFSIndexInput(path="/mnt/tmp1/test/shard5/data/index/_5q_Disk_0.dvdm")
at
org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.readFields(DiskDocValuesProducer.java:159)
at
org.apache.lucene.codecs.diskdv.DiskDocValuesProducer.<init>(DiskDocValuesProducer.java:72)
at
org.apache.lucene.codecs.diskdv.DiskDocValuesFormat.fieldsProducer(DiskDocValuesFormat.java:49)
at
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsReader.<init>(PerFieldDocValuesFormat.java:213)
at
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat.fieldsProducer(PerFieldDocValuesFormat.java:282)
at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:134)
at
org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
at
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at
org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
at
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
at
org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:174)
... 18 more
DEBUG - 2013-07-25 20:00:15.442;
org.eclipse.jetty.webapp.WebAppClassLoader; loaded class
org.apache.log4j.spi.LoggingEvent from startJarLoader@665e2517
INFO  - 2013-07-25 20:00:15.440;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...

I'm using DocValues with "on disk" option. Could it be a problem ? Index
does not seem to be corrupted. It works fine on solr 4.3.1. Is there some
change in files format ? Is it possible to upgrade solr to 4.4 without
reloading all documents ? Or maybe some additional settings are required
for DocValues fields ?

Thanks in advance.
Regards.


Solr Index Files in a Directories

2013-07-25 Thread Rajesh Jain
I have flume sink directory where new files are being written periodically.

How can I instruct Solr to index the files in the directory every time a
new file gets written?

Any ideas?

Thanks,
Rajesh


Can we use replication to union the data of master and slave?

2013-07-25 Thread SolrLover
We are using SOLR 4.3.1 but not using solrcloud now.

We currently support both push and pull indexing, and we use soft commits for
push indexing. Whenever we perform pull indexing (using an indexer program),
the changes made by the push indexing process during that time might get
lost, so we are trying to figure out a way to merge the modified documents.

I can implement a master and slave setup: I can initiate pull indexing on the
master, while the slave keeps accepting the documents pushed via the queue.
Once the indexing on the master is completed, I can replicate the index to
the slave. I just want to confirm whether the additional documents on the
slave will get deleted during replication, or whether the new data will be
appended on the slave. If they get deleted, is there any other way to resolve
this issue?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-we-use-replication-to-union-the-data-of-master-and-slave-tp4080425.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do I need solr.xml?

2013-07-25 Thread Brian Robinson
Great, this guidance is definitely pointing me in the right direction. 
Thanks Shawn, Erick, and Hoss. I'll pursue this some more and see if I 
can get it working.

Brian


EarlyTerminatingCollectorException in MLT Component of SOLR 4.4

2013-07-25 Thread Domma, Achim
Hi,

I send a query to Solr which returns exactly one document. It's an
"id:some_doc_id" search. Here are the parameters as shown in the response:

  params: {
  mlt.mindf: "1",
  mlt.count: "5",
  mlt.fl: "text",
  fl: "id,,application_id,...
project_start,project_end,project_title,score",
  start: "0",
  q: "id:some_doc_id",
  mlt.mintf: "1",
  mlt: "true",
  wt: "json",
  rows: "1"
  }

The response key contains the document I expected, but I also get an error,
which seems to happen in the MLT component. Here's the stack trace provided
in the response:

  org.apache.solr.search.EarlyTerminatingCollectorException
  at
org.apache.solr.search.EarlyTerminatingCollector.collect(EarlyTerminatingCollector.java:62)

  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:289)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:624)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
  at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1494)

  at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)

  at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
  at
org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:1226)

  at
org.apache.solr.handler.MoreLikeThisHandler$MoreLikeThisHelper.getMoreLikeThis(MoreLikeThisHandler.java:365)

  at
org.apache.solr.handler.component.MoreLikeThisComponent.getMoreLikeThese(MoreLikeThisComponent.java:356)

  at
org.apache.solr.handler.component.MoreLikeThisComponent.process(MoreLikeThisComponent.java:113)

  at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)

  at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
  at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)

  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)

  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)

  at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

  at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)

  at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

  at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)

  at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)

  at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)

  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

  at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)

  at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)

  at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)

  at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)

  at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)

  at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)

  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)

  at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)

  at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)

  at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

  at java.lang.Thread.run(Thread.java:724)

org.apache.solr.search.EarlyTerminatingCollectorException

  at
org.apache.solr.search.EarlyTerminatingCollector.collect(EarlyTerminatingCollector.java:62)

  at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:289)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:624)
  at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
  at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1494)

  at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)

  at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
  at
org.apache.solr.search.SolrIndexSearcher.getDocList(

RE: Solr 4.3.0 - SolrCloud lost all documents when leaders got rebuilt

2013-07-25 Thread Joshi, Shital
Thanks for all the answers. 

It appears that we will not have a data-center failure tolerant deployment of 
zookeeper without a 3rd datacenter. The other alternative is to forget about 
running zookeepers across datacenters, and instead have a live-warm deployment 
(and we'd have to manually switch/fail-over primary to backup if we lost or 
otherwise needed to do maintenance on the primary side).


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, July 25, 2013 7:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3.0 - SolrCloud lost all documents when leaders got rebuilt

Picking up on what Dominique mentioned: your ZK configuration
isn't doing you much good. Not only do you have an even number
of nodes (6, which is actually _less_ robust than having 5), but by splitting
them among two data centers you're effectively requiring the data
center with 4 nodes to always be up. If it goes down (or even if the
link between DCs is broken), DC2 will not be able to index
documents since the ZK nodes in DC2 can't find 4 ZK nodes to
work with.

By taking down the ZK quorum, you are effectively "freezing" the Solr
nodes with the snapshot of the system they knew about the last
time there was a quorum. It's a sticky wicket. Let's assume what you're
trying to do was allowed. Now let's assume that instead of the
machines being down you simply lost connectivity between your DCs
so the ZK nodes can't talk to each other. Now they'd each elect their
nodes as leaders. Any incoming indexing requests would be serviced.

Now the DCs are re-connected. How could the conflicts be resolved?
This is the "split brain" problem, something ZK is specifically designed
to prevent.

Best
Erick

On Wed, Jul 24, 2013 at 6:50 PM, Dominique Bejean
 wrote:
> With 6 zookeeper instances you need at least 4 instances running at the same 
> time. How can you decide to stop 4 instances and have only 2 instances 
> running ? Zookeeper can't work anymore in these conditions.
>
> Dominique
>
> On 25 Jul 2013, at 00:16, "Joshi, Shital"  wrote:
>
>> We have SolrCloud cluster (5 shards and 2 replicas) on 10 dynamic compute 
>> boxes (cloud), where 5 machines (leaders) are in datacenter1 and replicas on 
>> datacenter2.  We have 6 zookeeper instances - 4 on datacenter1 and 2 on 
>> datacenter2. The zookeeper instances are on same hosts as Solr nodes. We're 
>> using local disk (/local/data) to store solr index files.
>>
>> Infrastructure team wanted to rebuild dynamic compute boxes on datacenter1 
>> so we handed over all leader hosts to them. By doing so, We lost 4 zookeeper 
>> instances. We were expecting to see all replicas acting as leader. In order 
>> to confirm that, I went to admin console -> cloud page but the page never 
>> returned (kept hanging).  I checked log and saw constant zookeeper host 
>> connection exceptions (the zkHost system property had all 6 zookeeper 
>> instances). I restarted cloud on all replicas but got same error again. This 
>> exception is I think due to the zookeeper bug: 
>> https://issues.apache.org/jira/browse/SOLR-4899 I guess zookeeper never 
>> registered the replicas as leader.
>>
>> After dynamic compute machines were re-built (lost all local data) I 
>> restarted entire cloud (with 6 zookeeper and 10 nodes), the original leaders 
>> were still the leaders (I think zookeeper config never got updated with 
>> replicas being leader, though 2 zookeeper instances were still up). Since 
>> all leaders' /local/data/solr_data was empty, it got replicated to all 
>> replicas and we lost all data in our replica. We lost 26 million documents 
>> on replica. This was very awful.
>>
>> In our start up script (which brings up solr on all nodes one by one), the 
>> leaders are listed first.
>>
>> Any solution to this until Solr 4.4 release?
>>
>> Many Thanks!
>>
>>
>>
>>
>>


Re: Sort top N results in solr after boosting

2013-07-25 Thread Utkarsh Sengar
I agree with your comment on separating noise from the actually relevant
results.
My approach to separating relevant results from noise is not algorithmic but
an absolute cutoff, i.e. the top 5 or top 10 results will always be treated as
relevant (at least the probability is higher).
But again, that kind of simple sort can be done by the client too.

The current relevance is based purely on PMIs, which are calculated from the
clickstream data. I am also trying to figure out whether I can add extra
dimensions to the Solr score that take other attributes into consideration,
i.e. extending the way Solr computes the score with attachment_count (more
attachments, more important), confidence (a stronger source has higher
confidence), etc.

Is there a way I can have a custom scoring function which extends (rather than
overwrites) Solr's scores?
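
Something along these lines is what I have in mind - a rough SolrJ sketch only,
assuming an edismax handler and the attachment_count field mentioned above; the
boost function itself is just a placeholder I have not tested:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("iphone 5");
        q.set("defType", "edismax");
        // multiplicative boost: multiplies the text relevancy score instead of replacing it;
        // log(...) keeps a huge attachment_count from drowning out the text score
        q.set("boost", "log(sum(attachment_count,1))");
        q.setRows(10);
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits");
    }
}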

Thanks,
-Utkarsh


On Wed, Jul 24, 2013 at 7:35 PM, Erick Erickson wrote:

> You can certainly just include the attachment count in the
> response and have the app apply the secondary sort. But
> that doesn't separate the "noise" as you say.
>
> How would you identify "noise"? If you don't have an algorithmic
> way to do that, I don't know how you'd manage to separate
> the signal from the noise
>
> Best
> Erick
>
> On Wed, Jul 24, 2013 at 4:37 PM, Utkarsh Sengar 
> wrote:
> > I have a solr query which has a bunch of boost params for relevancy. This
> > search works fine and returns the most relevant documents as per the user
> > query. For example, if user searches for: "iphone 5", keywords like
> > "apple", "wifi" etc are boosted. I get these keywords from external
> > training. The top 10-20 results are iphone 5 phones and then it follows
> > iphone cases and other noise.
> >
> > But I also have a field in the schema called: attachment_count. I need to
> > sort the top N result I get after boost based on this field.
> >
> > Example:
> > I want to sort the top 5 documents based on attachment_count on the
> boosted
> > result (which are relevant for the user).
> >
> > 1. iphone 5 32gb, attachment_count=0
> > 2. iphone 5 16gb, attachment_count=5
> > 3. iphone 5 32gb, attachment_count=10
> > 4. iphone 4gs, attachment_count=3
> > 5. iphone 4, attachment_count=1
> > ...
> > 11. iphone 5 case, attachment_count=100
> >
> >
> > Expected result:
> > 1. iphone 5 32gb, attachment_count=10
> > 2. iphone 5 16gb, attachment_count=5
> > 3. iphone 4gs, attachment_count=3
> > 4. iphone 4, attachment_count=1
> > 5. iphone 5 32gb, attachment_count=0
> > ...
> > 11. iphone 5 case, attachment_count=100
> >
> >
> > Is this possible using a function query? I am not sure how the results
> will
> > look like but I want to try it out.
> >
> > --
> > Thanks,
> > -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: Why Solr slows down when accessed thru load balancer

2013-07-25 Thread Gora Mohanty
On 26 July 2013 00:11, kaustubh147  wrote:
> Hi,
>
> When I am connecting my application to solr thru a load balancer
> (https://domain name/apache-solr-4.0.0), it is significantly slow. but if I
> connect Solr directly (https://11.11.1.11:8080/apache-solr-4.0.0) on the
> application server it works better.
[...]

Um, clearly there is then some issue with your setup,
and that is what you need to debug.

What does this have to do with Solr?

Regards,
Gora


Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Shawn Heisey

On 7/25/2013 12:26 PM, Shawn Heisey wrote:

Either multipartUploadLimitInKB doesn't work properly, or there may be
some hard limits built into the servlet container, because I set
multipartUploadLimitInKB in the requestDispatcher config to 32768 and it
still didn't work.  I wonder, perhaps there is a client-side POST buffer
limit as well as the servlet container limit, which comes in to play
because the Solr server is acting as a client for the distributed requests?


Followup:

I should probably add that I used a different version (and got some 
different errors) because what I've got on my dev server is an old 
branch_4x version:


4.4-SNAPSHOT 1497605 - ncindex - 2013-06-27 17:12:30

My online production system is 4.2.1, but I am not going to run this 
query on that system because of the potential to break things.  I did 
try it against my backup production system running 3.5.0 with a 1MB 
server-side POST buffer and got an error that seems to at least 
partially confirm my suspicions.  Here's an excerpt:


HTTP ERROR 500

Problem accessing /solr/ncmain/select. Reason:

Form too large18425104>1048576  java.lang.IllegalStateException: 
Form too large18425104>1048576 	at 
org.mortbay.jetty.Request.extractParameters(Request.java:1561) 	at 
org.mortbay.jetty.Request.getParameterMap(Request.java:870) 	at 
org.apache.solr.request.ServletSolrParams.(ServletSolrParams.java:29) 
	at 
org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:394) 
	at 
org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115) 
	at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223) 
	at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) 
	at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 	at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) 
	at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 	at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) 	at 
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) 
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) 
	at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) 
	at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 	at 
org.mortbay.jetty.Server.handle(Server.java:326) 	at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) 	at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945) 
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756) 	at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) 	at 
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) 	at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) 
	at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)


Thanks,
Shawn



Why Solr slows down when accessed thru load balancer

2013-07-25 Thread kaustubh147
Hi,
 
When I am connecting my application to solr thru a load balancer
(https://domain name/apache-solr-4.0.0), it is significantly slow. but if I
connect Solr directly (https://11.11.1.11:8080/apache-solr-4.0.0) on the
application server it works better.

Ideally, use of a load balancer should give better performance. 

In our setup we have one load balancer which redirects the request to two
Apache web server instances, which eventually redirects to 4 Glassfish
application server instances.

Are we doing something wrong, or is it a known problem with the Solr-Glassfish
combination?
Please help

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-Solr-slows-down-when-accessed-thru-load-balancer-tp4080402.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hi Jack,

I should have pointed out our use case.  In any reasonable case where
actual end users will be looking at search results, paging 1,000 at a time
is reasonable.  But what we are doing is a dump of the unique ids with a
"*:*" query.   This allows us to verify that what our system thinks has
been indexed is actually indexed.   Since we need to dump out results in
the hundreds of millions,  requesting 1,000 at a time is not scalable.

The other context is that we currently index 10 million books, with each book as
a Solr document.  We are looking at indexing at the page level, which would result
in about 3 billion pages.  Part of that testing process is checking the scalability
of queries used by our current production system, such as the query we run against
the not-yet-released index to get a list of the unique ids that are actually indexed
in Solr.

Tom


On Thu, Jul 25, 2013 at 2:13 PM, Jack Krupansky wrote:

> As usual, there is no published hard limit per se, but I would urge
> caution about requesting more than 1,000 rows at a time or even 250. Sure,
> in a fair number of cases 5,000 or 10,000 or even 100,000 MAY work (at
> least sometimes), but Solr and Lucene are more appropriate for "paged"
> results, where page size is 10, 20, 50, 100 or something in that range. So,
> my recommendation is to use 250 to 1,000 as the limit for rows. And
> certainly do a proof of concept implementation for anything above 1,000.
>
> So, if rows=10 works for you, consider yourself lucky!
>
> That said, there is sometimes talk of supporting streaming, which
> presumably would allow access to all results, but chunked/paged in some way.
>
> -- Jack Krupansky
>
> -Original Message- From: Tom Burton-West
> Sent: Thursday, July 25, 2013 1:39 PM
> To: solr-user@lucene.apache.org
> Subject: Solr 4.2.1 limit on number of rows or number of hits per shard?
>
> Hello,
>
> I am running solr 4.2.1 on 3 shards and have about 365 million documents in
> the index total.
> I sent a query asking for 1 million rows at a time,  but I keep getting an
> error claiming that there is an invalid version or data not in javabin
> format (see below)
>
> If I lower the number of rows requested to 100,000, I have no problems.
>
> Does Solr have  a limit on number of rows that can be requested or is this
> a bug?
>
>
> Tom
>
> INFO: [core] webapp=/dev-1 path=/select
> params={shards=XXX:8111/dev-1/core,XXX:8111/dev-2/core,XXX:8111/dev-3/core&fl=vol_id&indent=on&start=0&q=*:*&rows=100}
> hits=365488789 status=500 QTime=132743
> Jul 25, 2013 1:26:00 PM org.apache.solr.common.SolrException log
> SEVERE: null:org.apache.solr.common.SolrException:
> java.lang.RuntimeException: Invalid version (expected 2, but 60) or the
> data in not in 'javabin' format
>    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:302)
>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
>    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
>    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>    at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.Runt

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Shawn Heisey

On 7/25/2013 11:39 AM, Tom Burton-West wrote:

Hello,

I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time,  but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)

If I lower the number of rows requested to 100,000, I have no problems.

Does Solr have  a limit on number of rows that can be requested or is this
a bug?


That particular javabin error (expected 2, but 60) usually means that 
the response it got was something other than javabin, typically HTML or XML.


I was going to say that you should hopefully get a more meaningful error 
message from the server log, but it appears that what you included *IS* 
the server log, so I'm really confused.  The error message you're 
getting is typically something you see on the *client* side.


After some testing on my server, I suspect that what's happening here is 
that the initial shard query (the one with fl=uniqueKeyField,score) is 
working, but then when Solr makes the HUGE subsequent requests for the 
actual documents it is interested in, the list is too big to fit in the 
server-side POST buffer, which defaults to 2MB.  Those queries need to 
be big enough to include an "ids" parameter that is a comma-separated 
list of values from your uniqueKey.  In my case, each of those values 
could be 32 characters, so the id list could be up to 33MB for a million 
of them.  Most of them are significantly shorter, so a 32MB buffer would 
be big enough.


Either multipartUploadLimitInKB doesn't work properly, or there may be 
some hard limits built into the servlet container, because I set 
multipartUploadLimitInKB in the requestDispatcher config to 32768 and it 
still didn't work.  I wonder, perhaps there is a client-side POST buffer 
limit as well as the servlet container limit, which comes into play 
because the Solr server is acting as a client for the distributed requests?
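
For reference, the relevant bit of solrconfig.xml looks roughly like this - a
sketch, not a verified fix, and the formdataUploadLimitInKB attribute is an extra
assumption on my part that the internal "ids" request arrives as url-encoded form
data rather than multipart:

<requestDispatcher>
  <!-- sketch: raise both request parser limits well above the defaults -->
  <requestParsers enableRemoteStreaming="true"
                  multipartUploadLimitInKB="32768"
                  formdataUploadLimitInKB="32768" />
</requestDispatcher>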


Thanks,
Shawn



Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Jack Krupansky
As usual, there is no published hard limit per se, but I would urge caution 
about requesting more than 1,000 rows at a time or even 250. Sure, in a fair 
number of cases 5,000 or 10,000 or even 100,000 MAY work (at least 
sometimes), but Solr and Lucene are more appropriate for "paged" results, 
where page size is 10, 20, 50, 100 or something in that range. So, my 
recommendation is to use 250 to 1,000 as the limit for rows. And certainly 
do a proof of concept implementation for anything above 1,000.


So, if rows=10 works for you, consider yourself lucky!

That said, there is sometimes talk of supporting streaming, which presumably 
would allow access to all results, but chunked/paged in some way.
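
In the meantime, if you really do need to walk the whole result set, the usual 
approach is to page through it in modest chunks. Here is a rough SolrJ sketch - the 
URL and the vol_id field are taken from this thread and are assumptions, and note 
that very deep start offsets get progressively slower, which is part of why 
streaming keeps coming up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class IdDumpSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8111/dev-1/core");
        int pageSize = 1000;
        for (int start = 0; ; start += pageSize) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("vol_id");      // the unique id field mentioned in this thread
            q.setStart(start);
            q.setRows(pageSize);
            SolrDocumentList page = server.query(q).getResults();
            for (SolrDocument doc : page) {
                System.out.println(doc.getFieldValue("vol_id"));
            }
            if (start + page.size() >= page.getNumFound()) {
                break;                  // walked past the last page
            }
        }
    }
}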


-- Jack Krupansky

-Original Message- 
From: Tom Burton-West

Sent: Thursday, July 25, 2013 1:39 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.2.1 limit on number of rows or number of hits per shard?

Hello,

I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time,  but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)

If I lower the number of rows requested to 100,000, I have no problems.

Does Solr have  a limit on number of rows that can be requested or is this
a bug?


Tom

INFO: [core] webapp=/dev-1 path=/select
params={shards=XXX:8111/dev-1/core,XXX:8111/dev-2/core,XXX:8111/dev-3/core&fl=vol_id&indent=on&start=0&q=*:*&rows=100}
hits=365488789 status=500 QTime=132743
Jul 25, 2013 1:26:00 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException:
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the
data in not in 'javabin' format
   at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:302)
   at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
   at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
   at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
   at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
   at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
   at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
   at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
   at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
   at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
   at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
   at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
   at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
   at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
   at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
   at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
   at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
   at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
   at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 60)
or the data in not in 'javabin' format
   at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:109)
   at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
: 



Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hello,

I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time,  but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)

If I lower the number of rows requested to 100,000, I have no problems.

Does Solr have  a limit on number of rows that can be requested or is this
a bug?


Tom

INFO: [core] webapp=/dev-1 path=/select
params={shards=XXX:8111/dev-1/core,XXX:8111/dev-2/core,XXX:8111/dev-3/core&fl=vol_id&indent=on&start=0&q=*:*&rows=100}
hits=365488789 status=500 QTime=132743
Jul 25, 2013 1:26:00 PM org.apache.solr.common.SolrException log
SEVERE: null:org.apache.solr.common.SolrException:
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the
data in not in 'javabin' format
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:302)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 60)
or the data in not in 'javabin' format
at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:109)
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
:


Re: Duplicate documents based on attribute

2013-07-25 Thread Alexandre Rafalovitch
Look for the presentations online. You are not the first store to use Solr,
there are some explanations around. Try one from Gilt, but I think there
were more.

You will want to store data at the lowest meaningful level of search
granularity. So, in your case, it might be ProductVariation (shoes+color).
Some examples I have seen even store it down to the availability level or
price-difference level. Then, you do some post-search normalization either
by doing groups or by doing filtering.

Solr is not a database, store what you want to find.
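
As a rough sketch of the grouping route (the field names here are made up for
illustration, not anything Solr requires), something like this collapses the
variations under their parent product while still returning each one:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearchSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("product a");
        q.set("group", "true");
        q.set("group.field", "product_id");  // hypothetical "parent product" field
        q.set("group.limit", "10");          // up to 10 variations (colors) per product
        QueryResponse rsp = server.query(q);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            System.out.println(cmd.getName() + ": " + cmd.getValues().size() + " product groups");
        }
    }
}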

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 25, 2013 at 12:42 PM, Mark  wrote:

> How would I go about doing something like this. Not sure if this is
> something that can be accomplished on the index side or its something that
> should be done in our application.
>
> Say we are an online store for shoes and we are selling Product A in red,
> blue and green. Is there a way when we search for Product A all three
> results can be returned even though they are logically the same item (same
> product in our database).
>
> Thoughts on how this can be accomplished?
>
> Thanks
>
> - M


Re: Duplicate documents based on attribute

2013-07-25 Thread Mark
I was hoping to do this from within Solr so that I don't have to manually mess 
around with pagination; otherwise the number of items on each page would be 
unpredictable. 
On Jul 25, 2013, at 9:48 AM, Anshum Gupta  wrote:

> Have a multivalued stored 'color' field and just iterate on it outside of
> solr.
> 
> 
> On Thu, Jul 25, 2013 at 10:12 PM, Mark  wrote:
> 
>> How would I go about doing something like this. Not sure if this is
>> something that can be accomplished on the index side or its something that
>> should be done in our application.
>> 
>> Say we are an online store for shoes and we are selling Product A in red,
>> blue and green. Is there a way when we search for Product A all three
>> results can be returned even though they are logically the same item (same
>> product in our database).
>> 
>> Thoughts on how this can be accomplished?
>> 
>> Thanks
>> 
>> - M
> 
> 
> 
> 
> -- 
> 
> Anshum Gupta
> http://www.anshumgupta.net



Re: Duplicate documents based on attribute

2013-07-25 Thread Anshum Gupta
Have a multivalued stored 'color' field and just iterate on it outside of
solr.
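For illustration, the schema.xml definition could look something like this (the
string type is an assumption, use whatever fits your data):

<field name="color" type="string" indexed="true" stored="true" multiValued="true" />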


On Thu, Jul 25, 2013 at 10:12 PM, Mark  wrote:

> How would I go about doing something like this. Not sure if this is
> something that can be accomplished on the index side or its something that
> should be done in our application.
>
> Say we are an online store for shoes and we are selling Product A in red,
> blue and green. Is there a way when we search for Product A all three
> results can be returned even though they are logically the same item (same
> product in our database).
>
> Thoughts on how this can be accomplished?
>
> Thanks
>
> - M




-- 

Anshum Gupta
http://www.anshumgupta.net


Duplicate documents based on attribute

2013-07-25 Thread Mark
How would I go about doing something like this. Not sure if this is something 
that can be accomplished on the index side or its something that should be done 
in our application. 

Say we are an online store for shoes and we are selling Product A in red, blue 
and green. Is there a way when we search for Product A all three results can be 
returned even though they are logically the same item (same product in our 
database).

Thoughts on how this can be accomplished?

Thanks

- M

Re: Do I need solr.xml?

2013-07-25 Thread Shawn Heisey

On 7/25/2013 8:21 AM, Brian Robinson wrote:

The sentence on the admin page just tells me to check the logs, but I
don't appear to have any yet. Those are located in
solr/collection1/data/tlog/, right?


Those are transaction logs - for durability in the face of failure and 
for the real-time get handler.  If you are using Solr 4.3.0 or later, 
especially using the example jetty container (start.jar), the logs 
should be in logs/solr.log, relative to the current working directory 
where Solr was started.  This location is dictated by log4j.properties. 
 If you have an earlier version or a more custom setup, the log 
location could be highly variable.


The admin UI should have a logging section that will show you everything 
in the log that's at least WARN severity.  Often this isn't enough, and 
you need the actual logfile.



The only browser I appear to be able to use in my SSH (my only access to
the server) is Lynx, so this response isn't formatted, unfortunately,
but this is what I get. I tried to find a sample response so I could
compare and parse out the useful information, but no luck. It looks like
there is only the one core, though. I don't see anything that looks like
an init failure.


The lynx browser isn't going to work for the admin UI.  The links 
(elinks on some systems) browser will display some things, but it 
doesn't work either.  The UI is pretty much all javascript, and you need 
a full graphical browser for that to function properly.


SSH clients (including putty) will let you do port forwarding.  Set it 
up to forward a local port (like 8983, but you can use what you want) to 
the remote port that Solr uses.  Then in your local browser, go to 
http://localhost:8983/solr (or whatever port you chose) and the UI 
should work.



that error indicates that your solr client sent a document to some (valid
and functioning) SolrCore which has a schema.xml that does not contain a
field named "brand".

So it could be that I just updated the wrong schema.xml. But if
/etc/solr/ is my solr home directory, then /etc/solr/collection1/conf/
would be the right location, right?


If you aren't using SolrCloud, and your core is named 'collection1', 
then that should be the right schema.  You must reload the core, or more 
ideally completely restart Solr, after changing the schema.


Thanks,
Shawn



RE: Spell check SOLR 3.6.1 not working for numbers

2013-07-25 Thread Dyer, James
I think the default SpellingQueryConverter has a hard time with terms that 
contain numbers.  Can you provide a failing case: the query you're executing 
(with all the spellcheck.xxx params) and the spellcheck response (or lack 
thereof)?  Is it producing any hits?

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Poornima Jay [mailto:poornima...@rocketmail.com] 
Sent: Thursday, July 25, 2013 5:00 AM
To: solr-user
Subject: Spell check SOLR 3.6.1 not working for numbers

Hi,

I am using Solr 3.6.1 and have implemented spellcheck. I found that numbers in the 
spellcheck query do not return any results. Below are my solrconfig.xml and 
schema.xml details. Could anyone let me know what needs to be done in order 
to get spellcheck working for numbers?

solrConfig

     
    default   
    solr.IndexBasedSpellChecker
    spell  
    ./spellchecker   
    0.7    
    true
    .0001
   
  textSpell



  
    
    default   
    
    false
    
    false
    
    10
  
      
      spellcheck
        
  

Schema

         
            
            
            
            
            
            
         
        
         
        
        
        
      
      




   
   
 
   

Thanks,
Poornima



Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-25 Thread Otis Gospodnetic
Hi,

SPM for Solr shows numDocs, maxDocs, and their delta.  Is that what
you are after?  See http://sematext.com/spm

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 17, 2013 at 4:06 PM, Furkan KAMACI  wrote:
> I have crawled some web pages and indexed them at my SolrCloud(Solr 4.2.1).
> However before I index them there was already some indexes. I can calculate
> the difference between current and previous document count. However it
> doesn't mean that I have indexed that count of documents. Because urls of
> websites are unique ids at my system. So it means that some of documents
> updated and they did not increased document count.
>
> My question is that: How can I learn the total count of how many documents
> indexed and how many documents updated?


Wildcard matching of dynamic fields

2013-07-25 Thread Artem Karpenko

Hi,

given a dynamic field

stored="true" />


There are some other suffix-based fields as well. Some of the fields 
in the document should be ignored; they have a "nosolr_" prefix. But defining


stored="false" />


even at the start of the schema does not work: the field 
"nosolr_inv_dunning_boolean" is recognized as boolean anyway and shown 
in search results. The documentation says that "longer patterns will be 
matched first, if equal size patterns both match, the first appearing in 
the schema will be used".


What can be done here (apart from changing the input document)? And, 
more generally: why would such a "longer wins" strategy be used here? What use 
case does it have?


Best,
Artem.


Re: Solr Cloud Setup

2013-07-25 Thread AdityaR
Thanks Erick and Flavio for your responses. Sorry, I meant that I was creating
collections and not cores. 

I used the same article as suggested by Flavio to set up the solr cloud and
I did it twice. Both the times I am facing the same issue. I am not sure
where the problem is.  

I am using the following versions: 

Zookeeper : 3.4.5
Solr : 4.3.1
tomcat : 6.0.36

Thanks, 
Aditya



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Setup-tp4080182p4080352.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Do I need solr.xml?

2013-07-25 Thread Brian Robinson



if you get an error on the admin UI, there should be specifics about
*what* the initialization failure is -- at last one sentence, and there
should be a full stack trace in the logs -- having those details will
help understand the root of your first problem, which may explain your
second problem.
The sentence on the admin page just tells me to check the logs, but I 
don't appear to have any yet. Those are located in 
solr/collection1/data/tlog/, right?




it would also help to know what the CoreAdmin handler returns when you ask
it for status about all the cores -- even if the *UI* is having problems
on your browser, that should return useful info (like: how many cores you
have -- if any -- and which one had an init failure)
The only browser I appear to be able to use in my SSH (my only access to 
the server) is Lynx, so this response isn't formatted, unfortunately, 
but this is what I get. I tried to find a sample response so I could 
compare and parse out the useful information, but no luck. It looks like 
there is only the one core, though. I don't see anything that looks like 
an init failure.


01collection1collection1true/etc/solr/collection1//etc/solr/collection1/data/solrconfig.xmlschema.xml2013-07-21T17:51:31.625Z33071636400010truefalseorg.apache.lucene.store.NRT
CachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/etc/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@7f322d18;
maxCacheMB=48.0 maxMergeSizeMB=4.0) 6565 bytes

My guess is it would look something like this if it were formatted

01
collection1
collection1
true
/etc/solr/collection1/
/etc/solr/collection1/data/
solrconfig.xml
schema.xml
2013-07-21T17:51:31.625Z33071636400010
truefalse
org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory
(org.apache.lucene.store.MMapDirectory@/etc/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@7f322d18; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)

6565 bytes




that error indicates that your solr client sent a document to some (valid
and functioning) SolrCore which has a schema.xml that does not contain a
field named "brand".
So it could be that I just updated the wrong schema.xml. But if 
/etc/solr/ is my solr home directory, then /etc/solr/collection1/conf/ 
would be the right location, right?


Brian


Re: SolrCloud commit process is too time consuming, even if documents are light

2013-07-25 Thread Mark Miller
I'm looking into some possible slowdown-after-long-indexing issues when I get 
back from vacation. This could be related. Very early guess though.

Another thing you might try - Lucene recently changed the merge scheduler 
policy defaults (in 4.1). It used to use up to 3 threads to merge and have a max 
merge setting of that + 2; it now defaults to 1 and 2, and that can really 
impact how fast documents are added by a significant amount. It also causes 
indexing threads to pause and wait for merges *way* more, especially when your 
index gets large and the merges start taking a long time. The tradeoff was 
supposedly that merges are faster, but honestly, I think it's a poor default, 
especially if you are measuring indexing speed and not really paying attention 
to how long merges go on after you finish indexing, and especially if you have 
beefy hardware. You might play with those settings.
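
For example, something along these lines in the indexConfig section of 
solrconfig.xml should bring behavior back closer to the old defaults - the 
numbers are only a starting point to experiment with, not a recommendation:

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxThreadCount">3</int>
  <int name="maxMergeCount">5</int>
</mergeScheduler>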

- Mark

On Jul 25, 2013, at 8:36 AM, Radu Ghita  wrote:

> Forgot to attach server and solr configurations:
> 
> SolrCloud 4.1, internal Zookeeper, 16 shards, custom java importer.
> Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192gb RAM, 10tb
> SSD and 50tb SAS memory
> 
> 
> On Thu, Jul 25, 2013 at 3:20 PM, Radu Ghita  wrote:
> 
>> 
>> Hi,
>> 
>> We are having a client with business model that requires indexing each
>> month billion rows into solr from mysql in a small time-frame. The
>> documents are very light, but the number is very high and we need to
>> achieve speeds of around 80-100k/s. The built in solr indexer goes to
>> 40-50k tops, but after some hours ( ~12h ) it crashes and the speed slows
>> down as hours go by.
>> 
>> Therefore we have developed a custom java importer that connects directly
>> to mysql and solrcloud via zookeeper, grabs data from mysql, creates
>> documents and then imports into solr. This helps because we are opening ~50
>> threads and the indexing process speeds up. We have optimized the mysql
>> queries ( mysql was the initial bottleneck ) and the speeds we get now are
>> over 100k/s, but as index number gets bigger, solr stays very long on
>> adding documents. I assume it needs to be something from solrconfig that
>> makes solr stay and even block after 100 mil documents indexed.
>> 
>> Here is the java code that creates documents and then adds to solr server:
>> 
>> public void createDocuments() throws SQLException, SolrServerException,
>> IOException
>> {
>> App.logger.write("Creating documents..");
>> this.docs = new ArrayList();
>> App.logger.incrementNumberOfRows(this.size);
>> while(this.results.next())
>> { this.docs.add(this.getDocumentFromResultSet(this.results)); }
>> 
>> this.statement.close();
>> this.results.close();
>> }
>> 
>> public void commitDocuments() throws SolrServerException, IOException
>> { App.logger.write("Committing.."); App.solrServer.add(this.docs); // here
>> it stays very long and then blocks
>> App.logger.incrementNumberOfRows(this.docs.size()); this.docs.clear(); }
>> 
>> I am also pasting solrconfig.xml parameters that make sense to this
>> discussion:
>> 128
>> false
>> 1
>> 100
>> 
>> 2
>> 100
>> 1
>> 
>> 100
>> 1024
>> 
>> 15000
>> 100
>> false
>> 
>> 
>> 200
>> 
>> 
>> The big problem stands in SOLR, because I've run the mysql queries single
>> and speed is great, but as the time passes solr adding function stays way
>> too long and then it blocks, even tho server is top level and has lots of
>> resources.
>> 
>> I'm new to this so please assist. Thanks,
>> --
>> 
>> **
>> 
>>  *Radu Ghita *
>> 
>>  Tel:   +40 721 18 18 68
>> 
>>  Fax:  +40 351 81 85 52
>> 
> 
> 
> 
> -- 
> 
> **
> 
>  *Radu Ghita *
> 
>  Tel:   +40 721 18 18 68
> 
>  Fax:  +40 351 81 85 52



Re: SolrCloud commit process is too time consuming, even if documents are light

2013-07-25 Thread Radu Ghita
Forgot to attach server and solr configurations:

SolrCloud 4.1, internal Zookeeper, 16 shards, custom java importer.
Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192gb RAM, 10tb
SSD and 50tb SAS memory


On Thu, Jul 25, 2013 at 3:20 PM, Radu Ghita  wrote:

>
> Hi,
>
> We are having a client with business model that requires indexing each
> month billion rows into solr from mysql in a small time-frame. The
> documents are very light, but the number is very high and we need to
> achieve speeds of around 80-100k/s. The built in solr indexer goes to
> 40-50k tops, but after some hours ( ~12h ) it crashes and the speed slows
> down as hours go by.
>
> Therefore we have developed a custom java importer that connects directly
> to mysql and solrcloud via zookeeper, grabs data from mysql, creates
> documents and then imports into solr. This helps because we are opening ~50
> threads and the indexing process speeds up. We have optimized the mysql
> queries ( mysql was the initial bottleneck ) and the speeds we get now are
> over 100k/s, but as index number gets bigger, solr stays very long on
> adding documents. I assume it needs to be something from solrconfig that
> makes solr stay and even block after 100 mil documents indexed.
>
> Here is the java code that creates documents and then adds to solr server:
>
> public void createDocuments() throws SQLException, SolrServerException,
> IOException
> {
> App.logger.write("Creating documents..");
> this.docs = new ArrayList();
> App.logger.incrementNumberOfRows(this.size);
> while(this.results.next())
> { this.docs.add(this.getDocumentFromResultSet(this.results)); }
>
> this.statement.close();
> this.results.close();
> }
>
> public void commitDocuments() throws SolrServerException, IOException
> { App.logger.write("Committing.."); App.solrServer.add(this.docs); // here
> it stays very long and then blocks
> App.logger.incrementNumberOfRows(this.docs.size()); this.docs.clear(); }
>
> I am also pasting solrconfig.xml parameters that make sense to this
> discussion:
> 128
> false
> 1
> 100
> 
> 2
> 100
> 1
> 
> 100
> 1024
> 
> 15000
> 100
> false
> 
> 
> 200
> 
>
> The big problem stands in SOLR, because I've run the mysql queries single
> and speed is great, but as the time passes solr adding function stays way
> too long and then it blocks, even tho server is top level and has lots of
> resources.
>
> I'm new to this so please assist. Thanks,
> --
>
> **
>
>   *Radu Ghita *
>
>   Tel:   +40 721 18 18 68
>
>   Fax:  +40 351 81 85 52
>



-- 

**

  *Radu Ghita *

  Tel:   +40 721 18 18 68

  Fax:  +40 351 81 85 52


Re: Using Solr to search between two Strings without using index

2013-07-25 Thread Roman Chyla
Hi,

I think you are pushing it too far - there is no 'string search' without an
index. And besides, these things are just better done by a few lines of
code - and if your array is too big, then you should create the index...
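
For example, something like this - plain Java, no Solr involved, just a minimal
sketch using the data from your message:

import java.util.ArrayList;
import java.util.List;

public class ArrayFilterSketch {
    public static void main(String[] args) {
        String[] array = {"Input1 is good", "Input2 is better", "Input2 is sweet", "Input3 is bad"};
        String[] wanted = {"Input1", "Input2"};
        List<String> matches = new ArrayList<String>();
        for (String candidate : array) {
            for (String term : wanted) {
                if (candidate.contains(term)) {  // simple substring match; lower-case both sides if case should not matter
                    matches.add(candidate);
                    break;
                }
            }
        }
        System.out.println(matches);  // [Input1 is good, Input2 is better, Input2 is sweet]
    }
}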

roman


On Thu, Jul 25, 2013 at 9:06 AM, Rohit Kumar  wrote:

> Hi,
>
> I have a scenario.
>
> String array = ["Input1 is good", ""Input2 is better", "Input2 is sweet",
> "Input3 is bad"]
>
> I want to compare the string array against the given input :
> String inputarray= ["Input1", "Input2"]
>
>
> It involves no indexes. I just want to use the power of string search to do
> a runtime search on the array and should return
>
> ["Input1 is good", ""Input2 is better", "Input2 is sweet"]
>
>
>
> Thanks
>


Using Solr to search between two Strings without using index

2013-07-25 Thread Rohit Kumar
Hi,

I have a scenario.

String array = ["Input1 is good", ""Input2 is better", "Input2 is sweet",
"Input3 is bad"]

I want to compare the string array against the given input :
String inputarray= ["Input1", "Input2"]


It involves no indexes. I just want to use the power of string search to do
a runtime search on the array and should return

["Input1 is good", ""Input2 is better", "Input2 is sweet"]



Thanks


Re: Solr 4.4.0 and solrj

2013-07-25 Thread santonel
OK, now it works perfectly. In the previous version I had renamed the default
collection, but with
 http://myserver/solr/
I was directly accessing
 http://myserver/solr/corename/
probably because the default collection became the one that I had renamed.

Thanks for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-4-0-and-solrj-tp4080282p4080322.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: maximum number of documents per shard?

2013-07-25 Thread Dmitry Kan
Well, we have hit the aforementioned JIRA issue with about 80 shards. Sharding
for us is purely a function of memory consumption, and we use lots of RAM. With
Solr 4, however, things look much better, and hopefully, having migrated from
Solr 3, we can live for a long time without hitting the limit again.



On Thu, Jul 25, 2013 at 3:07 PM, Jack Krupansky wrote:

> I don't think there is any hard limit, but it will be more of a
> performance-based limit. Going beyond a couple dozen shards (lets say, 25)
> would take you into uncharted territory, where a sophisticated proof of
> concept implementation is essential. "Hundreds" or "thousands" of shards
> are likely to be problematic from a performance perspective for average
> users. In fact, I'd say that 8 shards may be a semi-practical limit, beyond
> which the design switches from "a walk in the park" to "heroic efforts". I
> mean, you should be able to do 25 shards, but in practice you will have to
> be much more alert, more careful with your hardware selection and network
> design, etc.
>
> -- Jack Krupansky
>
> -Original Message- From: Nicole Lacoste
> Sent: Thursday, July 25, 2013 4:14 AM
> To: solr-user@lucene.apache.org
> Subject: Re: maximum number of documents per shard?
>
>
> Is there a limit on the number of shards?
>
> Niki
>
>
> On 24 July 2013 01:14, Jack Krupansky  wrote:
>
>  2.1 billion documents (including deleted documents) per Lucene index, but
>> essentially per Solr shard as well.
>>
>> But don’t even think about going that high. In fact, don't plan on going
>> above 100 million unless you do a proof of concept that validates that you
>> get acceptable query and update performance . There is no hard limit
>> besides that 2.1 billion Lucene limit, but... performance will vary.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Ali, Saqib
>> Sent: Tuesday, July 23, 2013 6:18 PM
>> To: solr-user@lucene.apache.org
>> Subject: maximum number of documents per shard?
>>
>> still 2.1 billion documents?
>>
>>
>
>
> --
> * 
> >*
>
>


Re: SolrCloud commit process is too time consuming, even if documents are light

2013-07-25 Thread Jack Krupansky
Auto soft commit is great for real time access, but you need to do hard 
commits periodically or else the transaction log (which is what assures that 
soft commits are durable) gets too big - it needs to be replayed on startup 
and is used for real-time search.


So, set the auto soft commit to the currency of updates that you need on 
search. Then set hard commit to something like every 10 minutes, 15 minutes, 
30 minutes, 1 hour, 2 hours, 4 hours, 8 hours, or whatever makes sense for 
your application.


Hard auto commit should of course be at a greater interval than auto soft 
commit.
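
For example, something along these lines in solrconfig.xml - the intervals are 
placeholders, pick whatever matches your application:

<autoCommit>
  <maxTime>600000</maxTime>           <!-- hard commit every 10 minutes, for durability -->
  <openSearcher>false</openSearcher>  <!-- don't open a new searcher on the hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>             <!-- soft commit every 5 seconds, for visibility -->
</autoSoftCommit>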


-- Jack Krupansky

-Original Message- 
From: Radu Ghita

Sent: Thursday, July 25, 2013 8:20 AM
To: solr-user@lucene.apache.org
Subject: SolrCloud commit process is too time consuming, even if documents 
are light


Hi,

We are having a client with business model that requires indexing each
month billion rows into solr from mysql in a small time-frame. The
documents are very light, but the number is very high and we need to
achieve speeds of around 80-100k/s. The built in solr indexer goes to
40-50k tops, but after some hours ( ~12h ) it crashes and the speed slows
down as hours go by.

Therefore we have developed a custom java importer that connects directly
to mysql and solrcloud via zookeeper, grabs data from mysql, creates
documents and then imports into solr. This helps because we are opening ~50
threads and the indexing process speeds up. We have optimized the mysql
queries ( mysql was the initial bottleneck ) and the speeds we get now are
over 100k/s, but as index number gets bigger, solr stays very long on
adding documents. I assume it needs to be something from solrconfig that
makes solr stay and even block after 100 mil documents indexed.

Here is the java code that creates documents and then adds to solr server:

public void createDocuments() throws SQLException, SolrServerException,
IOException
{
App.logger.write("Creating documents..");
this.docs = new ArrayList();
App.logger.incrementNumberOfRows(this.size);
while(this.results.next())
{ this.docs.add(this.getDocumentFromResultSet(this.results)); }

this.statement.close();
this.results.close();
}

public void commitDocuments() throws SolrServerException, IOException
{ App.logger.write("Committing.."); App.solrServer.add(this.docs); // here
it stays very long and then blocks
App.logger.incrementNumberOfRows(this.docs.size()); this.docs.clear(); }

I am also pasting solrconfig.xml parameters that make sense to this
discussion:
128
false
1
100

2
100
1

100
1024

15000
100
false


200


The big problem stands in SOLR, because I've run the mysql queries single
and speed is great, but as the time passes solr adding function stays way
too long and then it blocks, even tho server is top level and has lots of
resources.

I'm new to this so please assist. Thanks,
--

**

 *Radu Ghita *

 Tel:   +40 721 18 18 68

 Fax:  +40 351 81 85 52 



Re: softCommit doesn't work - ?

2013-07-25 Thread tskom
My actual solconfig.xml is:



   ${solr.ulog.dir:}

  
   1  
   true



I tried the following (SolrJ 4.3.1) and checked the index after 10 seconds in each case:

1) server.add(doc) - nothing in the index 

2) server.add(doc, 1) - nothing in the index

3) server.add(doc) and server.commit() - all fine, but I don't want to
hard commit after each document!

Any additional suggestions?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/softCommit-doesn-t-work-tp4079578p4080319.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud commit process is too time consuming, even if documents are light

2013-07-25 Thread Radu Ghita
Hi,

We are having a client with business model that requires indexing each
month billion rows into solr from mysql in a small time-frame. The
documents are very light, but the number is very high and we need to
achieve speeds of around 80-100k/s. The built in solr indexer goes to
40-50k tops, but after some hours ( ~12h ) it crashes and the speed slows
down as hours go by.

Therefore we have developed a custom java importer that connects directly
to mysql and solrcloud via zookeeper, grabs data from mysql, creates
documents and then imports into solr. This helps because we are opening ~50
threads and the indexing process speeds up. We have optimized the mysql
queries ( mysql was the initial bottleneck ) and the speeds we get now are
over 100k/s, but as index number gets bigger, solr stays very long on
adding documents. I assume it needs to be something from solrconfig that
makes solr stay and even block after 100 mil documents indexed.

Here is the java code that creates documents and then adds to solr server:

public void createDocuments() throws SQLException, SolrServerException, IOException
{
    App.logger.write("Creating documents..");
    // buffers the whole JDBC result set in memory before anything is sent to Solr
    this.docs = new ArrayList<SolrInputDocument>();
    App.logger.incrementNumberOfRows(this.size);
    while (this.results.next())
    {
        this.docs.add(this.getDocumentFromResultSet(this.results));
    }
    this.statement.close();
    this.results.close();
}

public void commitDocuments() throws SolrServerException, IOException
{
    App.logger.write("Committing..");
    App.solrServer.add(this.docs); // here it stays very long and then blocks
    App.logger.incrementNumberOfRows(this.docs.size());
    this.docs.clear();
}
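
For comparison, a minimal sketch of the same loop sending bounded batches
instead of one huge list (the 1000-document batch size is a hypothetical
value; App.solrServer and getDocumentFromResultSet are the names from the
snippet above):

public void indexInBatches() throws SQLException, SolrServerException, IOException
{
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (this.results.next())
    {
        batch.add(this.getDocumentFromResultSet(this.results));
        if (batch.size() >= 1000)
        {
            App.solrServer.add(batch); // each add() now carries at most 1000 docs
            batch.clear();
        }
    }
    if (!batch.isEmpty())
    {
        App.solrServer.add(batch);
    }
    this.statement.close();
    this.results.close();
}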

I am also pasting solrconfig.xml parameters that make sense to this
discussion:
128
false
1
100

2
100
1

100
1024

15000
100
false


200


The big problem is in Solr: I've run the MySQL queries on their own and the
speed is great, but as time passes the Solr add call takes far too long and
then blocks, even though the server is high-spec and has plenty of
resources.

I'm new to this so please assist. Thanks,
-- 

**

  *Radu Ghita *

  Tel:   +40 721 18 18 68

  Fax:  +40 351 81 85 52


Re: new field type - enum field

2013-07-25 Thread Erick Erickson
Start here: http://wiki.apache.org/solr/HowToContribute

Then, when your patch is ready submit a JIRA and attach
your patch. Then nudge (gently) if none of the committers
picks it up and applies it

NOTE: It is _not_ necessary that the first version of your
patch is completely polished. I often put up partial/incomplete
patches (comments with //nocommit are explicitly caught by
the "ant precommit" target for instance) to see if anyone
has any comments before polishing.

Best
Erick

On Thu, Jul 25, 2013 at 5:04 AM, Elran Dvir  wrote:
> Hi,
>
> I have implemented like Chris described it:
> The field is indexed as numeric, but displayed as string, according to 
> configuration.
> It applies to facet, pivot, group and query.
>
> How do we proceed? How do I contribute it?
>
> Thanks.
>
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Thursday, July 25, 2013 4:40 AM
> To: solr-user@lucene.apache.org
> Subject: Re: new field type - enum field
>
>
> : Doable at Lucene level by any chance?
>
> Given how well the Trie fields compress (ByteField and ShortField have been 
> deprecated in favor of TrieIntField for this reason) it probably just makes 
> sense to treat it as a numeric at the Lucene level.
>
> : > If there's positive feedback, I'll open an issue with a patch for the 
> functionality.
>
> I've typically dealt with this sort of thing at the client layer using a 
> simple numeric field in Solr, or used an UpdateProcessor to convert the
> String->numeric mapping when indexing & used clinet logic of a
> DocTransformer to handle the stored value at query time -- but having a built 
> in FieldType that handles that for you automatically (and helps ensure the 
> indexed values conform to the enum) would certainly be cool if you'd like to 
> contribute it.
>
>
> -Hoss
>
> Email secured by Check Point


Re: Solr Cloud Setup

2013-07-25 Thread Flavio Pompermaier
I find this article very interesting about cloud deployment:
http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html

Best,
Flavio


On Thu, Jul 25, 2013 at 1:59 PM, Erick Erickson wrote:

> I'd advise you to tear it down and start over. You should be
> creating new _collections_, not cores at this level I believe. And
> manually editing the cluster state is just _asking_ for
> trouble unless you really understand what's happening under
> the covers, and since you say you're relatively new
>
> See:
> http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
>
> Best
> Erick
>
> On Wed, Jul 24, 2013 at 6:11 PM, AdityaR 
> wrote:
> > Hi,
> >
> > I am new to solr and am trying to setup a solr cloud
> >
> > I have created 3 server solr cloud and 1 zookeeeper and I am facing the
> > following problems with my set up.
> >
> > 1) When I create a new core using the collections API , the cores are
> > created, but all are in down state. How can I make them active? or is
> there
> > anything wrong with my set up?
> >
> > I edited the clusterstate.json to get them active and then they become
> > active.
> >
> > 2 ) I have configured a collection to have 2 shards and 2 replicas and
> added
> > documents to the collection. But when I query the servers, I am getting
> > inconsistent results. I have in one shard 241 documents and in another
> 230
> > documents. When I query any server in the cloud I get randomly 471, 230
> or
> > 241 documents. Could you suggest as to where the problem might be.
> >
> > Thanks,
> > Aditya
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Setup-tp4080182.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Querying a specific core in solr cloud

2013-07-25 Thread Erick Erickson
Vicky:

Please define "&distrib=false doesn't work".
_What_ doesn't work? What are the symptoms?
It could be a bug or it could be a misunderstanding, I
have no way of even guessing.
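
For reference, a non-distributed query against one specific core looks
something like this (host, port and core name are placeholders):

    http://localhost:8983/solr/mycollection_shard1_replica1/select?q=*:*&distrib=false

With distrib=false the node should only return documents from that core's
local index.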

Best
Erick

On Thu, Jul 25, 2013 at 3:52 AM, vicky desai  wrote:
> Hi,
>
> I have also noticed that once I put the core up on both the machine
> &distrib=false works well. could this be a possible bug that when a core is
> down on one instance &distrib=false doesnt work
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Querying-a-specific-core-in-solr-cloud-tp4079964p4080246.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do I need solr.xml?

2013-07-25 Thread Erick Erickson
Actually, you're getting a solr.xml file but you don't know it.
When Solr doesn't find solr.xml, there's a default one hard-
coded that is used. See ConfigSolrOld.java, at the end
DEF_SOLR_XML is defined.
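
For reference, that hard-coded default corresponds roughly to a solr.xml
along these lines (check DEF_SOLR_XML for the exact text; this is only a
sketch):

<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${hostPort:8983}"
         hostContext="${hostContext:solr}"
         zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>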

So, as Hoss says, it's much better to make one anyway so
you know what you're getting.

Consider setting it up for the "discovery" mode, see:
http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond

Best
Erick

On Wed, Jul 24, 2013 at 9:21 PM, Chris Hostetter
 wrote:
>
> : I get what looks like the admin page, but it says that there are solr core
> : initialization failures, and the links on the page just bring me back to the
> : same page.
>
> if you get an error on the admin UI, there should be specifics about
> *what* the initialization failure is -- at last one sentence, and there
> should be a full stack trace in the logs -- having those details will
> help understand the root of your first problem, which may explain your
> second problem.
>
> it would also help to know what the CoreAdmin handler returns when you ask
> it for status about all the cores -- even if the *UI* is having problems
> on your browser, that should return useful info (like: how many cores you
> have -- if any -- and which one had an init failure)
>
> https://cwiki.apache.org/confluence/display/solr/CoreAdminHandler+Parameters+and+Usage#CoreAdminHandlerParametersandUsage-{{STATUS}}
>
> : Second, when I try to put a doc in the index using the PHP Pecl Solr package
> : from a page on my site, I get errors that indicate that Solr can't see my
> : schema.xml file, since Solr doesn't recognize some of the fields that I've
> : defined. I have my updated schema.xml file in /etc/solr/collection1/conf/
>
> that doesn't make sense -- if solr can't see your schema.xml file at all,
> you wouldn't get an error about the fields you defined being missing --
> you'd get an error about the collection you are talking to not existing,
> because if your schema.xml file can't be found (or has a problem loading)
> the entire SolrCore won't load.
>
> : ERROR: [doc=334455] unknown field 'brand'  : name="code">400   ' in X:
> : SolrClient->addDocument(Object(SolrInputDocument)) #1 {main} thrown in 
> XX
>
> that error indicates that your solr client sent a document to some (valid
> and functioning) SolrCore which has a schema.xml that does not contain a
> field named "brand".
>
> : And this is the relevant section of my schema.xml
> :
> : : required="true"/>
>
> my best guess: you have multiple core defined in your solr setup -- one of
> which is working, and is what your client is trying to talk to, but which
> doesn't have the schema.xml that you put your domain specific fields in
> (maybe it's just the default example configs?) and you have another core
> defined, using your customized configs, which failed to load properly.
>
> you mentioned that you did in fact put your configs in "collection1" dir,
> but w/o the specifics of what your solr home dir structure looks like, and
> the specifics of your error message, and details about the URLs your
> client tried to talk to when it got that error, etc  it's all just
> guesswork on our parts.
>
> http://wiki.apache.org/solr/UsingMailingLists
>
> : So my question is: do I actually need to create a solr.xml file, and all the
> : accompanying files that go into specifying a core? (I'm not sure if there 
> are,
> : but from some of the documentation it seems like there may be.) Or am I
> : pursuing an unnecessary solution to these problems, and there's a simpler 
> fix?
>
> the short answer of your specific question is "no", you don't *have* to
> have a solr.xml (at least not in Solr 4.x) but it's a really good idea,
> even if you only want a single core, because it gives you a way to be
> explicit about what you want and be sure it's what you are getting.
>
>
> -Hoss


Re: maximum number of documents per shard?

2013-07-25 Thread Jack Krupansky
I don't think there is any hard limit, but it will be more of a 
performance-based limit. Going beyond a couple dozen shards (lets say, 25) 
would take you into uncharted territory, where a sophisticated proof of 
concept implementation is essential. "Hundreds" or "thousands" of shards are 
likely to be problematic from a performance perspective for average users. 
In fact, I'd say that 8 shards may be a semi-practical limit, beyond which 
the design switches from "a walk in the park" to "heroic efforts". I mean, 
you should be able to do 25 shards, but in practice you will have to be much 
more alert, more careful with your hardware selection and network design, 
etc.


-- Jack Krupansky

-Original Message- 
From: Nicole Lacoste

Sent: Thursday, July 25, 2013 4:14 AM
To: solr-user@lucene.apache.org
Subject: Re: maximum number of documents per shard?

Is there a limit on the number of shards?

Niki


On 24 July 2013 01:14, Jack Krupansky  wrote:


2.1 billion documents (including deleted documents) per Lucene index, but
essentially per Solr shard as well.

But don’t even think about going that high. In fact, don't plan on going
above 100 million unless you do a proof of concept that validates that you
get acceptable query and update performance . There is no hard limit
besides that 2.1 billion Lucene limit, but... performance will vary.

-- Jack Krupansky

-Original Message- From: Ali, Saqib
Sent: Tuesday, July 23, 2013 6:18 PM
To: solr-user@lucene.apache.org
Subject: maximum number of documents per shard?

still 2.1 billion documents?





--
* * 



Re: Solr Cloud Setup

2013-07-25 Thread Erick Erickson
I'd advise you to tear it down and start over. You should be
creating new _collections_, not cores at this level I believe. And
manually editing the cluster state is just _asking_ for
trouble unless you really understand what's happening under
the covers, and since you say you're relatively new

See: 
http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API
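
For example, a collection with 2 shards and 2 replicas (the layout described
below) can be created with a single Collections API call along these lines
(host and collection name are placeholders):

    http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&maxShardsPerNode=2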

Best
Erick

On Wed, Jul 24, 2013 at 6:11 PM, AdityaR  wrote:
> Hi,
>
> I am new to solr and am trying to setup a solr cloud
>
> I have created 3 server solr cloud and 1 zookeeeper and I am facing the
> following problems with my set up.
>
> 1) When I create a new core using the collections API , the cores are
> created, but all are in down state. How can I make them active? or is there
> anything wrong with my set up?
>
> I edited the clusterstate.json to get them active and then they become
> active.
>
> 2 ) I have configured a collection to have 2 shards and 2 replicas and added
> documents to the collection. But when I query the servers, I am getting
> inconsistent results. I have in one shard 241 documents and in another 230
> documents. When I query any server in the cloud I get randomly 471, 230 or
> 241 documents. Could you suggest as to where the problem might be.
>
> Thanks,
> Aditya
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Setup-tp4080182.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto Indexing in Solr

2013-07-25 Thread Jack Krupansky
If that level of scripting is difficult for you, consider the LucidWorks 
Search product which has a built-in scheduler for crawl/import jobs, 
including web sites, file system directories, sharepoint repositories, and 
databases.


See:
http://docs.lucidworks.com/display/help/Crawling+Content

-- Jack Krupansky

-Original Message- 
From: archit2112

Sent: Thursday, July 25, 2013 2:12 AM
To: solr-user@lucene.apache.org
Subject: Auto Indexing in Solr

Hi, I'm using Solr 4's Data Import Utility to index an Oracle 10g XE database.
I'm using full imports as well as delta imports. I want these processes to be
automatic (e.g. the import processes can be timed, or should be executed as
soon as any data in the database is modified). I searched for the same online
and I heard people talk about CRON and scripts. However, I'm not able to
figure out how to implement it. Can you please provide a tutorial-like
explanation? Thanks in advance
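
As a rough sketch of the cron approach, a crontab entry that fires the DIH
full-import every night at 2 a.m. could look like this (the host, port and
handler path are placeholders and need to match your own setup):

    0 2 * * * curl -s "http://localhost:8983/solr/dataimport?command=full-import&clean=false" > /dev/null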




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-Indexing-in-Solr-tp4080233.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Processing a lot of results in Solr

2013-07-25 Thread Otis Gospodnetic
Mikhail,

Yes, +1.
This question comes up a few times a year.  Grant created a JIRA issue
for this many moons ago.

https://issues.apache.org/jira/browse/LUCENE-2127
https://issues.apache.org/jira/browse/SOLR-1726

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 24, 2013 at 9:58 PM, Mikhail Khludnev
 wrote:
> fwiw,
> i did some prototype with the following differences:
> - it streams straight to the socket output stream
> - it streams on-going during collecting, without necessity to store a
> bitset.
> It might have some limited extreme usage. Is there anyone interested?
>
>
> On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla  wrote:
>
>> On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber  wrote:
>>
>> > That sounds like a satisfactory solution for the time being -
>> > I am assuming you dump the data from Solr in a csv format?
>> >
>>
>> JSON
>>
>>
>> > How did you implement the streaming processor ? (what tool did you use
>> for
>> > this? Not familiar with that)
>> >
>>
>> this is what dumps the docs:
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java
>>
>> it is called by one of our batch processors, which can pass it a bitset of
>> recs
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java
>>
>> as far as streaming is concerned, we were all very nicely surprised, a few
>> GB file (on local network) took ridiculously short time - in fact, a
>> colleague of mine was assuming it is not working, until we looked into the
>> downloaded file ;-), you may want to look at line 463
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java
>>
>> roman
>>
>>
>> > You say it takes a few minutes only to dump the data - how long does it
>> to
>> > stream it back in, are performances acceptable (~ within minutes) ?
>> >
>> > Thanks,
>> > Matt
>> >
>> > On 7/23/13 6:57 PM, "Roman Chyla"  wrote:
>> >
>> > >Hello Matt,
>> > >
>> > >You can consider writing a batch processing handler, which receives a
>> > >query
>> > >and instead of sending results back, it writes them into a file which is
>> > >then available for streaming (it has its own UUID). I am dumping many
>> GBs
>> > >of data from solr in few minutes - your query + streaming writer can go
>> > >very long way :)
>> > >
>> > >roman
>> > >
>> > >
>> > >On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber 
>> wrote:
>> > >
>> > >> Hello Solr users,
>> > >>
>> > >> Question regarding processing a lot of docs returned from a query; I
>> > >> potentially have millions of documents returned back from a query.
>> What
>> > >>is
>> > >> the common design to deal with this ?
>> > >>
>> > >> 2 ideas I have are:
>> > >> - create a client service that is multithreaded to handled this
>> > >> - Use the Solr "pagination" to retrieve a batch of rows at a time
>> > >>("start,
>> > >> rows" in Solr Admin console )
>> > >>
>> > >> Any other ideas that I may be missing ?
>> > >>
>> > >> Thanks,
>> > >> Matt
>> > >>
>> > >>
>> > >>
>> >
>> >
>> >
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
>  


Re: Solr 4.4.0 and solrj

2013-07-25 Thread Erick Erickson
"collection1" is the default, so when you enter
http://myserver/solr/, under the covers you get
http://myserver/solr/collection1/.

So go ahead and rename your cores, but address
them specifically as
http://myserver/solr/corename/
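
A minimal solrj sketch of the same thing, using the core name from the
core.properties below (host and port are placeholders):

    HttpSolrServer server = new HttpSolrServer("http://myserver:8983/solr/soccerevents");
    server.ping(); // should return OK once the URL points at the named core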

Best
Erick

On Thu, Jul 25, 2013 at 7:10 AM, santonel  wrote:
> Hi
>
> I've upgraded my solr server (a single core with single collection) from
> 4.3.1 to 4.4.0, using the new solr.xml
> configuration file from example and setting the new core.properties (with my
> collection name) under the instance dir.
>
> When i check the status of solr via web interface, all is up and going
> smoothly (it find the core with autodiscovery),
> and i can query and get responses from the solr server.
>
> When i try to access solr with an application that i wrote via solrj, using
> the same parameter i was using for past version
> it return an exception:
> RemoteSolrException: Server at http://(my-server-address)/solr returned non
> ok status:404, message:Not Found
> Even a simple call of server.ping() return the same exception.
>
> So i've changed my instance dir name as "collection1" and put the same value
> on core.properties, restarted the solr server
> and the application started to work again.
>
> Is there someghing i'm missing? It's a strange behaviour because via web
> interface everything is regularly, but
> when i try to do some action via solrj with a custom core name it return an
> exception.
>
> Any help is appreciated! Thanks
>
> This is my solr.xml
>
> 
>
>   
> ${host:}
> ${jetty.port:8983}
> ${hostContext:solr}
> ${zkClientTimeout:15000}
> ${genericCoreNodeNames:true}
>   
>
>class="HttpShardHandlerFactory">
> ${socketTimeout:0}
> ${connTimeout:0}
>   
>
> 
>
> And this is the only entry in core.properties, under the instance directory
> (/opt/solr-4.4.0/example/solr/soccerevents):
> name=soccerevents
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-4-4-0-and-solrj-tp4080282.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shows different result with using 'and' and 'AND'

2013-07-25 Thread Raymond Wiker
The query syntax is case sensitive; "and" is treated as a search term and
not as an operator.
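
For example (default field and operator assumed):

    q=apples AND oranges   -> AND is an operator, both terms are required
    q=apples and oranges   -> three plain terms, "and" is searched as text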


On Thu, Jul 25, 2013 at 1:00 PM, Payal.Mulani <
payal.mul...@highqsolutions.com> wrote:

> Hi,
>
> I am using solr14 and when I search with 'and' the it searches the
> documents
> containing 'and' as a text but If I am searching with 'AND' word then it
> will not search 'and'  as a text and taking as a logical operator so any
> one
> have idea that why this both makes difference.
> Also both giving different result..
>
> Please if any one know let me know..
>
> Thanks.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Shows-different-result-with-using-and-and-AND-tp4080280.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr 4.3.0 - SolrCloud lost all documents when leaders got rebuilt

2013-07-25 Thread Erick Erickson
Picking up on what Dominique mentioned. Your ZK configuration
isn't doing you much good. Not only do you have an even number
6 (which is actually _less_ robust than having 5), but by splitting
them among two data centers you're effectively requiring the data
center with 4 nodes to always be up. If it goes down (or even if the
link between DCs is broken), DC2 will not be able to index
documents since the ZK nodes in DC2 can't find 4 ZK nodes to
work with.
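
The arithmetic behind that: a ZooKeeper ensemble of N nodes needs a majority
of floor(N/2) + 1 nodes to form a quorum. For N=6 that is 4 nodes (so only 2
failures are tolerated), exactly the same failure tolerance as a 5-node
ensemble, which needs 3.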

By taking down the ZK quorum, you are effectively "freezing" the Solr
nodes with the snapshot of the system they knew about the last
time there was a quorum. It's a sticky wicket. Let's assume what you're
trying to do was allowed. Now let's assume that instead of the
machines being down you simply lost connectivity between your DCs
so the ZK nodes can't talk to each other. Now they'd each elect their
nodes as leaders. Any incoming indexing requests would be serviced.

Now the DCs are re-connected. How could the conflicts be resolved?
This is the "split brain" problem, something ZK is specifically designed
to prevent.

Best
Erick

On Wed, Jul 24, 2013 at 6:50 PM, Dominique Bejean
 wrote:
> With 6 zookeeper instances you need at least 4 instances running at the same 
> time. How can you decide to stop 4 instances and have only 2 instances 
> running ? Zookeeper can't work anymore in these conditions.
>
> Dominique
>
> Le 25 juil. 2013 à 00:16, "Joshi, Shital"  a écrit :
>
>> We have SolrCloud cluster (5 shards and 2 replicas) on 10 dynamic compute 
>> boxes (cloud), where 5 machines (leaders) are in datacenter1 and replicas on 
>> datacenter2.  We have 6 zookeeper instances - 4 on datacenter1 and 2 on 
>> datacenter2. The zookeeper instances are on same hosts as Solr nodes. We're 
>> using local disk (/local/data) to store solr index files.
>>
>> Infrastructure team wanted to rebuild dynamic compute boxes on datacenter1 
>> so we handed over all leader hosts to them. By doing so, We lost 4 zookeeper 
>> instances. We were expecting to see all replicas acting as leader. In order 
>> to confirm that, I went to admin console -> cloud page but the page never 
>> returned (kept hanging).  I checked log and saw constant zookeeper host 
>> connection exceptions (the zkHost system property had all 6 zookeeper 
>> instances). I restarted cloud on all replicas but got same error again. This 
>> exception is I think due to the zookeeper bug: 
>> https://issues.apache.org/jira/browse/SOLR-4899 I guess zookeeper never 
>> registered the replicas as leader.
>>
>> After dynamic compute machines were re-built (lost all local data) I 
>> restarted entire cloud (with 6 zookeeper and 10 nodes), the original leaders 
>> were still the leaders (I think zookeeper config never got updated with 
>> replicas being leader, though 2 zookeeper instances were still up). Since 
>> all leaders' /local/data/solr_data was empty, it got replicated to all 
>> replicas and we lost all data in our replica. We lost 26 million documents 
>> on replica. This was very awful.
>>
>> In our start up script (which brings up solr on all nodes one by one), the 
>> leaders are listed first.
>>
>> Any solution to this until Solr 4.4 release?
>>
>> Many Thanks!
>>
>>
>>
>>
>>


Solr 4.4.0 and solrj

2013-07-25 Thread santonel
Hi

I've upgraded my solr server (a single core with single collection) from
4.3.1 to 4.4.0, using the new solr.xml 
configuration file from example and setting the new core.properties (with my
collection name) under the instance dir.

When I check the status of Solr via the web interface, all is up and running
smoothly (it finds the core with autodiscovery),
and I can query and get responses from the Solr server.

When I try to access Solr with an application that I wrote via solrj, using
the same parameters I was using for the past version,
it returns an exception:
RemoteSolrException: Server at http://(my-server-address)/solr returned non
ok status:404, message:Not Found
Even a simple call of server.ping() return the same exception.

So I've changed my instance dir name to "collection1" and put the same value
in core.properties, restarted the Solr server,
and the application started to work again.

Is there something I'm missing? It's strange behaviour because via the web
interface everything works normally, but
when I try to do some action via solrj with a custom core name it returns an
exception.

Any help is appreciated! Thanks

This is my solr.xml



  
${host:}
${jetty.port:8983}
${hostContext:solr}
${zkClientTimeout:15000}
${genericCoreNodeNames:true}
  

  
${socketTimeout:0}
${connTimeout:0}
  



And this is the only entry in core.properties, under the instance directory
(/opt/solr-4.4.0/example/solr/soccerevents):
name=soccerevents



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-4-0-and-solrj-tp4080282.html
Sent from the Solr - User mailing list archive at Nabble.com.


Performance vs. maxBufferedAddsPerServer=10

2013-07-25 Thread Otis Gospodnetic
Hi,

Context:
* https://issues.apache.org/jira/browse/SOLR-4956
* 
http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer

As you can see, maxBufferedAddsPerServer = 10.

We have an app that sends 20K docs to SolrCloud using CloudSolrServer.
We batch 20K docs for performance reasons. But then the receiving node
ends up sending VERY small batches of just 10 docs around for indexing
and we lose the benefit of batching those 20K docs in the first place.

Our app is "add only".

Is there anything one can do to avoid performance loss associated with
maxBufferedAddsPerServer=10?

Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm


Shows different result with using 'and' and 'AND'

2013-07-25 Thread Payal.Mulani
Hi,

I am using solr14, and when I search with 'and' it searches the documents
containing 'and' as text, but if I search with 'AND' it does not treat 'and'
as text and instead takes it as a logical operator. Does anyone have an idea
why these two behave differently? They also give different results.

If anyone knows, please let me know.

Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Shows-different-result-with-using-and-and-AND-tp4080280.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto Indexing in Solr

2013-07-25 Thread archit2112
I have to execute this command for a full-import:

http://localhost:8983/solr/dataimport?command=full-import

Can you explain how I would use the Java timer to fire this HTTP request?
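
A minimal sketch with java.util.Timer (the 30-minute interval is an arbitrary
example; the URL is the one above):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Timer;
import java.util.TimerTask;

public class DataImportScheduler {
    public static void main(String[] args) {
        Timer timer = new Timer(); // non-daemon thread keeps the JVM alive
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    URL url = new URL("http://localhost:8983/solr/dataimport?command=full-import");
                    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                    conn.setRequestMethod("GET");
                    InputStream in = conn.getInputStream(); // reading the response fires the request
                    in.close();
                    conn.disconnect();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 30 * 60 * 1000L); // run now, then every 30 minutes
    }
}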



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Auto-Indexing-in-Solr-tp4080233p4080278.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Auto Indexing in Solr

2013-07-25 Thread Aditya
Hi

You could use a Java timer to trigger your DB import every X minutes. Another
option: you may know when your DB is updated; whenever the DB changes,
trigger the request to index the newly added data.

Regards
Aditya
www.findbestopensource.com



On Thu, Jul 25, 2013 at 11:42 AM, archit2112  wrote:

> Hi Im using Solr 4's Data Import Utility to index Oracle 10g XE database.
> Im
> using full imports as well as delta imports. I want these processes to be
> automatic. (Eg: The import processes can be timed or should be executed as
> soon any data in the database is modified). I searched for the same online
> and I heard people talk about CRON and scripts. However, Im not able to
> figure out how to implement it. Can you please provide a tutorial like
> explanation? Thanks in advance
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Auto-Indexing-in-Solr-tp4080233.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



Re: maximum number of documents per shard?

2013-07-25 Thread Dmitry Kan
Nicole,

According to our findings, there is also a limit for the number of shards
depending on the volume of the returned data. See this jira:

https://issues.apache.org/jira/browse/SOLR-4903

Dmitry


On Thu, Jul 25, 2013 at 11:25 AM, Nicole Lacoste wrote:

> Oh found the answer myself.  Its the GET methods URL length that limits the
> number of shards.
>
> Niki
>
>
> On 25 July 2013 10:14, Nicole Lacoste  wrote:
>
> > Is there a limit on the number of shards?
> >
> > Niki
> >
> >
> > On 24 July 2013 01:14, Jack Krupansky  wrote:
> >
> >> 2.1 billion documents (including deleted documents) per Lucene index,
> but
> >> essentially per Solr shard as well.
> >>
> >> But don’t even think about going that high. In fact, don't plan on going
> >> above 100 million unless you do a proof of concept that validates that
> you
> >> get acceptable query and update performance . There is no hard limit
> >> besides that 2.1 billion Lucene limit, but... performance will vary.
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Ali, Saqib
> >> Sent: Tuesday, July 23, 2013 6:18 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: maximum number of documents per shard?
> >>
> >> still 2.1 billion documents?
> >>
> >
> >
> >
> > --
> > * *
> >
>
>
>
> --
> * *
>


Spell check SOLR 3.6.1 not working for numbers

2013-07-25 Thread Poornima Jay
Hi,

I am using Solr 3.6.1 and have implemented spellcheck. I found that numbers in the
spellcheck query do not return any results. Below are my solrconfig.xml and
schema.xml details. Please let me know what needs to be done in order
to get spellcheck working for numbers.

solrConfig

     
    default   
    solr.IndexBasedSpellChecker
    spell  
    ./spellchecker   
    0.7    
    true
    .0001
   
  textSpell



  
    
    default   
    
    false
    
    false
    
    10
  
      
      spellcheck
        
  

Schema

         
            
            
            
            
            
            
         
        
         
        
        
        
      
      




   
   
 
   

Thanks,
Poornima

Solr4.2 PostCommit EventListener not working on Replication-Instances

2013-07-25 Thread Dirk Högemann
Hello,

I have implemented a Solr EventListener, which should be fired after
committing.
This works fine on the Solr-Master Instance and  it also worked in Solr 3.5
on any Slave Instance.
I upgraded my installation to Solr 4.2 and now the postCommit event is not
fired any more on the replication (Slave) instances, which is a huge
problem, as other caches have to be invalidated when replication takes place.

This is my configuration solrconfig.xml on the slaves:

  

  1



...


  

...
  

  http://localhost:9101/solr/Core1
  00:03:00

  

Any hints?

Best regards


RE: new field type - enum field

2013-07-25 Thread Elran Dvir
Hi,

I have implemented it as Chris described:
the field is indexed as numeric, but displayed as a string, according to
configuration.
It applies to facet, pivot, group and query.

How do we proceed? How do I contribute it?

Thanks.

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, July 25, 2013 4:40 AM
To: solr-user@lucene.apache.org
Subject: Re: new field type - enum field


: Doable at Lucene level by any chance?

Given how well the Trie fields compress (ByteField and ShortField have been 
deprecated in favor of TrieIntField for this reason) it probably just makes 
sense to treat it as a numeric at the Lucene level.

: > If there's positive feedback, I'll open an issue with a patch for the 
functionality.

I've typically dealt with this sort of thing at the client layer using a simple 
numeric field in Solr, or used an UpdateProcessor to convert the 
String->numeric mapping when indexing & used client logic of a
DocTransformer to handle the stored value at query time -- but having a built 
in FieldType that handles that for you automatically (and helps ensure the 
indexed values conform to the enum) would certainly be cool if you'd like to 
contribute it.
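
A minimal sketch of the client-layer approach (the "severity" field and its
values are made-up examples; the Solr field is assumed to be a plain int
field):

import java.util.Arrays;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

public class SeverityEnumMapper {
    // the ordinal of each label is what actually gets indexed
    private static final List<String> VALUES = Arrays.asList("Low", "Medium", "High", "Critical");

    public static int toOrdinal(String label) {
        int i = VALUES.indexOf(label);
        if (i < 0) throw new IllegalArgumentException("Unknown severity: " + label);
        return i;
    }

    public static String toLabel(int ordinal) {
        return VALUES.get(ordinal);
    }

    public static SolrInputDocument docWithSeverity(String id, String severityLabel) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("severity", toOrdinal(severityLabel)); // indexed/sorted as an int
        return doc;
    }
}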


-Hoss

Email secured by Check Point


Re: maximum number of documents per shard?

2013-07-25 Thread Nicole Lacoste
Oh, I found the answer myself. It's the GET method's URL length that limits
the number of shards.

Niki


On 25 July 2013 10:14, Nicole Lacoste  wrote:

> Is there a limit on the number of shards?
>
> Niki
>
>
> On 24 July 2013 01:14, Jack Krupansky  wrote:
>
>> 2.1 billion documents (including deleted documents) per Lucene index, but
>> essentially per Solr shard as well.
>>
>> But don’t even think about going that high. In fact, don't plan on going
>> above 100 million unless you do a proof of concept that validates that you
>> get acceptable query and update performance . There is no hard limit
>> besides that 2.1 billion Lucene limit, but... performance will vary.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Ali, Saqib
>> Sent: Tuesday, July 23, 2013 6:18 PM
>> To: solr-user@lucene.apache.org
>> Subject: maximum number of documents per shard?
>>
>> still 2.1 billion documents?
>>
>
>
>
> --
> * *
>



-- 
* *


Re: Document Similarity Algorithm at Solr/Lucene

2013-07-25 Thread Furkan KAMACI
BTW, how does Solr's MoreLikeThis component work? Which algorithm does it use
under the hood?
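
For reference, a typical MoreLikeThis request through the standard search
handler looks roughly like this (field names and the document id are
placeholders):

    http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=body&mlt.mintf=1&mlt.mindf=1&mlt.count=5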


2013/7/24 Roman Chyla 

> This paper contains an excellent algorithm for plagiarism detection, but
> beware the published version had a mistake in the algorithm - look for
> corrections - I can't find them now, but I know they have been published
> (perhaps by one of the co-authors). You could do it with solr, to create an
> index of hashes, with the twist of storing position of the original text
> (source of the hash) together with the token and the solr highlighting
> would do the rest for you :)
>
> roman
>
>
> On Tue, Jul 23, 2013 at 11:07 AM, Shashi Kant  wrote:
>
> > Here is a paper that I found useful:
> > http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
> >
> >
> > On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI 
> > wrote:
> > > Thanks for your comments.
> > >
> > > 2013/7/23 Tommaso Teofili 
> > >
> > >> if you need a specialized algorithm for detecting blogposts
> plagiarism /
> > >> quotations (which are different tasks IMHO) I think you have 2
> options:
> > >> 1. implement a dedicated one based on your features / metrics / domain
> > >> 2. try to fine tune an existing algorithm that is flexible enough
> > >>
> > >> If I were to do it with Solr I'd probably do something like:
> > >> 1. index "original" blogposts in Solr (possibly using Jack's
> suggestion
> > >> about ngrams / shingles)
> > >> 2. do MLT queries with "candidate blogposts copies" text
> > >> 3. get the first, say, 2-3 hits
> > >> 4. mark it as quote / plagiarism
> > >> 5. eventually train a classifier to help you mark other texts as
> quote /
> > >> plagiarism
> > >>
> > >> HTH,
> > >> Tommaso
> > >>
> > >>
> > >>
> > >> 2013/7/23 Furkan KAMACI 
> > >>
> > >> > Actually I need a specialized algorithm. I want to use that
> algorithm
> > to
> > >> > detect duplicate blog posts.
> > >> >
> > >> > 2013/7/23 Tommaso Teofili 
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I you may leverage and / or improve MLT component [1].
> > >> > >
> > >> > > HTH,
> > >> > > Tommaso
> > >> > >
> > >> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
> > >> > >
> > >> > >
> > >> > > 2013/7/23 Furkan KAMACI 
> > >> > >
> > >> > > > Hi;
> > >> > > >
> > >> > > > Sometimes a huge part of a document may exist in another
> > document. As
> > >> > > like
> > >> > > > in student plagiarism or quotation of a blog post at another
> blog
> > >> post.
> > >> > > > Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any
> > class
> > >> > to
> > >> > > > detect it?
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>


Re: maximum number of documents per shard?

2013-07-25 Thread Nicole Lacoste
Is there a limit on the number of shards?

Niki


On 24 July 2013 01:14, Jack Krupansky  wrote:

> 2.1 billion documents (including deleted documents) per Lucene index, but
> essentially per Solr shard as well.
>
> But don’t even think about going that high. In fact, don't plan on going
> above 100 million unless you do a proof of concept that validates that you
> get acceptable query and update performance . There is no hard limit
> besides that 2.1 billion Lucene limit, but... performance will vary.
>
> -- Jack Krupansky
>
> -Original Message- From: Ali, Saqib
> Sent: Tuesday, July 23, 2013 6:18 PM
> To: solr-user@lucene.apache.org
> Subject: maximum number of documents per shard?
>
> still 2.1 billion documents?
>



-- 
* *


Re: Querying a specific core in solr cloud

2013-07-25 Thread vicky desai
Hi,

I have also noticed that once I bring the core up on both machines,
&distrib=false works well. Could this be a possible bug that when a core is
down on one instance &distrib=false doesn't work?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Querying-a-specific-core-in-solr-cloud-tp4079964p4080246.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Querying a specific core in solr cloud

2013-07-25 Thread vicky desai
Hi Erik,


Thanks for the reply

But does &distrib=true work for replicas as well? As I mentioned earlier, I
have a setup of 1 leader and 1 replica. If a core is up on either of the
instances, querying both instances gives me results even with
&distrib=false.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Querying-a-specific-core-in-solr-cloud-tp4079964p4080244.html
Sent from the Solr - User mailing list archive at Nabble.com.