Re: Sorting order of suggested words

2017-01-19 Thread Keiichi MORITA
I found the option `comparatorClass`.

https://wiki.apache.org/solr/SpellCheckComponent#Custom_Comparators_and_the_Lucene_Spell_Checkers_.28IndexBasedSpellChecker.2C_FileBasedSpellChecker.2C_DirectSolrSpellChecker.29

If I want to sort suggested words alphabetically, it seems I need to set a custom
comparator.
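For reference, a minimal sketch of where that option goes in solrconfig.xml (the field name and the comparator class name here are made up; "score" is the default and "freq" the other built-in value, so alphabetical order would indeed need a custom Comparator<SuggestWord> implementation, per the wiki page above):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- hypothetical custom comparator; built-in values are "score" and "freq" -->
    <str name="comparatorClass">com.example.AlphabeticSuggestWordComparator</str>
  </lst>
</searchComponent>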






Information on classifier based key word suggestion

2017-01-19 Thread Shamik Bandopadhyay
Hi,

  I'm exploring a way to suggest keywords/tags based on a text snippet. I
have a fairly small taxonomy of product, release, category, type, etc.
stored in an in-memory database. What I'm looking for is a tool that will
analyze a given text and suggest not only the fields associated with the
taxonomy but also any keywords it considers relevant to the text. The
keywords can then be leveraged as a mechanism for findability of the document.
As a newbie in this area, I'm a tad overwhelmed by the different options and
struggling to find the right approach. To start with I tried GATE, but it
seems to be limited to providing taxonomy data, which has to be supplied
as flat text. A few people suggested using classifiers such as Naive Bayes
or other machine learning tools.

I'd appreciate it if anyone can provide some direction in this regard.

Thanks,
Shamik


Re: CloudSolrStream can't set the setZkClientTimeout and setZkConnectTimeout properties

2017-01-19 Thread Will Martin
That's the default behavior: the client and server negotiate a timeout of 2/3 of min(server max, client max).

This allows the client time to search for a new leader before all of its time is consumed.
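Applied to the numbers in this thread (assuming that 2/3 negotiation): a 10 s client-side timeout would come out at roughly 2/3 x 10 s, i.e. about 6.7 s, which lines up with the reconnects Yago sees being triggered after about 6 s.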

 zookeeper user @ apache org 

-will

On 1/19/2017 12:59 PM, Yago Riveiro wrote:

I can see some reconnects in my logs; the process of consuming the stream
doesn't break and continues as normal.

The timeout is 10s, but I can see in the logs that the reconnect is
triggered after 6s. I don't know if that's the default behaviour or if the
zk timeout is not being honoured.



-
Best regards

/Yago




Re: Will Solr flush docs to disk when ram buffer is full (time of auto commit is not reached yet)?

2017-01-19 Thread Jan Høydahl
It will flush the buffer to disk as a new segment without opening a new searcher. I
guess the transaction log will be rotated too, but I'm not sure.
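For reference, the two settings involved look roughly like this in solrconfig.xml (ramBufferSizeMB sits under <indexConfig>, autoCommit under <updateHandler>; the values are illustrative, not recommendations):

<ramBufferSizeMB>100</ramBufferSizeMB>

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

With openSearcher=false, neither the buffer-full flush nor the hard commit makes the new segment visible to searches; visibility only changes on a soft commit or a commit with openSearcher=true.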

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 19. jan. 2017 kl. 21.16 skrev Ziyuan Qin :
> 
> Hi All,
> 
> I'm trying to understand how Solr works with disk IO during and between hard 
> commits. I hope you can help me.
> 
> Let's assume Softcommit is turned off. Autocommit is turned on. 
> 
> Then during a hard commit:
> 
> 1. The tlog is truncated: A new tlog is started. (Disk IO involved)
> 2. The current index segment is closed and flushed. (Disk IO involved)
> 
> Between two hard commits:
> 1. Newly added documents are held in the RAM buffer first (defined in 
> solrconfig by ramBufferSizeMB; no disk IO involved).
> I assume this buffer actually hosts an open segment in RAM, am I right?
> 2. What will happen when the ram buffer is full and the time of autocommit is 
> not reached yet? Will Solr flush the segment in ram to disk anyway?
> 
> Thank you very much,
> Atom



Will Solr flush docs to disk when ram buffer is full (time of auto commit is not reached yet)?

2017-01-19 Thread Ziyuan Qin
Hi All,

I'm trying to understand how Solr works with disk IO during and between hard 
commits. I hope you can help me.

Let's assume Softcommit is turned off. Autocommit is turned on. 

Then during a hard commit:
 
1. The tlog is truncated: A new tlog is started. (Disk IO involved)
2. The current index segment is closed and flushed. (Disk IO involved)

Between two hard commits:
1. Newly added documents are held in the RAM buffer first (defined in solrconfig 
by ramBufferSizeMB; no disk IO involved).
I assume this buffer actually hosts an open segment in RAM, am I right?
2. What will happen when the ram buffer is full and the time of autocommit is 
not reached yet? Will Solr flush the segment in ram to disk anyway?

Thank you very much,
Atom


Removing duplicate values from fields filled with copyField...

2017-01-19 Thread Georgios Petasis

Hi all,

It seems that this is a popular request (remove duplicates generated 
from copyField), but I am not sure that I have understood the answer. 
Can somebody point me to a correct answer for this issue?


I understand that this involves "update request processors", but I am 
not sure how to configure them.


Regards,

George
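For reference, the usual recipe is to do the copy inside an update request processor chain rather than with copyField, because update processors run before copyField is applied and therefore cannot dedupe its output. A minimal sketch, with made-up field names and chain name:

<updateRequestProcessorChain name="clone-and-dedupe">
  <!-- copy the value in the processor chain instead of via copyField -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">category</str>
    <str name="dest">category_all</str>
  </processor>
  <!-- drop duplicate values from the destination field -->
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <str name="fieldName">category_all</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain only takes effect when it is referenced, e.g. via the update.chain request parameter or as the default chain on the update handler.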




Re: indexing error - 6.3.0

2017-01-19 Thread Joe Obernberger
Another data point - the 5-node cluster does have another, large collection 
on it (maybe 500G in HDFS) that did have field guessing enabled, but it is 
a static collection (I'm not adding data to it).  I've just removed that 
collection and am running the test again - 
it's gotten a lot further along so far.


-Joe


On 1/19/2017 12:59 PM, Joe Obernberger wrote:
Thank you Erick!  For this scenario, I was defining the schema 
manually (editing managed_schema and pushing to zookeeper), but didn't 
realize that I had left the field guessing block in the solrconfig.xml 
file enabled.  I've now disabled the field guessing, but still getting 
errors when indexing many small records.  This error happened on one 
of the 5 servers in the cluster:


Exceptioat 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)exOutOfBoundsException
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)

Caused by: java.lang.ArrayIndexOutOfBoundsException
at 
org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:165)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.nextLeaf(SegmentTermsEnumFrame.java:284)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.next(SegmentTermsEnumFrame.java:269)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.next(SegmentTermsEnum.java:955)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(SegmentTermsEnum.java:762)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyTermDeletes(BufferedUpdatesStream.java:538)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:287)
at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
at 
org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)



Then 2 seconds later:

2017-01-19 17:45:02.798 ERROR (commitScheduler-32-thread-1) 
[c:Wordline2 s:shard1 r:core_node3 x:Wordline2_shard1_replica1] 
o.a.s.u.CommitTracker auto commit 
error...:org.apache.lucene.index.CorruptIndexException: codec header 
mismatch: actual header=164048902 vs expected header=1071082519 
(resource=_i_Lucene50_0.tip)
at 
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:196)

at org.apache.lucene.util.fst.FST.(FST.java:327)
at org.apache.lucene.util.fst.FST.(FST.java:313)
at 
org.apache.lucene.codecs.blocktree.FieldReader.(FieldReader.java:91)
at 
org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.(BlockTreeTermsReader.java:234)
at 
org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.(PerFieldPostingsFormat.java:292)
at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
at 
org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:106)
at 
org.apache.lucene.index.SegmentReader.(SegmentReader.java:74)
at 
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at 
org.apache.lucene.index.BufferedUpdatesStream$SegmentState.(BufferedUpdatesStream.java:384)
at 
org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
at 
org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
at 
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2987)
at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3206)
at 
org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3171)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:607)
at 
org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecu

Re: CloudSolrStream can't set the setZkClientTimeout and setZkConnectTimeout properties

2017-01-19 Thread Yago Riveiro
I can see some reconnects in my logs; the process of consuming the stream
doesn't break and continues as normal.

The timeout is 10s, but I can see in the logs that the reconnect is
triggered after 6s. I don't know if that's the default behaviour or if the
zk timeout is not being honoured.



-
Best regards

/Yago


Re: indexing error - 6.3.0

2017-01-19 Thread Joe Obernberger
Thank you Erick!  For this scenario, I was defining the schema manually 
(editing managed_schema and pushing to zookeeper), but didn't realize 
that I had left the field guessing block in the solrconfig.xml file 
enabled.  I've now disabled the field guessing, but still getting errors 
when indexing many small records.  This error happened on one of the 5 
servers in the cluster:


Exceptioat 
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:668)exOutOfBoundsException
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:648)

Caused by: java.lang.ArrayIndexOutOfBoundsException
at 
org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:165)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.nextLeaf(SegmentTermsEnumFrame.java:284)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.next(SegmentTermsEnumFrame.java:269)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.next(SegmentTermsEnum.java:955)
at 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(SegmentTermsEnum.java:762)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyTermDeletes(BufferedUpdatesStream.java:538)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:287)
at 
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
at 
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)

at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)



Then 2 seconds later:

2017-01-19 17:45:02.798 ERROR (commitScheduler-32-thread-1) [c:Wordline2 
s:shard1 r:core_node3 x:Wordline2_shard1_replica1] o.a.s.u.CommitTracker 
auto commit error...:org.apache.lucene.index.CorruptIndexException: 
codec header mismatch: actual header=164048902 vs expected 
header=1071082519 (resource=_i_Lucene50_0.tip)
at 
org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:196)

at org.apache.lucene.util.fst.FST.(FST.java:327)
at org.apache.lucene.util.fst.FST.(FST.java:313)
at 
org.apache.lucene.codecs.blocktree.FieldReader.(FieldReader.java:91)
at 
org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.(BlockTreeTermsReader.java:234)
at 
org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.(PerFieldPostingsFormat.java:292)
at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
at 
org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:106)
at 
org.apache.lucene.index.SegmentReader.(SegmentReader.java:74)
at 
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at 
org.apache.lucene.index.BufferedUpdatesStream$SegmentState.(BufferedUpdatesStream.java:384)
at 
org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
at 
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
at 
org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
at 
org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2987)
at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3206)
at 
org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3171)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:607)

at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Then a few mSec later:

2017-01-19 17:45:33.176 ERROR (qtp606548741-122) [c:Wordline2 s:shard1 
r:core_node3 x:Wordline2_shard1_replica1] o.a.s.h.RequestHandlerBase 
org.apache.solr.common.SolrException: Exception writing 

Re: Boolean disjunction with nested documents

2017-01-19 Thread Mikhail Khludnev
It's my pet peeve. Try
?q={!parent which=content_type:activity}(schedule.weekday:1) OR
has_schedules:false&debugQuery=true
vs
?q= {!parent which=content_type:activity}(schedule.weekday:1) OR
has_schedules:false&debugQuery=true
and you'll see how space matters.
The pro's way is:

?q={!parent which=content_type:activity v=$childq} has_schedules:false
&debugQuery=true&childq=schedule.weekday:1

Also check these, they deserve a read:

https://lucidworks.com/blog/2009/03/31/nested-queries-in-solr/
("Why Not AND, OR, And NOT?" | Lucidworks.com)

http://blog-archive.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html


On Thu, Jan 19, 2017 at 7:33 PM, Ivan Bianchi  wrote:

> I hope someone can help me because I have spent too much time looking into
> this issue :(
>
> I have 2 kinds of documents related by a 1-n relation; in my example,
> one activity has many schedules.
> To achieve this I have some inner child documents with schedule fields
> inside the activity document. The identifier field of the document is
> called content_type.
>
> In the activity I have a boolean field to detect if the document has
> children (schedules) called 'has_schedules'.
>
> I'm trying to find the documents whose children have a weekday = 1 OR they
> don't have children. Here's the filter query:
> "{!parent which=content_type:activity}(schedule.weekday:1)" OR
> has_schedules:false
>
> I don't understand how the " char works here: wherever I enclose the
> condition in " chars, Solr ignores it. For example, the previous query is
> equal to has_schedules:false; on the contrary, I only get the first
> condition. Also, if I use "(" it gives me a syntax error.
> The other issue here is that if I remove the " chars, Solr treats the second
> condition (has_schedules:false) as part of the nested child query, which
> implies an error as this is a parent field.
>
> Thank you for any help,
>
>
> --
> Ivan
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Phonetic Search

2017-01-19 Thread Walter Underwood
Phonetic search will not match “satpuda” and “satpura” because they sound 
different. You want fuzzy search.

To get fuzzy search that is easy to use in edismax, apply the patch in SOLR-629.

https://issues.apache.org/jira/browse/SOLR-629 
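As a quick illustration (the field name is made up): with the standard lucene query parser, a query like q=name:satpuda~1 enables fuzzy matching with an edit distance of 1, which is enough to match "satpura" since the two terms differ by a single character. The SOLR-629 patch is about getting that kind of fuzzy matching conveniently from edismax.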


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 19, 2017, at 5:14 AM, Vivek Pathak  wrote:
> 
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
> 
> didnt work for you?
> 
> 
> On 01/19/2017 05:58 AM, PAVAN wrote:
>> Hi,
>> 
>> I am trying to implement phonetic search in my application. Below are
>> indexed terms in solr.
>> 
>> "satpura private limited"
>> 
>> when a user searches for "satpuda", it should return the above result. Below
>> is the configuration:
>> 
>> [the fieldType/analyzer XML was mostly stripped by the archive; the surviving
>> attributes indicate a stop filter (words="stopwords.txt") and a Beider-Morse
>> phonetic filter (nameType="GENERIC", ruleType="APPROX", concat="true",
>> languageSet="auto")]
>> 
>> Please help me how to implement phonetic search.
>> 
>> 
>> 
>> 
>> 
>> 
> 



Boolean disjunction with nested documents

2017-01-19 Thread Ivan Bianchi
I hope someone can help me because I have spent too much time looking into
this issue :(

I have 2 kinds of documents related by a 1-n relation; in my example,
one activity has many schedules.
To achieve this I have some inner child documents with schedule fields
inside the activity document. The identifier field of the document is
called content_type.

In the activity I have a boolean field to detect if the document has
children (schedules) called 'has_schedules'.

I'm trying to find the documents whose children have a weekday = 1 OR they
don't have children. Here's the filter query:
"{!parent which=content_type:activity}(schedule.weekday:1)" OR
has_schedules:false

I don't understand how the " char works here: wherever I enclose the
condition in " chars, Solr ignores it. For example, the previous query is
equal to has_schedules:false; on the contrary, I only get the first
condition. Also, if I use "(" it gives me a syntax error.
The other issue here is that if I remove the " chars, Solr treats the second
condition (has_schedules:false) as part of the nested child query, which
implies an error as this is a parent field.

Thank you for any help,


-- 
Ivan


Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp
HDFS is like a shared filesystem so every Solr Cloud instance can access 
the data using the same path or URL. The clusterstate.json looks like this:


"shards":{"shard1":{
"range":"8000-7fff",
"state":"active",
"replicas":{
  "core_node1":{
"core":"test1.collection-0_shard1_replica1",
"dataDir":"hdfs://master...:8000/test1.collection-0/core_node1/data/",
"base_url":"http://slave3:9000/solr";,
"node_name":"slave3:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node1/data/tlog"},
  "core_node2":{
"core":"test1.collection-0_shard1_replica2",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node2/data/",
"base_url":"http://slave2:9000/solr";,
"node_name":"slave2:9000_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node2/data/tlog",
"leader":"true"},
  "core_node3":{
"core":"test1.collection-0_shard1_replica3",
"dataDir":"hdfs://master:8000/test1.collection-0/core_node3/data/",
"base_url":"http://slave4:9005/solr";,
"node_name":"slave4:9005_solr",
"state":"active",
"ulogDir":"hdfs://master:8000/test1.collection-0/core_node3/data/tlog"

So every replica is always assigned to one node and this is stored 
in ZK, pretty much the same as for non-HDFS setups. But since the data is 
not stored locally but on the network, and the path does not contain 
any node information, you can of course easily hand the work over to a 
different Solr node. You should just need to update the owner of the 
replica in ZK and you should basically be done, I assume. That's why the 
documentation states that an advantage of using HDFS is that a failing 
node can be replaced by a different one. The Overseer just has to move 
the ownership of the replica, which seems to be what the code is trying 
to do. There just seems to be a bug in the code so that the core does 
not get created on the target node.


Each data directory also contains a lock file. The documentation states 
that one should use the HdfsLockFactory, which unfortunately can easily 
lead to SOLR-8335, which hopefully will be fixed by SOLR-8169. A manual 
cleanup is however also easily done but seems to require a node restart 
to take effect. But I'm also only recently playing around with all this ;-)


regards,
Hendrik

On 19.01.2017 16:40, Shawn Heisey wrote:

On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:

Given that the data is on HDFS, it shouldn't matter whether any active
replica is left, as the data does not need to be transferred from
another instance; the new core will just take over the existing
data. Thus a replication factor of 1 should also work; just in that
case the shard would be down until the new core is up. Anyhow, it
looks like the above call fails to set the shard id, I guess, or
some code is doing the wrong check.

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn





Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Shawn Heisey
On 1/19/2017 4:09 AM, Hendrik Haddorp wrote:
> Given that the data is on HDFS, it shouldn't matter whether any active
> replica is left, as the data does not need to be transferred from
> another instance; the new core will just take over the existing
> data. Thus a replication factor of 1 should also work; just in that
> case the shard would be down until the new core is up. Anyhow, it
> looks like the above call fails to set the shard id, I guess, or
> some code is doing the wrong check. 

I know very little about how SolrCloud interacts with HDFS, so although
I'm reasonably certain about what comes below, I could be wrong.

I have not ever heard of SolrCloud being able to automatically take over
an existing index directory when it creates a replica, or even share
index directories unless the admin fools it into doing so without its
knowledge.  Sharing an index directory for replicas with SolrCloud would
NOT work correctly.  Solr must be able to update all replicas
independently, which means that each of them will lock its index
directory and write to it.

It is my understanding (from reading messages on mailing lists) that
when using HDFS, Solr replicas are all separate and consume additional
disk space, just like on a regular filesystem.

I found the code that generates the "No shard id" exception, but my
knowledge of how the zookeeper code in Solr works is not deep enough to
understand what it means or how to fix it.

Thanks,
Shawn



Re: indexing error - 6.3.0

2017-01-19 Thread Erick Erickson
It looks to me like you're using "field guessing". For production systems I
generally don't recommend this. The problem is that it makes the best estimate
that it can based on the first document for any given field. So it sees a field
with the value 1 and tries to make the field an int. Then 100 docs later a doc
comes through with that field as 1.0 and you get an indexing exception.

Next, if you're sending many docs rapidly through SolrCloud, there are all kinds
of things going on to try to update the configset, reload the cores to
get the latest
configurations down to all of the replicas and the like.

So the very first thing I'd try is to define the schema manually and see if that
cures things.

BTW, the big, scary "DO NOT EDIT THIS FILE" in the managed_schema file
is a bit of overkill. You _can_ edit that file manually; the danger is
that if you have field-guessing turned on, already-running Solr nodes may
overwrite your changes. So it's safe to manually edit that file and push it to Zookeeper
(for example as sketched below) in two situations:
1> you have disabled "field guessing"
or
2> you edit and push when all your Solr nodes are shut down.
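For reference, pushing an edited managed-schema by hand might look like the following (the ZooKeeper address, config name and local path are made up), followed by a collection RELOAD so the running cores pick up the change:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd putfile /configs/myconfig/managed-schema /path/to/edited/managed-schema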

Best,
Erick

On Wed, Jan 18, 2017 at 9:11 PM, Joe Obernberger
 wrote:
> Hi All - I've been trying to debug this, but it keeps occurring. Even if I
> do 100 at a time, or 50 at a time, eventually I get the below stack trace.
> I've also adjusted the autoSoftCommit and autoCommit times to a variety of
> values.  It stills fails after a time; typically around 27-50 million
> records, I get this error. This is on a newly created collection (that I've
> been dropping and recreating after each test).
>
> Is there anything I can try that may help debug?  Perhaps my method of
> indexing is incorrect?  Thanks for any ideas!
>
> -Joe
>
>
> On 1/17/2017 10:13 AM, Joe Obernberger wrote:
>>
>> While indexing a large number of records in Solr Cloud 6.3.0 with a 5 node
>> configuration, I received an error.  I'm using java code / solrj to perform
>> the indexing by creating a list of SolrInputDocuments, 1000 at a time, and
>> then calling CloudSolrClient.add(list).  The records are small - about 6
>> fields of short strings and numbers.
>>
>> If I do 100 at a time, I can't replicate the error, but 1000 at a time has
>> consistently causes the below exception to occur.  The index is stored in a
>> shared HDFS.
>>
>> 2017-01-17 04:21:00.022 ERROR (qtp606548741-21) [c:Worldline s:shard5
>> r:core_node1 x:Worldline_shard5_replica1] o.a.s.h.RequestHandlerBase
>> org.apache.solr.common.SolrException: Exception writing document id
>> 6228601a-8756-4b16-bdc3-ad026754b225 to the index; possible analysis error.
>> at
>> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178)
>> at
>> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
>> at
>> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>> at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:957)
>> at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdatePr

Re: Solr Shard Splitting Issue

2017-01-19 Thread ekta
Hi Anshum,

Thanks for the reply. 

I had a copy of the data that I was experimenting on, and in any case I tried
it again later, after I posted the mail. Some points I want to let you know:

1. This time I did not change the state in state.json.
2. For the rest, I did the same steps as above, and the data still got frozen at 24GB
in both shards (my parent shard had ~60GB).
3. The state.json is still showing
3.1 Parent -  Active
3.2 Child   - Construction
4. Yes, I do have logs; I am attaching the file to this mail. Please check it
out.
5. I did the shard splitting with this command in the browser:
"http://10.1.1.78:4983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1"
and I got a Timeout Exception in the browser. I am attaching a file which
contains what the browser displayed.
6. The details of the system (Amazon EC2 instances) on which I am doing the
above steps:
 6.1 30GB RAM
 6.2 4 cores
 6.3 250GB drive
7. Lastly, I googled the timeout exception I got and found a reply by you on a
post about the same, where you mentioned issuing the split shard command
asynchronously; I tried that too (the asynchronous form is sketched just below).
As a result I did not get the timeout exception from the browser, but the rest
was all the same as mentioned above.
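For reference, the asynchronous form mentioned in point 7 looks roughly like this (the request id is made up; host, collection and shard are as above):

http://10.1.1.78:4983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&async=split-shard1

and its progress can then be polled with:

http://10.1.1.78:4983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-shard1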

Please tell me if any further details are required. Attachments: solr.log, Browser_result.txt





 





Re: Phonetic Search

2017-01-19 Thread Vivek Pathak

https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching

didn't work for you?


On 01/19/2017 05:58 AM, PAVAN wrote:

Hi,

I am trying to implement phonetic search in my application. Below are
indexed terms in solr.

"satpura private limited"

when a user searches for "satpuda", it should return the above result. Below
is the configuration:

[the fieldType/analyzer XML was stripped from the archived message; a partial
copy survives quoted in Walter Underwood's reply above]

Please help me how to implement phonetic search.










RE: Joining Across Collections

2017-01-19 Thread Moenieb Davids
Hi Guys

Just a quick question on search and join:

I have a few cores which are based on a mainframe extract, 1 core per extracted 
file, each of which resembles a "DB table".
The cores are all linked via 1-to-many fields, with a structure similar 
to a normal ERD.

Is it possible to return the result of a query that joins, let's say, 3 cores in 
the following format:

{
  "core1_id":"XXX",
  "_childDocuments_":[
    {
      "core2_id":"yyy",
      "core_2_fieldx":"ABC",
      "_childDocuments_":[
        {
          "core3_id":"zzz",
          "core_3_fieldx":"ABC",
          "core3_fieldy":"123"
        }
      ],
      "core2_fieldy":"123"
    }
  ]
}

Regards
Moenieb Davids

-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: 19 January 2017 10:00 AM
To: solr-user; nabil Kouici
Subject: Re: Joining Across Collections

It seems like it can be done by just negating join query or I'm missing 
something.

On Wed, Jan 18, 2017 at 11:32 AM, nabil Kouici 
wrote:

> Hi All,
> I'm using  join across collection feature to do an inner join between 
> 2 collections. It works fine.
> Is it possible to use this feature to compare between fields from 
> different collections. For exemple:
> Collection1: Field1    Collection2: Field2
> search documents from Collection1 where Field1 != Field2
> In SQL, this would translate to:
> Select A.* From Collection1 A inner join Collection2 B on A.id = B.id
> Where A.Field1 <> B.Field2
>
> Thank you.
> Regards,NKI.
>



--
Sincerely yours
Mikhail Khludnev












Re: Solr on HDFS: AutoAddReplica does not add a replica

2017-01-19 Thread Hendrik Haddorp

Hi,
I'm seeing the same issue on Solr 6.3 using HDFS and a replication 
factor of 3, even though I believe a replication factor of 1 should work 
the same. When I stop a Solr instance this is detected and Solr actually 
wants to create a replica on a different instance. The command for that 
does however fail:


o.a.s.c.OverseerAutoReplicaFailoverThread Exception trying to create new 
replica on 
http://...:9000/solr:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at http://...:9000/solr: Error CREATEing SolrCore 
'test2.collection-09_shard1_replica1': Unable to create core 
[test2.collection-09_shard1_replica1] Caused by: No shard id for 
CoreDescriptor[name=test2.collection-09_shard1_replica1;instanceDir=/var/opt/solr/test2.collection-09_shard1_replica1]
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:593)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:251)
at 
org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.createSolrCore(OverseerAutoReplicaFailoverThread.java:456)
at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.lambda$addReplica$0(OverseerAutoReplicaFailoverThread.java:251)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Given that the data is on HDFS, it shouldn't matter whether any active replica 
is left, as the data does not need to be transferred from another 
instance; the new core will just take over the existing data. Thus a 
replication factor of 1 should also work; just in that case the shard 
would be down until the new core is up. Anyhow, it looks like the above 
call fails to set the shard id, I guess, or some code is doing the 
wrong check.


On 14.01.2017 02:44, Shawn Heisey wrote:

On 1/13/2017 5:46 PM, Chetas Joshi wrote:

One of the things I have observed is: if I use the collection API to
create a replica for that shard, it does not complain about the config
which has been set to ReplicationFactor=1. If replication factor was
the issue as suggested by Shawn, shouldn't it complain?

The replicationFactor value is used by exactly two things:  initial
collection creation, and autoAddReplicas.  It will not affect ANY other
command or operation, including ADDREPLICA.  You can create MORE
replicas than replicationFactor indicates, and there will be no error
messages or warnings.

In order to have a replica automatically added, your replicationFactor
must be at least two, and the number of active replicas in the cloud for
a shard must be less than that number.  If that's the case and the
expiration times have been reached without recovery, then Solr will
automatically add replicas until there are at least as many replicas
operational as specified in replicationFactor.


I would also like to mention that I experience some instance dirs
getting deleted and also found this open bug
(https://issues.apache.org/jira/browse/SOLR-8905)

The description on that issue is incomprehensible.  I can't make any
sense out of it.  It mentions the core.properties file, but the error
message shown doesn't talk about the properties file at all.  The error
and issue description seem to have nothing at all to do with the code
lines that were quoted.  Also, it was reported on version 4.10.3 ... but
this is going to be significantly different from current 6.x versions,
and the 4.x versions will NOT be updated with bugfixes.

Thanks,
Shawn





Phonetic Search

2017-01-19 Thread PAVAN
Hi,

I am trying to implement phonetic search in my application. Below are
indexed terms in solr. 

"satpura private limited"

when a user searches for "satpuda", it should return the above result. Below
is the configuration:

[the fieldType/analyzer XML was stripped from the archived message; a partial
copy survives quoted in Walter Underwood's reply above]

Please help me how to implement phonetic search.








Re: How does using cacheKey and lookup behave?

2017-01-19 Thread Mikhail Khludnev
You can have a left join in the nested entity; it gives empty child-entity rows
in the cache.
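For reference, the cacheKey/cacheLookup wiring from the question looks roughly like this (entity, table and column names are made up):

<entity name="parent" query="SELECT id, name FROM parent">
  <entity name="child"
          query="SELECT parent_id, tag FROM child"
          cacheImpl="SortedMapBackedCache"
          cacheKey="parent_id"
          cacheLookup="parent.id"/>
</entity>

Mikhail's point, as I read it, is that the left-join behaviour can then come from the nested entity's own SQL.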

On Wed, Jan 18, 2017 at 9:59 PM, Kaushik  wrote:

> I use the cacheKey, cacheLookup, SortedMapBackedCache in the Data Import
> Handler of Solr 5.x to join two or more entities. Does this give me an
> equivalent of Sql's inner join? If so, how can I get something similar to
> left join?
>
> Thank you,
> Kaushik
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Joining Across Collections

2017-01-19 Thread Mikhail Khludnev
It seems like it can be done by just negating the join query, or I'm missing
something.

On Wed, Jan 18, 2017 at 11:32 AM, nabil Kouici 
wrote:

> Hi All,
> I'm using  join across collection feature to do an inner join between 2
> collections. It works fine.
> Is it possible to use this feature to compare between fields from
> different collections. For exemple:
> Collection1: Field1    Collection2: Field2
> search documents from Collection1 where Field1 != Field2
> In SQL, this would translate to:
> Select A.* From Collection1 A inner join Collection2 B on A.id = B.id
> Where A.Field1 <> B.Field2
>
> Thank you.
> Regards,NKI.
>



-- 
Sincerely yours
Mikhail Khludnev