date:20150721

Hi Mese,

let me try to answer to your 2 questions :

1. What happens if a shard(both leader and replica) goes down. If the
document on the dead shard is updated, will it forward the document to
the
new shard. If so, when the dead shard comes up again, will this not be
considered for the same hask key range?

I see some confusion here.
First of all you need a smart client that will load balance the docs to
index.
Let's say the CloudSolrClient .

A solr document update is always a deletion and a re-insertion.
This means that you get the document from the index ( the stored fields),
and you add the document again.

If the document is on a dead shard, you have lost it, you can not retrieve
it until you have that shard to go up again.
Possibly it's still in the transaction log.

In the case you are re-indexing the doc, the doc will be re-index.
When the shard is up again, there will be 2 versions of the documents.
With some different fields but the same id.
What do you mean with : will this not be
considered for the same hask key range ?

2. Is there a way to fix this[removing duplicates across shards]?

i assume not an easy way.
You could re-index the content applying a Deduplication Update Request
processor.
But it will be costly.

Cheers

2015-07-21 15:01 GMT+01:00 Reitzel, Charles charles.reit...@tiaa-cref.org:

Also, the function used to generate hashes is
org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a
32-bit value. The range of the hash values assigned to each shard are
resident in Zookeeper. Since you are using only a single hash component,
all 32-bits will be used by the entire ID field value.

I.e. I see no routing delimiter (!) in your example ID value:

possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30

Which isn't required, but it means that documents (logs?) will be
distributed in a round-robin fashion over the shards. Not grouped by host
or environment (if I am reading it right).

You might consider the following: environment!hostname!UUID

E.g. intl-staging!possting.mongo-v2.services.com
!c2d2a376-5e4a-11e2-8963-0026b9414f30

This way documents from the same host will be grouped together, most
likely on the same shard. Further, within the same environment, documents
will be grouped on the same subset of shards. This will allow client
applications to set _route_=environment! or
_route_=environment!hostname! and limit queries to those shards
containing relevant data when the corresponding filter queries are applied.

If you were using route delimiters, then the default for a 2-part key (1
delimiter) is to use 16 bits for each part. The default for a 3-part key
(2 delimiters) is to use 8-bits each for the 1st 2 parts and 16 bits for
the 3rd part. In any case, the high-order bytes of the hash dominate the
distribution of data.

-Original Message-
From: Reitzel, Charles
Sent: Tuesday, July 21, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: Duplicate documents in multiple shards

When are you generating the UUID exactly? If you set the unique ID field
on an update, and it contains a new UUID, you have effectively created a
new document. Just a thought.

-Original Message-
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com]
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

Unable to delete by passing distrib=false as well. Also it is difficult to
identify those duplicate documents among the 130 million.

Is there a way we can see the generated hash key and mapping them to the
specific shard?

--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender
immediately and then delete it.

TIAA-CREF
*

--
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England

Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden

Okay. I'm going to run the index again with specifications that you
recommended. This could take a few hours but I will post the entire trace on
that error when it pops up again and I will let you guys know the results of
increasing the heap size. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218382.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden

Hey shawn when I use the -m 2g command in my script I get the error a 'cannot
open [path]/server/logs/solr.log for reading: No such file or directory' I
do not see how this would affect that. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218389.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query Performance

SolrMeter mate,

http://code.google.com/p/solrmeter/

Take a look, it will help you a lot !

Cheers

2015-07-21 16:49 GMT+01:00 Nagasharath sharathrayap...@gmail.com:

 Any recommended tool to test the query performance would be of great help.

 Thanks




-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England

Re: SOLR nrt read writes


 Could this be due to caching? I have tried to disable all in my solrconfig.


If you mean Solr caches ? NO .
Solr caches live the life of the searcher.
So new searcher, new caches ( possibly warmed with updated results) .

If you mean your application caching or browser caching, you should verify,
i assume you have control on that.

Cheers

2015-07-21 6:02 GMT+01:00 Bhawna Asnani bhawna.asn...@gmail.com:

 Thanks, I tried turning off auto softCommits but that didn't help much.
 Still seeing stale results every now and then. Also load on the server very
 light. We are running this just on a test server with one or two users. I
 don't see any warning in logs whole doing softCommits and it says it
 successfully opened new searcher and registered it as main searcher. Could
 this be due to caching? I have tried to disable all in my solrconfig.

 Sent from my iPhone

  On Jul 20, 2015, at 12:16 PM, Shawn Heisey apa...@elyograg.org wrote:
 
  On 7/20/2015 9:29 AM, Bhawna Asnani wrote:
  Thanks for your suggestions. The requirement is still the same , to be
  able to make a change to some solr documents and be able to see it on
  subsequent search/facet calls.
  I am using softCommit with waitSearcher=true.
 
  Also I am sending reads/writes to a single solr node only.
  I have tried disabling caches and warmup time in logs is '0' but every
  once in a while I do get the document just updated with stale data.
 
  I went through lucene documentation and it seems opening the
  IndexReader with the IndexWriter should make the changes visible to
  the reader.
 
  I checked solr logs no errors. I see this in logs each time
  'Registered new searcher Searcher@x' even before searches that had
  the stale document.
 
  I have attached my solrconfig.xml for reference.
 
  Your attachment made it through the mailing list processing.  Most
  don't, I'm surprised.  Some thoughts:
 
  maxBooleanClauses has been set to 40.  This is a lot.  If you
  actually need a setting that high, then you are sending some MASSIVE
  queries, which probably means that your Solr install is exceptionally
  busy running those queries.
 
  If the server is fairly busy, then you should increase maxTime on
  autoCommit.  I use a value of five minutes (30) ... and my server is
  NOT very busy most of the time.  A commit with openSearcher set to false
  is relatively fast, but it still has somewhat heavy CPU, memory, and
  disk I/O resource requirements.
 
  You have autoSoftCommit set to happen after five seconds.  If updates
  happen frequently or run for very long, this is potentially a LOT of
  committing and opening new searchers.  I guess it's better than trying
  for one second, but anything more frequent than once a minute is likely
  to get you into trouble unless the system load is extremely light ...
  but as already discussed, your system load is probably not light.
 
  For the kind of Near Real Time setup you have mentioned, where you want
  to do one or more updates, commit, and then query for the changes, you
  probably should completely remove autoSoftCommit from the config and
  *only* open new searchers with explicit soft commits.  Let autoCommit
  (with a maxTime of 1 to 5 minutes) handle durability concerns.
 
  A lot of pieces in your config file are set to depend on java system
  properties just like the example does, but since we do not know what
  system properties have been set, we can't tell for sure what those parts
  of the config are doing.
 
  Thanks,
  Shawn
 




-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England

Re: solr blocking and client timeout issue

2015-07-21 Thread Jeremy Ashcraft

I did find a dark corner of our application that a dev had left some 
experimental code in that snuck past QA, because it was rarely used.  A 
client discovered and was using it heavily over the past week.  It was 
generating multiple consecutive update/commit requests.  Its been 
disabled and the long GC pauses have nearly stopped (so far).  We did 
see one at about 4am for about 5 minutes.


is there a way to try to mitigate these longer GC, if/when they do 
happen. (FYI, we are upgrading to OpenJDK 1.8 tonight.  its been working 
great in dev/QA, so hopefully it will make enough of a difference)


On 07/20/2015 09:31 PM, Erick Erickson wrote:

bq: the config is set up per the NRT suggestions in the docs.
autoSoftCommit every 2 seconds and autoCommit every 10 minutes.

2 second soft commit is very aggressive, no matter what the NRT
suggestions are. My first question is whether that's really needed.
The soft commits should be as long as you can stand. And don't listen
to  your product manager who says 2 seconds is required, push back
and answer whether that's really necessary. Most people won't notice
the difference.

bq: ...we are noticing a lot higher number of hard commits than usual.

Is a client somewhere issuing a hard commit? This is rarely
recommended... And is openSearcher true or false? False is a
relatively cheap operation, true is quite expensive.

More than you want to know about hard and soft commits:

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick

Best,
Erick

On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft jashcr...@edgate.com wrote:

heap is already at 5GB

On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote:

no swapping that I'm seeing, although we are noticing a lot higher number
of hard commits than usual.

the config is set up per the NRT suggestions in the docs.  autoSoftCommit
every 2 seconds and autoCommit every 10 minutes.

there have been 463 updates in the past 2 hours, all followed by hard
commits

INFO  - 2015-07-20 12:26:20.979;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-07-20 12:26:21.021; org.apache.solr.core.SolrDeletionPolicy;
SolrDeletionPolicy.onCommit: commits: num=2

commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696}

commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697}
INFO  - 2015-07-20 12:26:21.022; org.apache.solr.core.SolrDeletionPolicy;
newest commit generation = 665697
INFO  - 2015-07-20 12:26:21.026;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-07-20 12:26:21.026;
org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
webapp=/solr path=/update params={omitHeader=falsewt=json}
{add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752),
5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328),
816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0 50

could that be causing some of the problems?


From: Shawn Heisey apa...@elyograg.org
Sent: Monday, July 20, 2015 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: solr blocking and client timeout issue

On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote:

I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i
can get production upgraded tonight.

still getting the big GC pauses this morning, even after applying the
GC tuning options.  Everything was fine throughout the weekend.

My biggest concern is that this instance had been running with no
issues for almost 2 years, but these GC issues started just last week.

It's very possible that you're simply going to need a larger heap than
you have needed in the past, either because your index has grown, or
because your query patterns have changed and now your queries need more
memory.  It could even be both of these.

At your current index size, assuming that there's nothing else on this
machine, you should have enough memory to raise your heap to 5GB.

If there ARE other software pieces on this machine, then the long GC
pauses (along with other performance issues) could be explained by too
much memory allocation out of the 8GB total memory, resulting in
swapping at the OS level.

Thanks,
Shawn


--
*jeremy ashcraft*
development manager
EdGate Correlation Services http://correlation.edgate.com
/253.853.7133 x228/


--
*jeremy ashcraft*
development manager
EdGate Correlation Services http://correlation.edgate.com
/253.853.7133 x228/

upgrade clusterstate.json fom 4.10.4 to split state.json in 5.2.1

2015-07-21 Thread Yago Riveiro

Hi,


How can I upgrade the clusterstate.json to be split by collection?


I read this issue https://issues.apache.org/jira/browse/SOLR-5473.


In theory exists a param “stateFormat” that configured to 2 says to use the 
/collections/collection/cluster.son format.


Where can I configure this?

—/Yago Riveiro

Parsing and indexing parts of the input file paths

Dear user and dev lists,

We are loading files from a directory and would like to index a portion of
each file path as a field as well as the text inside the file.

E.g., on HDFS we have this file path:

/user/andrew/1234/1234/file.pdf

And we would like the 1234 token parsed from the file path and indexed as
an additional field that can be searched on.

From my initial searches I can't see how to do this easily, so would I need
to write some custom code, or a plugin?

Thanks!

issue with query boost using qf and edismax

2015-07-21 Thread sandeep bonkra

Hi,

I am implementing searching using SOLR 5.0 and facing very strange problem.
I am having 4 fields Name and address, city and state in the document apart
from a unique ID.

My requirement is that it should give me those results first where there is
a match in name , then address, then state, city

Scenerio 1 : When searching *louis*
My query params is something like below
 q: person_full_name:*louis* OR address1:*louis* OR city:*louis* OR
state_code:*louis*
 qf: person_full_name^5.0 address1^0.8 city^0.7 state_code^1.0
 defType: edismax

 This is not giving results as per boost mentioned in qf param. This is
giving me result where city is getting matched first.
Score is coming as below:

 explain: {
  11470307: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n,
  11470282: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n,
  11470291: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(city:*louis*), product of:\n
  1.0 = boost\n0.0015872642 = queryNorm\n0.09090909 =
coord(1/11)\n,
  11470261: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n,
  11470328: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n,
  11470331: \n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n
},


Scenerio 2: But when I am matching 2 keywords. *louis cen*


 explain: {
  11470286: \n0.9805807 = (MATCH) product of:\n  1.9611614 =
(MATCH) sum of:\n0.49029034 = (MATCH) max of:\n  0.49029034 =
(MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
 5.0 = boost\n0.09805807 = queryNorm\n0.49029034 =
(MATCH) max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH)
max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH)
max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n,
  11470284: \n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455
= (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n
0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH)
max of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n,
  11470232: \n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455
= (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n
0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH)
max of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n,
  11469707: \n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455
= (MATCH)

Running SolrJ from Solr's REST API

2015-07-21 Thread Zheng Lin Edwin Yeo

Hi,

Would like to check, as I've created a SorJ program and exported it as an
Runnable JAR, how do I integrate it together with Solr so that I can call
this JAR directly from Solr's REST API?

Currently I can only run it on command prompt using the command java -jar
solrj.jar

I'm using Solr 5.2.1.


Regards,
Edwin

Re: Performance of facet contain search in 5.2.1

2015-07-21 Thread Erick Erickson

contains has to basically examine each and every term to see if it
matches. Say my
facet.contains=bbb. A matching term could be
aaabbbxyz
or
zzzbbbxyz

So there's no way to _know_ when you've found them all without
examining every last
one. So I'd try to redefine the problem to not require that. If it's
absolutely required,
you can do some interesting things but it's going to inflate your index.

For instance, rotate words (assuming word boundaries here). So, for
instance, you have
a text field with my dog has fleas. Index things like
my dog has fleas|my dog has fleas
dog has fleas my|my dog has fleas
has fleas my dog|my dog has fleas
fleas my dog has|my dog has fleas

Literally with the pipe followed by the original text. Now all your
contains clauses are
simple prefix facets, and you can have the UI split the token on the
pipe and display the
original.

Best,
Erick


On Tue, Jul 21, 2015 at 1:16 AM, Lo Dave dav...@hotmail.com wrote:
 I found that facet contain search take much longer time than facet prefix 
 search. Do anyone have idea how to make contain search faster?
 org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select 
 params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.prefix=duty+of+carerows=1wt=jsonfacet=true_=1437462916852}
  hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] 
 webapp=/solr path=/select 
 params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.contains=duty+of+carerows=1wt=jsonfacet=truefacet.contains.ignoreCase=true}
  hits=1856 status=0 QTime=10951
 As show above, prefix search take 5 but contain search take 10951
 Thanks.

Re: Tips for faster indexing

2015-07-21 Thread Vineeth Dasaraju

Hi,

Thank You Erick for your inputs. I tried creating batches of 1000 objects
and indexing it to solr. The performance is way better than before but I
find that number of indexed documents that is shown in the dashboard is
lesser than the number of documents that I had actually indexed through
solrj. My code is as follows:

private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore
;
private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json;
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException,
SolrServerException, ParseException {
File file = new File(JSON_FILE_PATH);
Scanner scn=new Scanner(file,UTF-8);
JSONObject object;
int i = 0;
CollectionSolrInputDocument batch = new
ArrayListSolrInputDocument();
while(scn.hasNext()){
object= (JSONObject) parser.parse(scn.nextLine());
SolrInputDocument doc = indexJSON(object);
batch.add(doc);
if(i%1000==0){
System.out.println(Indexed  + (i+1) +  objects. );
solr.add(batch);
batch = new ArrayListSolrInputDocument();
}
i++;
}
solr.add(batch);
solr.commit();
System.out.println(Indexed  + (i+1) +  objects. );
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
ParseException, IOException, SolrServerException {
CollectionSolrInputDocument batch = new
ArrayListSolrInputDocument();

SolrInputDocument mainEvent = new SolrInputDocument();
mainEvent.addField(id, generateID());
mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage));
mainEvent.addField(EventUid, jsonOBJ.get(EventUid));
mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector));
mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType));
mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent));
mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC));

Object obj = parser.parse(jsonOBJ.get(User).toString());
JSONObject userObj = (JSONObject) obj;

SolrInputDocument childUserEvent = new SolrInputDocument();
childUserEvent.addField(id, generateID());
childUserEvent.addField(User, userObj.get(User));

obj = parser.parse(jsonOBJ.get(EventDescription).toString());
JSONObject eventdescriptionObj = (JSONObject) obj;

SolrInputDocument childEventDescEvent = new SolrInputDocument();
childEventDescEvent.addField(id, generateID());
childEventDescEvent.addField(EventApplicationName,
eventdescriptionObj.get(EventApplicationName));
childEventDescEvent.addField(Query, eventdescriptionObj.get(Query));

obj= JSONValue.parse(eventdescriptionObj.get(Information).toString());
JSONArray informationArray = (JSONArray) obj;

for(int i = 0; iinformationArray.size(); i++){
JSONObject domain = (JSONObject) informationArray.get(i);

SolrInputDocument domainDoc = new SolrInputDocument();
domainDoc.addField(id, generateID());
domainDoc.addField(domainName, domain.get(domainName));

String s = domain.get(columns).toString();
obj= JSONValue.parse(s);
JSONArray ColumnsArray = (JSONArray) obj;

SolrInputDocument columnsDoc = new SolrInputDocument();
columnsDoc.addField(id, generateID());

for(int j = 0; jColumnsArray.size(); j++){
JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
SolrInputDocument columnDoc = new SolrInputDocument();
columnDoc.addField(id, generateID());
columnDoc.addField(movieName, ColumnsObj.get(movieName));
columnsDoc.addChildDocument(columnDoc);
}
domainDoc.addChildDocument(columnsDoc);
childEventDescEvent.addChildDocument(domainDoc);
}

mainEvent.addChildDocument(childEventDescEvent);
mainEvent.addChildDocument(childUserEvent);
return mainEvent;
}

I would be grateful if you could let me know what I am missing.

On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson erickerick...@gmail.com
wrote:

 First thing is it looks like you're only sending one document at a
 time, perhaps with child objects. This is not optimal at all. I
 usually batch my docs up in groups of 1,000, and there is anecdotal
 evidence that there may (depending on the docs) be some gains above
 that number. Gotta balance the batch size off against how bug the docs
 are of course.

 Assuming that you really are calling this method for one doc (and
 children) at a time, the far bigger problem other than calling
 server.add for each parent/children is that you're then calling
 solr.commit() every time. This is an anti-pattern. Generally, let the
 autoCommit setting in solrconfig.xml handle the intermediate commits
 while the indexing program is running and only issue a commit at the

IntelliJ setup

I followed the instructions here
https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant
idea`, but I'm still not getting the links in solr classes and methods; do
I need to add libraries, or am I missing something else?

Thanks!

Re: Parsing and indexing parts of the input file paths

Keeping to the user list (the right place for this question).

More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?

Upayavira

On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
 Dear user and dev lists,
 
 We are loading files from a directory and would like to index a portion
 of
 each file path as a field as well as the text inside the file.
 
 E.g., on HDFS we have this file path:
 
 /user/andrew/1234/1234/file.pdf
 
 And we would like the 1234 token parsed from the file path and indexed
 as
 an additional field that can be searched on.
 
 From my initial searches I can't see how to do this easily, so would I
 need
 to write some custom code, or a plugin?
 
 Thanks!

Re: Parsing and indexing parts of the input file paths

I'm not sure, it's a remote team but will get more info.  For now, assuming
that a certain directory is specified, like /user/andrew/, and a regex is
applied to capture anything two directories below matching */*/*.pdf.

Would there be a way to capture the wild-carded values and index them as
fields?

On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote:

 Keeping to the user list (the right place for this question).

 More information is needed here - how are you getting these documents
 into Solr? Are you posting them to /update/extract? Or using DIH, or?

 Upayavira

 On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
  Dear user and dev lists,
 
  We are loading files from a directory and would like to index a portion
  of
  each file path as a field as well as the text inside the file.
 
  E.g., on HDFS we have this file path:
 
  /user/andrew/1234/1234/file.pdf
 
  And we would like the 1234 token parsed from the file path and indexed
  as
  an additional field that can be searched on.
 
  From my initial searches I can't see how to do this easily, so would I
  need
  to write some custom code, or a plugin?
 
  Thanks!

Re: Parsing and indexing parts of the input file paths

Which can only happen if I post it to a web service, and won't happen if I
do it through config?

On Tue, Jul 21, 2015 at 2:19 PM, Upayavira u...@odoko.co.uk wrote:

 yes, unless it has been added consciously as a separate field.

 On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
  Thanks, so by the time we would get to an Analyzer the file path is
  forgotten?
 
  https://cwiki.apache.org/confluence/display/solr/Analyzers
 
  On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote:
 
   Solr generally does not interact with the file system in that way (with
   the exception of the DIH).
  
   It is the job of the code that pushes a file to Solr to process the
   filename and send that along with the request.
  
   See here for more info:
  
  
 https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
  
   You could provide literal.filename=blah/blah
  
   Upayavira
  
  
   On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
I'm not sure, it's a remote team but will get more info.  For now,
assuming
that a certain directory is specified, like /user/andrew/, and a
 regex
is
applied to capture anything two directories below matching
 */*/*.pdf.
   
Would there be a way to capture the wild-carded values and index
 them as
fields?
   
On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote:
   
 Keeping to the user list (the right place for this question).

 More information is needed here - how are you getting these
 documents
 into Solr? Are you posting them to /update/extract? Or using DIH,
 or?

 Upayavira

 On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
  Dear user and dev lists,
 
  We are loading files from a directory and would like to index a
   portion
  of
  each file path as a field as well as the text inside the file.
 
  E.g., on HDFS we have this file path:
 
  /user/andrew/1234/1234/file.pdf
 
  And we would like the 1234 token parsed from the file path and
   indexed
  as
  an additional field that can be searched on.
 
  From my initial searches I can't see how to do this easily, so
 would
   I
  need
  to write some custom code, or a plugin?
 
  Thanks!

Re: Parsing and indexing parts of the input file paths

Solr generally does not interact with the file system in that way (with
the exception of the DIH).

It is the job of the code that pushes a file to Solr to process the
filename and send that along with the request.

See here for more info:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You could provide literal.filename=blah/blah

Upayavira

On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
I'm not sure, it's a remote team but will get more info. For now,
assuming
that a certain directory is specified, like /user/andrew/, and a regex
is
applied to capture anything two directories below matching */*/*.pdf.

Would there be a way to capture the wild-carded values and index them as
fields?

On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote:

Keeping to the user list (the right place for this question).

More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?

Upayavira

On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
Dear user and dev lists,

We are loading files from a directory and would like to index a portion
of
each file path as a field as well as the text inside the file.

E.g., on HDFS we have this file path:

/user/andrew/1234/1234/file.pdf

And we would like the 1234 token parsed from the file path and indexed
as
an additional field that can be searched on.

From my initial searches I can't see how to do this easily, so would I
need
to write some custom code, or a plugin?

Thanks!

Re: Tips for faster indexing

Are you making sure that every document has a unique ID? Index into an
empty Solr, then look at your maxdocs vs numdocs. If they are different
(maxdocs is higher) then some of your documents have been deleted,
meaning some were overwritten.

That might be a place to look.

Upayavira

On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
 I can confirm this behavior, seen when sending json docs in batch, never
 happens when sending one by one, but sporadic when sending batches.
 
 Like if sole/jetty drops couple of documents out of the batch.
 
 Regards
 
  On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote:
  
  Hi,
  
  Thank You Erick for your inputs. I tried creating batches of 1000 objects
  and indexing it to solr. The performance is way better than before but I
  find that number of indexed documents that is shown in the dashboard is
  lesser than the number of documents that I had actually indexed through
  solrj. My code is as follows:
  
  private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore
  ;
  private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json;
  private static JSONParser parser = new JSONParser();
  private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
  
  public static void main(String[] args) throws IOException,
  SolrServerException, ParseException {
 File file = new File(JSON_FILE_PATH);
 Scanner scn=new Scanner(file,UTF-8);
 JSONObject object;
 int i = 0;
 CollectionSolrInputDocument batch = new
  ArrayListSolrInputDocument();
 while(scn.hasNext()){
 object= (JSONObject) parser.parse(scn.nextLine());
 SolrInputDocument doc = indexJSON(object);
 batch.add(doc);
 if(i%1000==0){
 System.out.println(Indexed  + (i+1) +  objects. );
 solr.add(batch);
 batch = new ArrayListSolrInputDocument();
 }
 i++;
 }
 solr.add(batch);
 solr.commit();
 System.out.println(Indexed  + (i+1) +  objects. );
  }
  
  public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
  ParseException, IOException, SolrServerException {
 CollectionSolrInputDocument batch = new
  ArrayListSolrInputDocument();
  
 SolrInputDocument mainEvent = new SolrInputDocument();
 mainEvent.addField(id, generateID());
 mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage));
 mainEvent.addField(EventUid, jsonOBJ.get(EventUid));
 mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector));
 mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType));
 mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent));
 mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC));
  
 Object obj = parser.parse(jsonOBJ.get(User).toString());
 JSONObject userObj = (JSONObject) obj;
  
 SolrInputDocument childUserEvent = new SolrInputDocument();
 childUserEvent.addField(id, generateID());
 childUserEvent.addField(User, userObj.get(User));
  
 obj = parser.parse(jsonOBJ.get(EventDescription).toString());
 JSONObject eventdescriptionObj = (JSONObject) obj;
  
 SolrInputDocument childEventDescEvent = new SolrInputDocument();
 childEventDescEvent.addField(id, generateID());
 childEventDescEvent.addField(EventApplicationName,
  eventdescriptionObj.get(EventApplicationName));
 childEventDescEvent.addField(Query, eventdescriptionObj.get(Query));
  
 obj= JSONValue.parse(eventdescriptionObj.get(Information).toString());
 JSONArray informationArray = (JSONArray) obj;
  
 for(int i = 0; iinformationArray.size(); i++){
 JSONObject domain = (JSONObject) informationArray.get(i);
  
 SolrInputDocument domainDoc = new SolrInputDocument();
 domainDoc.addField(id, generateID());
 domainDoc.addField(domainName, domain.get(domainName));
  
 String s = domain.get(columns).toString();
 obj= JSONValue.parse(s);
 JSONArray ColumnsArray = (JSONArray) obj;
  
 SolrInputDocument columnsDoc = new SolrInputDocument();
 columnsDoc.addField(id, generateID());
  
 for(int j = 0; jColumnsArray.size(); j++){
 JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
 SolrInputDocument columnDoc = new SolrInputDocument();
 columnDoc.addField(id, generateID());
 columnDoc.addField(movieName, ColumnsObj.get(movieName));
 columnsDoc.addChildDocument(columnDoc);
 }
 domainDoc.addChildDocument(columnsDoc);
 childEventDescEvent.addChildDocument(domainDoc);
 }
  
 mainEvent.addChildDocument(childEventDescEvent);
 mainEvent.addChildDocument(childUserEvent);
 return mainEvent;
  }
  
  I would be grateful if you could let me know what I am missing.
  
  On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson

Re: Parsing and indexing parts of the input file paths

Thanks, so by the time we would get to an Analyzer the file path is
forgotten?

https://cwiki.apache.org/confluence/display/solr/Analyzers

On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote:

 Solr generally does not interact with the file system in that way (with
 the exception of the DIH).

 It is the job of the code that pushes a file to Solr to process the
 filename and send that along with the request.

 See here for more info:

 https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

 You could provide literal.filename=blah/blah

 Upayavira


 On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
  I'm not sure, it's a remote team but will get more info.  For now,
  assuming
  that a certain directory is specified, like /user/andrew/, and a regex
  is
  applied to capture anything two directories below matching */*/*.pdf.
 
  Would there be a way to capture the wild-carded values and index them as
  fields?
 
  On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote:
 
   Keeping to the user list (the right place for this question).
  
   More information is needed here - how are you getting these documents
   into Solr? Are you posting them to /update/extract? Or using DIH, or?
  
   Upayavira
  
   On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
Dear user and dev lists,
   
We are loading files from a directory and would like to index a
 portion
of
each file path as a field as well as the text inside the file.
   
E.g., on HDFS we have this file path:
   
/user/andrew/1234/1234/file.pdf
   
And we would like the 1234 token parsed from the file path and
 indexed
as
an additional field that can be searched on.
   
From my initial searches I can't see how to do this easily, so would
 I
need
to write some custom code, or a plugin?
   
Thanks!

Re: Tips for faster indexing

2015-07-21 Thread solr . user . 1507

I can confirm this behavior, seen when sending json docs in batch, never 
happens when sending one by one, but sporadic when sending batches.

Like if sole/jetty drops couple of documents out of the batch.

Regards

 On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote:
 
 Hi,
 
 Thank You Erick for your inputs. I tried creating batches of 1000 objects
 and indexing it to solr. The performance is way better than before but I
 find that number of indexed documents that is shown in the dashboard is
 lesser than the number of documents that I had actually indexed through
 solrj. My code is as follows:
 
 private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore
 ;
 private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json;
 private static JSONParser parser = new JSONParser();
 private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
 
 public static void main(String[] args) throws IOException,
 SolrServerException, ParseException {
File file = new File(JSON_FILE_PATH);
Scanner scn=new Scanner(file,UTF-8);
JSONObject object;
int i = 0;
CollectionSolrInputDocument batch = new
 ArrayListSolrInputDocument();
while(scn.hasNext()){
object= (JSONObject) parser.parse(scn.nextLine());
SolrInputDocument doc = indexJSON(object);
batch.add(doc);
if(i%1000==0){
System.out.println(Indexed  + (i+1) +  objects. );
solr.add(batch);
batch = new ArrayListSolrInputDocument();
}
i++;
}
solr.add(batch);
solr.commit();
System.out.println(Indexed  + (i+1) +  objects. );
 }
 
 public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
 ParseException, IOException, SolrServerException {
CollectionSolrInputDocument batch = new
 ArrayListSolrInputDocument();
 
SolrInputDocument mainEvent = new SolrInputDocument();
mainEvent.addField(id, generateID());
mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage));
mainEvent.addField(EventUid, jsonOBJ.get(EventUid));
mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector));
mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType));
mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent));
mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC));
 
Object obj = parser.parse(jsonOBJ.get(User).toString());
JSONObject userObj = (JSONObject) obj;
 
SolrInputDocument childUserEvent = new SolrInputDocument();
childUserEvent.addField(id, generateID());
childUserEvent.addField(User, userObj.get(User));
 
obj = parser.parse(jsonOBJ.get(EventDescription).toString());
JSONObject eventdescriptionObj = (JSONObject) obj;
 
SolrInputDocument childEventDescEvent = new SolrInputDocument();
childEventDescEvent.addField(id, generateID());
childEventDescEvent.addField(EventApplicationName,
 eventdescriptionObj.get(EventApplicationName));
childEventDescEvent.addField(Query, eventdescriptionObj.get(Query));
 
obj= JSONValue.parse(eventdescriptionObj.get(Information).toString());
JSONArray informationArray = (JSONArray) obj;
 
for(int i = 0; iinformationArray.size(); i++){
JSONObject domain = (JSONObject) informationArray.get(i);
 
SolrInputDocument domainDoc = new SolrInputDocument();
domainDoc.addField(id, generateID());
domainDoc.addField(domainName, domain.get(domainName));
 
String s = domain.get(columns).toString();
obj= JSONValue.parse(s);
JSONArray ColumnsArray = (JSONArray) obj;
 
SolrInputDocument columnsDoc = new SolrInputDocument();
columnsDoc.addField(id, generateID());
 
for(int j = 0; jColumnsArray.size(); j++){
JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
SolrInputDocument columnDoc = new SolrInputDocument();
columnDoc.addField(id, generateID());
columnDoc.addField(movieName, ColumnsObj.get(movieName));
columnsDoc.addChildDocument(columnDoc);
}
domainDoc.addChildDocument(columnsDoc);
childEventDescEvent.addChildDocument(domainDoc);
}
 
mainEvent.addChildDocument(childEventDescEvent);
mainEvent.addChildDocument(childUserEvent);
return mainEvent;
 }
 
 I would be grateful if you could let me know what I am missing.
 
 On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
 First thing is it looks like you're only sending one document at a
 time, perhaps with child objects. This is not optimal at all. I
 usually batch my docs up in groups of 1,000, and there is anecdotal
 evidence that there may (depending on the docs) be some gains above
 that number. Gotta balance the batch size off against how bug the docs
 are of course.
 
 Assuming that you really are calling this method for one doc (and

Re: Issue with using createNodeSet in Solr Cloud

2015-07-21 Thread Savvas Andreas Moysidis

Ah, nice tip, thanks! This could also make scripts more portable too.

Cheers,
Savvas

On 21 July 2015 at 08:40, Upayavira u...@odoko.co.uk wrote:

Note, when you start up the instances, you can pass in a hostname to use
instead of the IP address. If you are using bin/solr (which you should
be!!) then you can use bin/solr -h my-host-name and that'll be used in
place of the IP.

Upayavira

On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote:
Glad you found a solution

Best,
Erick

On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Erick, spot on!

The nodes had been registered in zookeeper under my network
interface's IP
address...after specifying those the command worked just fine.

It was indeed the thing I thought was true that wasn't... :)

Many thanks,
Savvas

On 18 July 2015 at 20:47, Erick Erickson erickerick...@gmail.com
wrote:

P.S.

It ain't the things ya don't know that'll kill ya, it's the things ya
_do_ know that ain't so...

On Sat, Jul 18, 2015 at 12:46 PM, Erick Erickson
erickerick...@gmail.com wrote:
Could you post your clusterstate.json? Or at least the live nodes
section of your ZK config? (adminUIcloudtreelive_nodes. The
addresses of my nodes are things like 192.168.1.201:8983_solr. I'm
wondering if you're taking your node names from the information ZK
records or assuming it's 127.0.0.1

On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Thanks Eric,

The strange thing is that although I have set the log level to
ALL I
see
no error messages in the logs (apart from the line saying that the
response
is a 400 one).

I'm quite confident the configset does exist as the collection gets
created
fine if I don't specify the createNodeSet param.

Complete mystery..! I'll keep on troubleshooting and report back
with my
findings.

Cheers,
Savvas

On 17 July 2015 at 02:14, Erick Erickson erickerick...@gmail.com
wrote:

There were a couple of cases where the no live servers was being
returned when the error was something completely different. Does
the
Solr log show something more useful? And are you sure you have a
configset named collection_A?

'cause this works (admittedly on 5.x) fine for me, and I'm quite
sure
there are bunches of automated tests that would be failing so I
suspect it's just a misleading error being returned.

Best,
Erick

On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Hello There,

I am trying to use the createNodeSet parameter when creating a
new
collection but I'm getting an error when doing so.

More specifically, I have four Solr instances running locally in
separate
JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985,
127.0.0.1:8986
)
and a
standalone Zookeeper instance which all Solr instances point
to. The
four
Solr instances have no collections added to them and are all up
and
running
(I can access the admin page in all of them).

Now, I want to create a collections in only two of these four
instances (
127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance
with the
following URL:

http://localhost:8983/solr/admin/collections?action=CREATEname=collection_AnumShards=1replicationFactor=2maxShardsPerNode=1createNodeSet=127.0.0.1:8983_solr,127.0.0.1:8984_solrcollection.configName=collection_A

I am getting the following response:

response
lst name=responseHeader
int name=status400/int
int name=QTime3503/int
/lst
str name=Operation createcollection caused exception:

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Cannot create collection collection_A. No live Solr-instances
among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
lst name=exception
str name=msg
Cannot create collection collection_A. No live Solr-instances
among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
int name=rspCode400/int
/lst
lst name=error
str name=msg
Cannot create collection collection_A. No live Solr-instances
among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
int name=code400/int
/lst
/response

The instances are definitely up and running (at least the admin
console
can
be accessed as mentioned) and if I remove the createNodeSet
parameter the
collection is created as expected.

Am I missing something obvious or is this a bug?

Re: Parsing and indexing parts of the input file paths

yes, unless it has been added consciously as a separate field.

On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
 Thanks, so by the time we would get to an Analyzer the file path is
 forgotten?
 
 https://cwiki.apache.org/confluence/display/solr/Analyzers
 
 On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote:
 
  Solr generally does not interact with the file system in that way (with
  the exception of the DIH).
 
  It is the job of the code that pushes a file to Solr to process the
  filename and send that along with the request.
 
  See here for more info:
 
  https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
 
  You could provide literal.filename=blah/blah
 
  Upayavira
 
 
  On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
   I'm not sure, it's a remote team but will get more info.  For now,
   assuming
   that a certain directory is specified, like /user/andrew/, and a regex
   is
   applied to capture anything two directories below matching */*/*.pdf.
  
   Would there be a way to capture the wild-carded values and index them as
   fields?
  
   On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote:
  
Keeping to the user list (the right place for this question).
   
More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?
   
Upayavira
   
On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
 Dear user and dev lists,

 We are loading files from a directory and would like to index a
  portion
 of
 each file path as a field as well as the text inside the file.

 E.g., on HDFS we have this file path:

 /user/andrew/1234/1234/file.pdf

 And we would like the 1234 token parsed from the file path and
  indexed
 as
 an additional field that can be searched on.

 From my initial searches I can't see how to do this easily, so would
  I
 need
 to write some custom code, or a plugin?

 Thanks!

Re: Tips for faster indexing

2015-07-21 Thread Vineeth Dasaraju

Hi Upayavira,

I guess that is the problem. I am currently using a function for generating
an ID. It takes the current date and time to milliseconds and generates the
id. This is the function.

public static String generateID(){
Date dNow = new Date();
SimpleDateFormat ft = new SimpleDateFormat(yyMMddhhmmssMs);
String datetime = ft.format(dNow);
return datetime;
}


I believe that despite having a millisecond precision in the id generation,
multiple objects are being assigned the same ID. Can you suggest a better
way to generate the ID?

Regards,
Vineeth


On Tue, Jul 21, 2015 at 1:29 PM, Upayavira u...@odoko.co.uk wrote:

 Are you making sure that every document has a unique ID? Index into an
 empty Solr, then look at your maxdocs vs numdocs. If they are different
 (maxdocs is higher) then some of your documents have been deleted,
 meaning some were overwritten.

 That might be a place to look.

 Upayavira

 On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
  I can confirm this behavior, seen when sending json docs in batch, never
  happens when sending one by one, but sporadic when sending batches.
 
  Like if sole/jetty drops couple of documents out of the batch.
 
  Regards
 
   On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com
 wrote:
  
   Hi,
  
   Thank You Erick for your inputs. I tried creating batches of 1000
 objects
   and indexing it to solr. The performance is way better than before but
 I
   find that number of indexed documents that is shown in the dashboard is
   lesser than the number of documents that I had actually indexed through
   solrj. My code is as follows:
  
   private static String SOLR_SERVER_URL = 
 http://localhost:8983/solr/newcore
   ;
   private static String JSON_FILE_PATH =
 /home/vineeth/week1_fixed.json;
   private static JSONParser parser = new JSONParser();
   private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
  
   public static void main(String[] args) throws IOException,
   SolrServerException, ParseException {
  File file = new File(JSON_FILE_PATH);
  Scanner scn=new Scanner(file,UTF-8);
  JSONObject object;
  int i = 0;
  CollectionSolrInputDocument batch = new
   ArrayListSolrInputDocument();
  while(scn.hasNext()){
  object= (JSONObject) parser.parse(scn.nextLine());
  SolrInputDocument doc = indexJSON(object);
  batch.add(doc);
  if(i%1000==0){
  System.out.println(Indexed  + (i+1) +  objects. );
  solr.add(batch);
  batch = new ArrayListSolrInputDocument();
  }
  i++;
  }
  solr.add(batch);
  solr.commit();
  System.out.println(Indexed  + (i+1) +  objects. );
   }
  
   public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
   ParseException, IOException, SolrServerException {
  CollectionSolrInputDocument batch = new
   ArrayListSolrInputDocument();
  
  SolrInputDocument mainEvent = new SolrInputDocument();
  mainEvent.addField(id, generateID());
  mainEvent.addField(RawEventMessage,
 jsonOBJ.get(RawEventMessage));
  mainEvent.addField(EventUid, jsonOBJ.get(EventUid));
  mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector));
  mainEvent.addField(EventMessageType,
 jsonOBJ.get(EventMessageType));
  mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent));
  mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC));
  
  Object obj = parser.parse(jsonOBJ.get(User).toString());
  JSONObject userObj = (JSONObject) obj;
  
  SolrInputDocument childUserEvent = new SolrInputDocument();
  childUserEvent.addField(id, generateID());
  childUserEvent.addField(User, userObj.get(User));
  
  obj = parser.parse(jsonOBJ.get(EventDescription).toString());
  JSONObject eventdescriptionObj = (JSONObject) obj;
  
  SolrInputDocument childEventDescEvent = new SolrInputDocument();
  childEventDescEvent.addField(id, generateID());
  childEventDescEvent.addField(EventApplicationName,
   eventdescriptionObj.get(EventApplicationName));
  childEventDescEvent.addField(Query,
 eventdescriptionObj.get(Query));
  
  obj=
 JSONValue.parse(eventdescriptionObj.get(Information).toString());
  JSONArray informationArray = (JSONArray) obj;
  
  for(int i = 0; iinformationArray.size(); i++){
  JSONObject domain = (JSONObject) informationArray.get(i);
  
  SolrInputDocument domainDoc = new SolrInputDocument();
  domainDoc.addField(id, generateID());
  domainDoc.addField(domainName, domain.get(domainName));
  
  String s = domain.get(columns).toString();
  obj= JSONValue.parse(s);
  JSONArray ColumnsArray = (JSONArray) obj;
  
  SolrInputDocument columnsDoc = new SolrInputDocument();
  columnsDoc.addField(id,

Re: Tips for faster indexing

2015-07-21 Thread Fadi Mohsen

In Java: UUID.randomUUID();

That is what I'm using.

Regards

 On 21 Jul 2015, at 22:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote:
 
 Hi Upayavira,
 
 I guess that is the problem. I am currently using a function for generating
 an ID. It takes the current date and time to milliseconds and generates the
 id. This is the function.
 
 public static String generateID(){
Date dNow = new Date();
SimpleDateFormat ft = new SimpleDateFormat(yyMMddhhmmssMs);
String datetime = ft.format(dNow);
return datetime;
}
 
 
 I believe that despite having a millisecond precision in the id generation,
 multiple objects are being assigned the same ID. Can you suggest a better
 way to generate the ID?
 
 Regards,
 Vineeth
 
 
 On Tue, Jul 21, 2015 at 1:29 PM, Upayavira u...@odoko.co.uk wrote:
 
 Are you making sure that every document has a unique ID? Index into an
 empty Solr, then look at your maxdocs vs numdocs. If they are different
 (maxdocs is higher) then some of your documents have been deleted,
 meaning some were overwritten.
 
 That might be a place to look.
 
 Upayavira
 
 On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
 I can confirm this behavior, seen when sending json docs in batch, never
 happens when sending one by one, but sporadic when sending batches.
 
 Like if sole/jetty drops couple of documents out of the batch.
 
 Regards
 
 On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com
 wrote:
 
 Hi,
 
 Thank You Erick for your inputs. I tried creating batches of 1000
 objects
 and indexing it to solr. The performance is way better than before but
 I
 find that number of indexed documents that is shown in the dashboard is
 lesser than the number of documents that I had actually indexed through
 solrj. My code is as follows:
 
 private static String SOLR_SERVER_URL = 
 http://localhost:8983/solr/newcore
 ;
 private static String JSON_FILE_PATH =
 /home/vineeth/week1_fixed.json;
 private static JSONParser parser = new JSONParser();
 private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
 
 public static void main(String[] args) throws IOException,
 SolrServerException, ParseException {
   File file = new File(JSON_FILE_PATH);
   Scanner scn=new Scanner(file,UTF-8);
   JSONObject object;
   int i = 0;
   CollectionSolrInputDocument batch = new
 ArrayListSolrInputDocument();
   while(scn.hasNext()){
   object= (JSONObject) parser.parse(scn.nextLine());
   SolrInputDocument doc = indexJSON(object);
   batch.add(doc);
   if(i%1000==0){
   System.out.println(Indexed  + (i+1) +  objects. );
   solr.add(batch);
   batch = new ArrayListSolrInputDocument();
   }
   i++;
   }
   solr.add(batch);
   solr.commit();
   System.out.println(Indexed  + (i+1) +  objects. );
 }
 
 public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
 ParseException, IOException, SolrServerException {
   CollectionSolrInputDocument batch = new
 ArrayListSolrInputDocument();
 
   SolrInputDocument mainEvent = new SolrInputDocument();
   mainEvent.addField(id, generateID());
   mainEvent.addField(RawEventMessage,
 jsonOBJ.get(RawEventMessage));
   mainEvent.addField(EventUid, jsonOBJ.get(EventUid));
   mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector));
   mainEvent.addField(EventMessageType,
 jsonOBJ.get(EventMessageType));
   mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent));
   mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC));
 
   Object obj = parser.parse(jsonOBJ.get(User).toString());
   JSONObject userObj = (JSONObject) obj;
 
   SolrInputDocument childUserEvent = new SolrInputDocument();
   childUserEvent.addField(id, generateID());
   childUserEvent.addField(User, userObj.get(User));
 
   obj = parser.parse(jsonOBJ.get(EventDescription).toString());
   JSONObject eventdescriptionObj = (JSONObject) obj;
 
   SolrInputDocument childEventDescEvent = new SolrInputDocument();
   childEventDescEvent.addField(id, generateID());
   childEventDescEvent.addField(EventApplicationName,
 eventdescriptionObj.get(EventApplicationName));
   childEventDescEvent.addField(Query,
 eventdescriptionObj.get(Query));
 
   obj=
 JSONValue.parse(eventdescriptionObj.get(Information).toString());
   JSONArray informationArray = (JSONArray) obj;
 
   for(int i = 0; iinformationArray.size(); i++){
   JSONObject domain = (JSONObject) informationArray.get(i);
 
   SolrInputDocument domainDoc = new SolrInputDocument();
   domainDoc.addField(id, generateID());
   domainDoc.addField(domainName, domain.get(domainName));
 
   String s = domain.get(columns).toString();
   obj= JSONValue.parse(s);
   JSONArray ColumnsArray = (JSONArray) obj;
 
   SolrInputDocument columnsDoc = new SolrInputDocument();
   columnsDoc.addField(id, generateID());
 
   for(int j = 0;

Re: IntelliJ setup

2015-07-21 Thread Konstantin Gribov

Try invalidate caches and restart in IDEA, remove .idea directory in
lucene-solr dir. After that run ant idea and re-open project.

Also, you have to, at least, close project, run ant idea and re-open it
if switching between too diverged branches (e.g., 4.10 and 5_x).

вт, 21 июля 2015 г. в 21:53, Andrew Musselman andrew.mussel...@gmail.com:

 I followed the instructions here
 https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant
 idea`, but I'm still not getting the links in solr classes and methods; do
 I need to add libraries, or am I missing something else?

 Thanks!

-- 
Best regards,
Konstantin Gribov

Re: IntelliJ setup

Bingo, thanks!

On Tue, Jul 21, 2015 at 4:12 PM, Konstantin Gribov gros...@gmail.com
wrote:

 Try invalidate caches and restart in IDEA, remove .idea directory in
 lucene-solr dir. After that run ant idea and re-open project.

 Also, you have to, at least, close project, run ant idea and re-open it
 if switching between too diverged branches (e.g., 4.10 and 5_x).

 вт, 21 июля 2015 г. в 21:53, Andrew Musselman andrew.mussel...@gmail.com
 :

  I followed the instructions here
  https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including
 `ant
  idea`, but I'm still not getting the links in solr classes and methods;
 do
  I need to add libraries, or am I missing something else?
 
  Thanks!
 
 --
 Best regards,
 Konstantin Gribov

Re: WordDelimiterFilter Leading Trailing Special Character

2015-07-21 Thread Sathiya N Sundararajan

Upayavira,

thanks for the helpful suggestion, that works. I was looking for an option
to turn off/circumvent that particular WordDelimiterFilter's behavior
completely. Since our indexes are hundred's of Terabytes, every time we
find a term that needs to be added, it will be a cumbersome process to
reload all the cores.


thanks

On Tue, Jul 21, 2015 at 12:57 AM, Upayavira u...@odoko.co.uk wrote:

 Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
 this config:

  fieldType name=text_wd class=solr.TextField
  positionIncrementGap=100
analyzer
  tokenizer class=solr.WhitespaceTokenizerFactory/
  filter class=solr.WordDelimiterFilterFactory
  protected=protectedword.txt
  preserveOriginal=0 splitOnNumerics=1
  splitOnCaseChange=1
  catenateWords=0 catenateNumbers=0 catenateAll=0
  generateWordParts=1 generateNumberParts=1
  stemEnglishPossessive=1
  types=wdfftypes.txt /
/analyzer
  /fieldType

 Note the protected=x attribute. I suspect if you put Yahoo! into a
 file referenced by that attribute, it may survive analysis. I'd be
 curious to hear whether it works.

 Upayavira

 On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
  Question about WordDelimiterFilter. The search behavior that we
  experience
  with WordDelimiterFilter satisfies well, except for the case where there
  is
  a special character either at the leading or trailing end of the term.
 
  For instance:
 
  *‘db’ *  —  Works as expected. Finds all docs with ‘db’.
  *‘p!nk’*  —  Works fine as above.
 
  But on cases when, there is a special character towards the trailing end
  of
  the term, like ‘Yahoo!’
 
  *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the
  special
  character *‘!’* stripped out.  This WordDelimiterFilter behavior is
  documented
 
 http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
 
  What I would like to have is, the search performed without stripping out
  the leading  trailing special character. Is there a way to achieve this
  behavior with WordDelimiterFilter.
 
  This is current config that we have for the field:
 
  fieldType name=text_wdf class=solr.TextField
  positionIncrementGap=100
  analyzer type=index
  tokenizer class=solr.WhitespaceTokenizerFactory /
  filter class=solr.WordDelimiterFilterFactory
  splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
  catenateWords=0 catenateNumbers=0 catenateAll=0
  preserveOriginal=1
  types=specialchartypes.txt/
  filter class=solr.LowerCaseFilterFactory /
  /analyzer
  analyzer type=query
  tokenizer class=solr.WhitespaceTokenizerFactory /
  filter class=solr.WordDelimiterFilterFactory
  splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
  catenateWords=0 catenateNumbers=0 catenateAll=0
  preserveOriginal=1
  types=specialchartypes.txt/
  filter class=solr.LowerCaseFilterFactory /
  /analyzer
  /fieldType
 
 
  thanks

Re: WordDelimiterFilter Leading Trailing Special Character

2015-07-21 Thread Jack Krupansky

You can also use the types attribute to change the type of specific
characters, such as to treat the ! or  as an ALPHA.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 7:43 PM, Sathiya N Sundararajan ausat...@gmail.com
wrote:

 Upayavira,

 thanks for the helpful suggestion, that works. I was looking for an option
 to turn off/circumvent that particular WordDelimiterFilter's behavior
 completely. Since our indexes are hundred's of Terabytes, every time we
 find a term that needs to be added, it will be a cumbersome process to
 reload all the cores.


 thanks

 On Tue, Jul 21, 2015 at 12:57 AM, Upayavira u...@odoko.co.uk wrote:

  Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
  this config:
 
   fieldType name=text_wd class=solr.TextField
   positionIncrementGap=100
 analyzer
   tokenizer class=solr.WhitespaceTokenizerFactory/
   filter class=solr.WordDelimiterFilterFactory
   protected=protectedword.txt
   preserveOriginal=0 splitOnNumerics=1
   splitOnCaseChange=1
   catenateWords=0 catenateNumbers=0 catenateAll=0
   generateWordParts=1 generateNumberParts=1
   stemEnglishPossessive=1
   types=wdfftypes.txt /
 /analyzer
   /fieldType
 
  Note the protected=x attribute. I suspect if you put Yahoo! into a
  file referenced by that attribute, it may survive analysis. I'd be
  curious to hear whether it works.
 
  Upayavira
 
  On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
   Question about WordDelimiterFilter. The search behavior that we
   experience
   with WordDelimiterFilter satisfies well, except for the case where
 there
   is
   a special character either at the leading or trailing end of the term.
  
   For instance:
  
   *‘db’ *  —  Works as expected. Finds all docs with ‘db’.
   *‘p!nk’*  —  Works fine as above.
  
   But on cases when, there is a special character towards the trailing
 end
   of
   the term, like ‘Yahoo!’
  
   *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the
   special
   character *‘!’* stripped out.  This WordDelimiterFilter behavior is
   documented
  
 
 http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
  
   What I would like to have is, the search performed without stripping
 out
   the leading  trailing special character. Is there a way to achieve
 this
   behavior with WordDelimiterFilter.
  
   This is current config that we have for the field:
  
   fieldType name=text_wdf class=solr.TextField
   positionIncrementGap=100
   analyzer type=index
   tokenizer class=solr.WhitespaceTokenizerFactory /
   filter class=solr.WordDelimiterFilterFactory
   splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
   catenateWords=0 catenateNumbers=0 catenateAll=0
   preserveOriginal=1
   types=specialchartypes.txt/
   filter class=solr.LowerCaseFilterFactory /
   /analyzer
   analyzer type=query
   tokenizer class=solr.WhitespaceTokenizerFactory /
   filter class=solr.WordDelimiterFilterFactory
   splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
   catenateWords=0 catenateNumbers=0 catenateAll=0
   preserveOriginal=1
   types=specialchartypes.txt/
   filter class=solr.LowerCaseFilterFactory /
   /analyzer
   /fieldType
  
  
   thanks

Re: Issue with using createNodeSet in Solr Cloud

Upayavira

On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote:
Glad you found a solution

Best,
Erick

On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Erick, spot on!

The nodes had been registered in zookeeper under my network interface's IP
address...after specifying those the command worked just fine.

It was indeed the thing I thought was true that wasn't... :)

Many thanks,
Savvas

On 18 July 2015 at 20:47, Erick Erickson erickerick...@gmail.com wrote:

P.S.

It ain't the things ya don't know that'll kill ya, it's the things ya
_do_ know that ain't so...

On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Thanks Eric,

The strange thing is that although I have set the log level to ALL I
see
no error messages in the logs (apart from the line saying that the
response
is a 400 one).

I'm quite confident the configset does exist as the collection gets
created
fine if I don't specify the createNodeSet param.

Complete mystery..! I'll keep on troubleshooting and report back with my
findings.

Cheers,
Savvas

On 17 July 2015 at 02:14, Erick Erickson erickerick...@gmail.com
wrote:

There were a couple of cases where the no live servers was being
returned when the error was something completely different. Does the
Solr log show something more useful? And are you sure you have a
configset named collection_A?

'cause this works (admittedly on 5.x) fine for me, and I'm quite sure
there are bunches of automated tests that would be failing so I
suspect it's just a misleading error being returned.

Best,
Erick

On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis
savvas.andreas.moysi...@gmail.com wrote:
Hello There,

I am trying to use the createNodeSet parameter when creating a new
collection but I'm getting an error when doing so.

More specifically, I have four Solr instances running locally in
separate
JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985, 127.0.0.1:8986
)
and a
standalone Zookeeper instance which all Solr instances point to. The
four
Solr instances have no collections added to them and are all up and
running
(I can access the admin page in all of them).

Now, I want to create a collections in only two of these four
instances (
127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance with the
following URL:

I am getting the following response:

response
lst name=responseHeader
int name=status400/int
int name=QTime3503/int
/lst
str name=Operation createcollection caused exception:

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Cannot create collection collection_A. No live Solr-instances among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
lst name=exception
str name=msg
Cannot create collection collection_A. No live Solr-instances among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
int name=rspCode400/int
/lst
lst name=error
str name=msg
Cannot create collection collection_A. No live Solr-instances among
Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
127.0.0.1:8984
_solr
/str
int name=code400/int
/lst
/response

The instances are definitely up and running (at least the admin
console
can
be accessed as mentioned) and if I remove the createNodeSet
parameter the
collection is created as expected.

Am I missing something obvious or is this a bug?

The exact Solr version I'm using is 4.9.1.

Any pointers would be much appreciated.

Thanks,
Savvas

Re: SOLR nrt read writes

Bhawna,

I think you need to reconcile yourself to the fact that what you want to
achieve is not going to be possible.

Solr (and Lucene underneath it) is HEAVILY optimised for high read/low
write situations, and that leads to some latency in content reaching the
index. If you wanted to change this, you'd have to get into some heavy
Java/Lucene coding, as I believe Twitter have done on Lucene itself.

I'd say, rather than attempting to change this, I'd say you need to work
out a way in your UI to handle this situation. E.g. have a refresh on
stale results button, or not seeing your data, try here. Or, if a
user submits data, then wants to search for it in the same session, have
your UI enforce a minimum 10s delay before it sends a request to Solr,
or something like that. Efforts to solve this at the Solr end, without
spending substantial sums and effort on it, will be futile as it isn't
what Solr/Lucene are designed for.

Upayavira

On Tue, Jul 21, 2015, at 06:02 AM, Bhawna Asnani wrote:
 Thanks, I tried turning off auto softCommits but that didn't help much.
 Still seeing stale results every now and then. Also load on the server
 very light. We are running this just on a test server with one or two
 users. I don't see any warning in logs whole doing softCommits and it
 says it successfully opened new searcher and registered it as main
 searcher. Could this be due to caching? I have tried to disable all in my
 solrconfig.
 
 Sent from my iPhone
 
  On Jul 20, 2015, at 12:16 PM, Shawn Heisey apa...@elyograg.org wrote:
  
  On 7/20/2015 9:29 AM, Bhawna Asnani wrote:
  Thanks for your suggestions. The requirement is still the same , to be
  able to make a change to some solr documents and be able to see it on
  subsequent search/facet calls.
  I am using softCommit with waitSearcher=true.
  
  Also I am sending reads/writes to a single solr node only.
  I have tried disabling caches and warmup time in logs is '0' but every
  once in a while I do get the document just updated with stale data.
  
  I went through lucene documentation and it seems opening the
  IndexReader with the IndexWriter should make the changes visible to
  the reader.
  
  I checked solr logs no errors. I see this in logs each time
  'Registered new searcher Searcher@x' even before searches that had
  the stale document. 
  
  I have attached my solrconfig.xml for reference.
  
  Your attachment made it through the mailing list processing.  Most
  don't, I'm surprised.  Some thoughts:
  
  maxBooleanClauses has been set to 40.  This is a lot.  If you
  actually need a setting that high, then you are sending some MASSIVE
  queries, which probably means that your Solr install is exceptionally
  busy running those queries.
  
  If the server is fairly busy, then you should increase maxTime on
  autoCommit.  I use a value of five minutes (30) ... and my server is
  NOT very busy most of the time.  A commit with openSearcher set to false
  is relatively fast, but it still has somewhat heavy CPU, memory, and
  disk I/O resource requirements.
  
  You have autoSoftCommit set to happen after five seconds.  If updates
  happen frequently or run for very long, this is potentially a LOT of
  committing and opening new searchers.  I guess it's better than trying
  for one second, but anything more frequent than once a minute is likely
  to get you into trouble unless the system load is extremely light ...
  but as already discussed, your system load is probably not light.
  
  For the kind of Near Real Time setup you have mentioned, where you want
  to do one or more updates, commit, and then query for the changes, you
  probably should completely remove autoSoftCommit from the config and
  *only* open new searchers with explicit soft commits.  Let autoCommit
  (with a maxTime of 1 to 5 minutes) handle durability concerns.
  
  A lot of pieces in your config file are set to depend on java system
  properties just like the example does, but since we do not know what
  system properties have been set, we can't tell for sure what those parts
  of the config are doing.
  
  Thanks,
  Shawn

Re: WordDelimiterFilter Leading Trailing Special Character

Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
this config:

 fieldType name=text_wd class=solr.TextField
 positionIncrementGap=100
   analyzer
 tokenizer class=solr.WhitespaceTokenizerFactory/
 filter class=solr.WordDelimiterFilterFactory
 protected=protectedword.txt
 preserveOriginal=0 splitOnNumerics=1
 splitOnCaseChange=1
 catenateWords=0 catenateNumbers=0 catenateAll=0
 generateWordParts=1 generateNumberParts=1
 stemEnglishPossessive=1
 types=wdfftypes.txt /
   /analyzer
 /fieldType

Note the protected=x attribute. I suspect if you put Yahoo! into a
file referenced by that attribute, it may survive analysis. I'd be
curious to hear whether it works.

Upayavira

On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
 Question about WordDelimiterFilter. The search behavior that we
 experience
 with WordDelimiterFilter satisfies well, except for the case where there
 is
 a special character either at the leading or trailing end of the term.
 
 For instance:
 
 *‘db’ *  —  Works as expected. Finds all docs with ‘db’.
 *‘p!nk’*  —  Works fine as above.
 
 But on cases when, there is a special character towards the trailing end
 of
 the term, like ‘Yahoo!’
 
 *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the
 special
 character *‘!’* stripped out.  This WordDelimiterFilter behavior is
 documented
 http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
 
 What I would like to have is, the search performed without stripping out
 the leading  trailing special character. Is there a way to achieve this
 behavior with WordDelimiterFilter.
 
 This is current config that we have for the field:
 
 fieldType name=text_wdf class=solr.TextField
 positionIncrementGap=100
 analyzer type=index
 tokenizer class=solr.WhitespaceTokenizerFactory /
 filter class=solr.WordDelimiterFilterFactory
 splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
 catenateWords=0 catenateNumbers=0 catenateAll=0
 preserveOriginal=1
 types=specialchartypes.txt/
 filter class=solr.LowerCaseFilterFactory /
 /analyzer
 analyzer type=query
 tokenizer class=solr.WhitespaceTokenizerFactory /
 filter class=solr.WordDelimiterFilterFactory
 splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0
 catenateWords=0 catenateNumbers=0 catenateAll=0
 preserveOriginal=1
 types=specialchartypes.txt/
 filter class=solr.LowerCaseFilterFactory /
 /analyzer
 /fieldType
 
 
 thanks

Re: Solr Cloud: Duplicate documents in multiple shards

I suspect you can delete a document from the wrong shard by using
update?distrib=false.

I also suspect there are people here who would like to help you debug
this, because it has been reported before, but we haven't yet been able
to see whether it occurred due to human or software error.

Upayavira

On Tue, Jul 21, 2015, at 05:51 AM, mesenthil1 wrote:
 Thanks Erick for clarifying ..
 We are not explicitly setting the compositeId. We are using numShards=5
 alone as part of the server start up. We are using uuid as unique field.
 
 One sample id is :
 
 possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30
 
 
 Not sure how it would have gone to multiple shards.  Do you have any
 suggestion for fixing this. Or we need to completely rebuild the index.
 When the routing key is compositeId, should we explicitly set ! with
 shard
 key? 
 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218296.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: solr blocking and client timeout issue

2015-07-21 Thread Daniel Collins

We have a similar situation: production runs Java 7u10 (yes, we know its
old!), and has custom GC options (G1 works well for us), and a 40Gb heap.
We are a heavy user of NRT (sub-second soft-commits!), so that may be the
common factor here.

Every time we have tried a later Java 7 or Java 8, the heap blows up in no
time at all.  We are still investigating the root cause (we do need to
migrate to Java 8), but I'm thinking that very high commit rates seem to be
the common link here (and its not a common Solr use case I admit).

I don't have any silver bullet answers to offer yet, but my
suspicion/conjecture (no real evidence yet, I admit) is that the frequent
commits are leaving temporary objects around (which they are entitled to
do), and something has changed in the GC in later Java 7/8 which means they
are slower to get rid of those, hence the overall heap usage is higher
under this use case.

@Jeremy, you don't have a lot of head room, but try a higher heap size?
Could you go to 6Gb and see if that at least delays the issue?

Erick is correct though, if you can reduce the commit rate, I'm sure that
would alleviate the issue.

On 21 July 2015 at 05:31, Erick Erickson erickerick...@gmail.com wrote:

 bq: the config is set up per the NRT suggestions in the docs.
 autoSoftCommit every 2 seconds and autoCommit every 10 minutes.

 2 second soft commit is very aggressive, no matter what the NRT
 suggestions are. My first question is whether that's really needed.
 The soft commits should be as long as you can stand. And don't listen
 to  your product manager who says 2 seconds is required, push back
 and answer whether that's really necessary. Most people won't notice
 the difference.

 bq: ...we are noticing a lot higher number of hard commits than usual.

 Is a client somewhere issuing a hard commit? This is rarely
 recommended... And is openSearcher true or false? False is a
 relatively cheap operation, true is quite expensive.

 More than you want to know about hard and soft commits:


 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

 Best,
 Erick

 Best,
 Erick

 On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft jashcr...@edgate.com
 wrote:
  heap is already at 5GB
 
  On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote:
 
  no swapping that I'm seeing, although we are noticing a lot higher
 number
  of hard commits than usual.
 
  the config is set up per the NRT suggestions in the docs.
 autoSoftCommit
  every 2 seconds and autoCommit every 10 minutes.
 
  there have been 463 updates in the past 2 hours, all followed by hard
  commits
 
  INFO  - 2015-07-20 12:26:20.979;
  org.apache.solr.update.DirectUpdateHandler2; start
 
 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
  INFO  - 2015-07-20 12:26:21.021;
 org.apache.solr.core.SolrDeletionPolicy;
  SolrDeletionPolicy.onCommit: commits: num=2
 
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
 /opt/solr/solr/collection1/data/index
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
  maxCacheMB=48.0
 maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696}
 
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
 /opt/solr/solr/collection1/data/index
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
  maxCacheMB=48.0
 maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697}
  INFO  - 2015-07-20 12:26:21.022;
 org.apache.solr.core.SolrDeletionPolicy;
  newest commit generation = 665697
  INFO  - 2015-07-20 12:26:21.026;
  org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
  INFO  - 2015-07-20 12:26:21.026;
  org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
  webapp=/solr path=/update params={omitHeader=falsewt=json}
  {add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752),
  5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328),
  816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0
 50
 
  could that be causing some of the problems?
 
  
  From: Shawn Heisey apa...@elyograg.org
  Sent: Monday, July 20, 2015 11:44 AM
  To: solr-user@lucene.apache.org
  Subject: Re: solr blocking and client timeout issue
 
  On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote:
 
  I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i
  can get production upgraded tonight.
 
  still getting the big GC pauses this morning, even after applying the
  GC tuning options.  Everything was fine throughout the weekend.
 
  My biggest concern is that this instance had been running with no
  issues for almost 2 years, but these GC issues started just last week.
 
  It's very possible that you're simply going to need a larger heap than
  you have needed in the past, either because your index has grown, or
  because your query patterns have changed and now your queries need more
  memory.

Performance of facet contain search in 5.2.1

2015-07-21 Thread Lo Dave

I found that facet contain search take much longer time than facet prefix 
search. Do anyone have idea how to make contain search faster?
org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select 
params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.prefix=duty+of+carerows=1wt=jsonfacet=true_=1437462916852}
 hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] 
webapp=/solr path=/select 
params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.contains=duty+of+carerows=1wt=jsonfacet=truefacet.contains.ignoreCase=true}
 hits=1856 status=0 QTime=10951 
As show above, prefix search take 5 but contain search take 10951
Thanks.

Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content

2015-07-21 Thread Ali Nazemian

Dear Erick,
I found another thing, I did check the number of unique terms for this
field using schema browser, It reported 1683404 number of terms! Does it
exceed the maximum number of unique terms for fcs facet method? I read
somewhere it should be more than 16m does it true?!

Best regards.


On Tue, Jul 21, 2015 at 10:00 AM, Ali Nazemian alinazem...@gmail.com
wrote:

 Dear Erick,

 Actually faceting on this field is not a user wanted application. I did
 that for the purpose of testing the customized normalizer and charfilter
 which I used. Therefore it just used for the purpose of testing. Anyway I
 did some googling on this error and It seems that changing facet method to
 enum works in other similar cases too. I dont know the differences between
 fcs and enum methods on calculating facet behind the scene, but it seems
 that enum works better in my case.

 Best regards.

 On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson erickerick...@gmail.com
 wrote:

 This really seems like an XY problem. _Why_ are you faceting on a
 tokenized field?
 What are you really trying to accomplish? Because faceting on a
 generalized
 content field that's an analyzed field is often A Bad Thing. Try going
 into the
 admin UI Schema Browser for that field, and you'll see how many unique
 terms
 you have in that field. Faceting on that many unique terms is rarely
 useful to the
 end user, so my suspicion is that you're not doing what you think you
 are. Or you
 have an unusual use-case. Either way, we need to understand what use-case
 you're trying to support in order to respond helpfully.

 You say that using facet.enum works, this is very surprising. That method
 uses
 the filterCache to create a bitset for each unique term. Which is totally
 incompatible with the uninverted field error you're reporting, so I
 clearly don't
 understand something about your setup. Are you _sure_?

 Best,
 Erick

 On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian alinazem...@gmail.com
 wrote:
  Dear Toke and Davidphilip,
  Hi,
  The fieldtype text_fa has some custom language specific normalizer and
  charfilter, here is the schema.xml value related for this field:
  fieldType name=text_fa class=solr.TextField
 positionIncrementGap=100
analyzer type=index
  charFilter
  class=com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory/
  tokenizer class=solr.StandardTokenizerFactory/
  filter class=solr.LowerCaseFilterFactory/
  filter
  class=com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory/
  filter class=solr.StopFilterFactory ignoreCase=true
  words=lang/stopwords_fa.txt /
/analyzer
analyzer type=query
  charFilter
  class=com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory/
  tokenizer class=solr.StandardTokenizerFactory/
  filter class=solr.LowerCaseFilterFactory/
  filter
  class=com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory/
  filter class=solr.StopFilterFactory ignoreCase=true
  words=lang/stopwords_fa.txt /
/analyzer
  /fieldType
 
  I did try the facet.method=enum and it works fine. Did you mean that
  actually applying facet on analyzed field is wrong?
 
  Best regards.
 
  On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen t...@statsbiblioteket.dk
 
  wrote:
 
  Ali Nazemian alinazem...@gmail.com wrote:
   I have a collection of 1.6m documents in Solr 5.2.1.
   [...]
   Caused by: java.lang.IllegalStateException: Too many values for
   UnInvertedField faceting on field content
   [...]
   field name=content type=text_fa stored=true indexed=true
   default=noval termVectors=true termPositions=true
   termOffsets=true/
 
  You are hitting an internal limit in Solr. As davidphilip tells you,
 the
  solution is docValues, but they cannot be enabled for text fields. You
 need
  String fields, but the name of your field suggests that you need
  analyzation  tokenization, which cannot be done on String fields.
 
   Would you please help me to solve this problem?
 
  With the information we have, it does not seem to be easy to solve: It
  seems like you want to facet on all terms in your index. As they need
 to be
  String (to use docValues), you would have to do all the splitting on
 white
  space, normalization etc. outside of Solr.
 
  - Toke Eskildsen
 
 
 
 
  --
  A.Nazemian




 --
 A.Nazemian




-- 
A.Nazemian

Re: Performance of facet contain search in 5.2.1

Hi Dave,
generally giving terms in a dictionary, it's much more efficient to run
prefix queries than contain queries.
Talking about using docValues, if I remember well when they are loaded in
memory they are skipList, so you can use two operators on them :

- next() that simply gives you ht next field value for the field doc values
loaded
- advance ( ByteRef term) which jump to the term of the greatest term if
the one searched is missing.

Using the facet prefix we can jump to the point we want and basically
iterate the values that are matching.

To verify the contains, it is simply used on each term in the docValues,
term by term, using the StringUtil.contains() .
How many different unique terms do you have in the index for that field ?

So the difference in performance could make sense ( we are basically moving
to logarithmic to linear to simplify) .

I read the name of the field as facet.field=autocomplete, it's legit to
ask you if you are using faceting to obtain infix auto completion ?
In the case, can you help us, better identifying the problem and maybe
provide you with a better solution ?

Cheers



2015-07-21 9:16 GMT+01:00 Lo Dave dav...@hotmail.com:

 I found that facet contain search take much longer time than facet prefix
 search. Do anyone have idea how to make contain search faster?
 org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select
 params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.prefix=duty+of+carerows=1wt=jsonfacet=true_=1437462916852}
 hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance]
 webapp=/solr path=/select
 params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.contains=duty+of+carerows=1wt=jsonfacet=truefacet.contains.ignoreCase=true}
 hits=1856 status=0 QTime=10951
 As show above, prefix search take 5 but contain search take 10951
 Thanks.





-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England

RE: Programmatically find out if node is overseer

2015-07-21 Thread Markus Jelsma

Hello - this approach not only solves the problem but also allows me to run 
different processing threads on other nodes.

Thanks!
Markus

-Original message-
 From:Chris Hostetter hossman_luc...@fucit.org
 Sent: Saturday 18th July 2015 1:00
 To: solr-user solr-user@lucene.apache.org
 Subject: Re: Programmatically find out if node is overseer

 : Hello - i need to run a thread on a single instance of a cloud so need 
 : to find out if current node is the overseer. I know we can already 
 : programmatically find out if this replica is the leader of a shard via 
 : isLeader(). I have looked everywhere but i cannot find an isOverseer. I 

 At one point, i woked up a utility method to give internal plugins 
 access to an isOverseer() type utility method...

    https://issues.apache.org/jira/browse/SOLR-5823

 ...but ultimately i abandoned this because i was completley forgetting 
 (until much much too late) that there's really no reason to assume that 
 any/all collections will have a single shard on the same node as the 
 overseer -- so having a plugin that only does stuff if it's running on the 
 overseer node is a really bad idea, because it might not run at all. (even 
 if it's configured in every collection)

 what i ultimately wound up doing (see SOLR-5795) is implementing a 
 solution where every core (of each collection configured to want this 
 functionality) has a thread running (a TimedExecutor) which would do 
 nothing unless...
  * my slice is active? (ie: not in the process of being shut down)
  * my slice is 'first' in a sorted list of slices?
  * i am currently the leader of my slice?

 ...that way when the timer goes off ever X minutes, at *most* one thread 
 fires (we might sporadically get no evens triggered if/when there is 
 leader election in progress for the slice that matters)

 the choice of first slice name alphabetically is purely becuase it's 
 something cheap to compute and garunteeded to be unique.

 If you truly want exactly one thread for the entire cluster, regardless of 
 collection, you could do the same basic idea by just adding a my 
 collection is 'first' in a sorted list of collection names?

 -Hoss
 http://www.lucidworks.com/

Re: Use REST API URL to update field

curl is just a command line HTTP client. You can use HTTP POST to send
the JSON that you are mentioning below via any means that works for you
- the file does not need to exist on disk - it just needs to be added to
the body of the POST request. 

I'd say review how to do HTTP POST requests from your chosen programming
language and you should see how to do this.

Upayavira

On Tue, Jul 21, 2015, at 04:12 AM, Zheng Lin Edwin Yeo wrote:
 Hi Shawn,
 
 So it means that if my following is in a text file called update.txt,
 
 {id:testing_0001,
 
 popularity:{inc:1}
 
 This text file must still exist if I use the URL? Or can this information
 in the text file be put directly onto the URL?
 
 Regards,
 Edwin
 
 
 On 20 July 2015 at 22:04, Shawn Heisey apa...@elyograg.org wrote:
 
  On 7/20/2015 2:06 AM, Zheng Lin Edwin Yeo wrote:
   I'm using Solr 5.2.1, and I would like to check, is there a way to update
   certain field by using REST API URL directly instead of using curl?
  
   For example, I would like to increase the popularity field in my index
   each time a user click on the record.
  
   Currently, it can work with the curl command by having this in my text
  file
   to be read by curl (the id is hard-coded here for example purpose)
  
   {id:testing_0001,
  
   popularity:{inc:1}
  
  
   Is there a REST API URL that I can call to achieve the same purpose?
 
  The URL that you would use with curl *IS* the URL that you would use for
  a REST-like call.
 
  Thanks,
  Shawn

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread mesenthil1

Unable to delete by passing distrib=false as well. Also it is difficult to
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the
specific shard?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles

When are you generating the UUID exactly?   If you set the unique ID field on 
an update, and it contains a new UUID, you have effectively created a new 
document.   Just a thought.

-Original Message-
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] 
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

Unable to delete by passing distrib=false as well. Also it is difficult to 
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the 
specific shard?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*

RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles

Also, the function used to generate hashes is 
org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit 
value.   The range of the hash values assigned to each shard are resident in 
Zookeeper.   Since you are using only a single hash component, all 32-bits will 
be used by the entire ID field value.   

I.e. I see no routing delimiter (!) in your example ID value:

possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30

Which isn't required, but it means that documents (logs?) will be distributed 
in a round-robin fashion over the shards.  Not grouped by host or environment 
(if I am reading it right).

You might consider the following:  environment!hostname!UUID

E.g. 
intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30

This way documents from the same host will be grouped together, most likely on 
the same shard.  Further, within the same environment, documents will be 
grouped on the same subset of shards. This will allow client applications to 
set _route_=environment!  or _route_=environment!hostname! and limit 
queries to those shards containing relevant data when the corresponding filter 
queries are applied.

If you were using route delimiters, then the default for a 2-part key (1 
delimiter) is to use 16 bits for each part.  The default for a 3-part key (2 
delimiters) is to use 8-bits each for the 1st 2 parts and 16 bits for the 3rd 
part.   In any case, the high-order bytes of the hash dominate the distribution 
of data.

-Original Message-
From: Reitzel, Charles 
Sent: Tuesday, July 21, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: Duplicate documents in multiple shards

When are you generating the UUID exactly?   If you set the unique ID field on 
an update, and it contains a new UUID, you have effectively created a new 
document.   Just a thought.

-Original Message-
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] 
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

Unable to delete by passing distrib=false as well. Also it is difficult to 
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the 
specific shard?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*

Re: Installing Banana on Solr 5.2.1