Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content

2015-07-21 Thread Ali Nazemian
Dear Yonik,
Hi,
Thanks a lot for your response.
Best regards.

On Tue, Jul 21, 2015 at 5:42 PM, Yonik Seeley  wrote:

> On Tue, Jul 21, 2015 at 3:09 AM, Ali Nazemian 
> wrote:
> > Dear Erick,
> > I found another thing, I did check the number of unique terms for this
> > field using schema browser, It reported 1683404 number of terms! Does it
> > exceed the maximum number of unique terms for "fcs" facet method?
>
> The real limit is not simple since the data is not stored in a simple
> way (it's compressed).
>
> > I read
> > somewhere it should be more than 16m, is that true?!
>
> More like 16MB of delta-coded terms per block of documents (the index
> is split up into 256 blocks for this purpose)
>
> See DocTermOrds.java if you want more details than that.
>
> -Yonik
>



-- 
A.Nazemian


Optimizing Solr indexing over WAN

2015-07-21 Thread Ali Nazemian
Dears,
Hi,
I know that there are lots of tips about how to make Solr indexing
faster. Probably the most important client-side ones are batch indexing
and multi-threaded indexing. There are other important server-side
factors that I don't want to mention here. Anyway, my question is: is
there any best practice for the number of client threads and the batch
size over a WAN network? Since the client and servers are connected over
a WAN, some of the performance conditions such as network latency,
bandwidth, etc. are different from a LAN. Another thing that matters to
me is that document sizes may differ across scenarios. For example, when
indexing web pages, the document size might range from 1KB to 200KB. In
such a case, choosing the batch size by number of documents is probably
not the best way to optimize indexing performance; choosing it by the
batch size in KB/MB would probably be better from the network point of
view. However, from the Solr side, the number of documents matters.
So, to summarize, here is what I am looking for:
1- Is there any best practice available for Solr client-side performance
tuning over a WAN network for the purpose of indexing/reindexing/updating?
Is it different from a LAN network?
2- Which one matters: the number of documents or the total size of the
documents in a batch? (A rough sketch of size-based batching follows below.)
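A minimal sketch of that size-based idea with SolrJ follows; the 1 MB
threshold, the size estimate, and the collection URL are illustrative
assumptions, not recommendations:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SizeBasedBatcher {
    // Illustrative threshold: flush once roughly 1 MB of field data has accumulated.
    private static final long BATCH_BYTES = 1024 * 1024;

    public static void index(Iterable<SolrInputDocument> docs) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        long batchBytes = 0;
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            batchBytes += estimateSize(doc);
            if (batchBytes >= BATCH_BYTES) {
                solr.add(batch);       // one round trip per ~1 MB instead of per document
                batch.clear();
                batchBytes = 0;
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();                 // single commit at the end; let autoCommit handle the rest
        solr.close();
    }

    // Rough estimate of a document's payload: sum of the string lengths of its field values.
    private static long estimateSize(SolrInputDocument doc) {
        long size = 0;
        for (String name : doc.getFieldNames()) {
            Object value = doc.getFieldValue(name);
            if (value != null) {
                size += value.toString().length();
            }
        }
        return size;
    }
}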

Best regards.

-- 
A.Nazemian


issue with query boost using qf and edismax

2015-07-21 Thread sandeep bonkra
Hi,

I am implementing search using SOLR 5.0 and facing a very strange problem.
I have 4 fields, name, address, city and state, in the document, apart
from a unique ID.

My requirement is that it should give me those results first where there is
a match in name, then address, then state, then city.

Scenario 1: When searching *louis*
My query params are something like below
 q: person_full_name:*louis* OR address1:*louis* OR city:*louis* OR
state_code:*louis*
 qf: person_full_name^5.0 address1^0.8 city^0.7 state_code^1.0
 defType: edismax
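(For reference, the plain edismax form of the same request, in which qf is
expected to apply the per-field boosts to a bare term; the values are assumed
from above and this is only a sketch of the intent, not a verified fix:)

 q: *louis*
 defType: edismax
 qf: person_full_name^5.0 address1^0.8 city^0.7 state_code^1.0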

 This is not giving results as per the boosts mentioned in the qf param. It is
giving me results where city gets matched first.
Score is coming as below:

 "explain": {
  "11470307": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n",
  "11470282": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n",
  "11470291": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(city:*louis*), product of:\n
  1.0 = boost\n0.0015872642 = queryNorm\n0.09090909 =
coord(1/11)\n",
  "11470261": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n",
  "11470328": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n",
  "11470331": "\n1.4429675E-4 = (MATCH) sum of:\n  1.4429675E-4 =
(MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n
0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*),
product of:\n1.0 = boost\n0.0015872642 = queryNorm\n
 0.09090909 = coord(1/11)\n"
},


Scenario 2: But when I am matching 2 keywords: *louis cen*


 "explain": {
  "11470286": "\n0.9805807 = (MATCH) product of:\n  1.9611614 =
(MATCH) sum of:\n0.49029034 = (MATCH) max of:\n  0.49029034 =
(MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
 5.0 = boost\n0.09805807 = queryNorm\n0.49029034 =
(MATCH) max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH)
max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH)
max of:\n  0.49029034 = (MATCH)
ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n
5.0 = boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n",
  "11470284": "\n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455
= (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n
0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH)
max of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n",
  "11470232": "\n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455
= (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n
0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH)
max of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max
of:\n  0.078446455 = (MATCH)
ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 =
boost\n0.09805807 = queryNorm\n  0.5 = coord(4/8)\n",
  "11469707": "\n0.15689291 = (MATCH) product of:\n  0.31378582 =
(MATCH) sum of:\n0.078446455 = (MATCH) max of:\n  0.078446455

Running SolrJ from Solr's REST API

2015-07-21 Thread Zheng Lin Edwin Yeo
Hi,

Would like to check: I've created a SolrJ program and exported it as a
Runnable JAR. How do I integrate it with Solr so that I can call this JAR
directly from Solr's REST API?

Currently I can only run it at the command prompt using the command: java -jar
solrj.jar

I'm using Solr 5.2.1.


Regards,
Edwin


Re: WordDelimiterFilter Leading & Trailing Special Character

2015-07-21 Thread Jack Krupansky
You can also use the types attribute to change the type of specific
characters, such as to treat "!" or "&" as an ALPHA character.
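A minimal sketch of such a types file (for example the wdfftypes.txt /
specialchartypes.txt referenced below), assuming the usual "character => TYPE"
syntax accepted by WordDelimiterFilterFactory:

# treat these characters as ordinary letters rather than delimiters
! => ALPHA
& => ALPHA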

-- Jack Krupansky

On Tue, Jul 21, 2015 at 7:43 PM, Sathiya N Sundararajan 
wrote:

> Upayavira,
>
> thanks for the helpful suggestion, that works. I was looking for an option
> to turn off/circumvent that particular WordDelimiterFilter's behavior
> completely. Since our indexes are hundreds of terabytes, every time we
> find a term that needs to be added, it will be a cumbersome process to
> reload all the cores.
>
>
> thanks
>
> On Tue, Jul 21, 2015 at 12:57 AM, Upayavira  wrote:
>
> > Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
> > this config:
> >
> >   <fieldType name="text_wd" class="solr.TextField"
> >      positionIncrementGap="100">
> >     <analyzer>
> >       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >       <filter class="solr.WordDelimiterFilterFactory"
> >          protected="protectedword.txt"
> >          preserveOriginal="0" splitOnNumerics="1"
> >          splitOnCaseChange="1"
> >          catenateWords="0" catenateNumbers="0" catenateAll="0"
> >          generateWordParts="1" generateNumberParts="1"
> >          stemEnglishPossessive="1"
> >          types="wdfftypes.txt" />
> >     </analyzer>
> >   </fieldType>
> >
> > Note the protected="x" attribute. I suspect if you put Yahoo! into a
> > file referenced by that attribute, it may survive analysis. I'd be
> > curious to hear whether it works.
> >
> > Upayavira
> >
> > On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> > > Question about WordDelimiterFilter. The search behavior that we
> > > experience
> > > with WordDelimiterFilter satisfies well, except for the case where
> there
> > > is
> > > a special character either at the leading or trailing end of the term.
> > >
> > > For instance:
> > >
> > > *‘d&b’ *  —>  Works as expected. Finds all docs with ‘d&b’.
> > > *‘p!nk’*  —>  Works fine as above.
> > >
> > > But on cases when, there is a special character towards the trailing
> end
> > > of
> > > the term, like ‘Yahoo!’
> > >
> > > *‘yahoo!’* —> Turns out to be a search for just *‘yahoo’* with the
> > > special
> > > character *‘!’* stripped out.  This WordDelimiterFilter behavior is
> > > documented
> > >
> >
> http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> > >
> > > What I would like to have is, the search performed without stripping
> out
> > > the leading & trailing special character. Is there a way to achieve
> this
> > > behavior with WordDelimiterFilter.
> > >
> > > This is current config that we have for the field:
> > >
> > >  > > positionIncrementGap="100">
> > > 
> > > 
> > >  > > splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > > catenateWords="0" catenateNumbers="0" catenateAll="0"
> > > preserveOriginal="1"
> > > types="specialchartypes.txt"/>
> > > 
> > > 
> > > 
> > > 
> > >  > > splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > > catenateWords="0" catenateNumbers="0" catenateAll="0"
> > > preserveOriginal="1"
> > > types="specialchartypes.txt"/>
> > > 
> > > 
> > > 
> > >
> > >
> > > thanks
> >
>


Re: WordDelimiterFilter Leading & Trailing Special Character

2015-07-21 Thread Sathiya N Sundararajan
Upayavira,

thanks for the helpful suggestion, that works. I was looking for an option
to turn off/circumvent that particular WordDelimiterFilter's behavior
completely. Since our indexes are hundreds of terabytes, every time we
find a term that needs to be added, it will be a cumbersome process to
reload all the cores.


thanks

On Tue, Jul 21, 2015 at 12:57 AM, Upayavira  wrote:

> Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
> this config:
>
>   <fieldType name="text_wd" class="solr.TextField"
>      positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory"
>          protected="protectedword.txt"
>          preserveOriginal="0" splitOnNumerics="1"
>          splitOnCaseChange="1"
>          catenateWords="0" catenateNumbers="0" catenateAll="0"
>          generateWordParts="1" generateNumberParts="1"
>          stemEnglishPossessive="1"
>          types="wdfftypes.txt" />
>     </analyzer>
>   </fieldType>
>
> Note the protected="x" attribute. I suspect if you put Yahoo! into a
> file referenced by that attribute, it may survive analysis. I'd be
> curious to hear whether it works.
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> > Question about WordDelimiterFilter. The search behavior that we
> > experience
> > with WordDelimiterFilter satisfies well, except for the case where there
> > is
> > a special character either at the leading or trailing end of the term.
> >
> > For instance:
> >
> > *‘d&b’ *  —>  Works as expected. Finds all docs with ‘d&b’.
> > *‘p!nk’*  —>  Works fine as above.
> >
> > But on cases when, there is a special character towards the trailing end
> > of
> > the term, like ‘Yahoo!’
> >
> > *‘yahoo!’* —> Turns out to be a search for just *‘yahoo’* with the
> > special
> > character *‘!’* stripped out.  This WordDelimiterFilter behavior is
> > documented
> >
> http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> >
> > What I would like to have is, the search performed without stripping out
> > the leading & trailing special character. Is there a way to achieve this
> > behavior with WordDelimiterFilter.
> >
> > This is current config that we have for the field:
> >
> >  > positionIncrementGap="100">
> > 
> > 
> >  > splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > catenateWords="0" catenateNumbers="0" catenateAll="0"
> > preserveOriginal="1"
> > types="specialchartypes.txt"/>
> > 
> > 
> > 
> > 
> >  > splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> > catenateWords="0" catenateNumbers="0" catenateAll="0"
> > preserveOriginal="1"
> > types="specialchartypes.txt"/>
> > 
> > 
> > 
> >
> >
> > thanks
>


Re: IntelliJ setup

2015-07-21 Thread Andrew Musselman
Bingo, thanks!

On Tue, Jul 21, 2015 at 4:12 PM, Konstantin Gribov 
wrote:

> Try "invalidate caches and restart" in IDEA, remove .idea directory in
> lucene-solr dir. After that run "ant idea" and re-open project.
>
> Also, you have to at least close the project, run "ant idea" and re-open it
> if switching between branches that have diverged too much (e.g., 4.10 and 5_x).
>
> Tue, 21 Jul 2015 at 21:53, Andrew Musselman:
>
> > I followed the instructions here
> > https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including
> `ant
> > idea`, but I'm still not getting the links in solr classes and methods;
> do
> > I need to add libraries, or am I missing something else?
> >
> > Thanks!
> >
> --
> Best regards,
> Konstantin Gribov
>


Re: IntelliJ setup

2015-07-21 Thread Konstantin Gribov
Try "invalidate caches and restart" in IDEA, remove .idea directory in
lucene-solr dir. After that run "ant idea" and re-open project.

Also, you have to at least close the project, run "ant idea" and re-open it
if switching between branches that have diverged too much (e.g., 4.10 and 5_x).

Tue, 21 Jul 2015 at 21:53, Andrew Musselman:

> I followed the instructions here
> https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant
> idea`, but I'm still not getting the links in solr classes and methods; do
> I need to add libraries, or am I missing something else?
>
> Thanks!
>
-- 
Best regards,
Konstantin Gribov


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
Which can only happen if I post it to a web service, and won't happen if I
do it through config?

On Tue, Jul 21, 2015 at 2:19 PM, Upayavira  wrote:

> yes, unless it has been added consciously as a separate field.
>
> On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> > Thanks, so by the time we would get to an Analyzer the file path is
> > forgotten?
> >
> > https://cwiki.apache.org/confluence/display/solr/Analyzers
> >
> > On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:
> >
> > > Solr generally does not interact with the file system in that way (with
> > > the exception of the DIH).
> > >
> > > It is the job of the code that pushes a file to Solr to process the
> > > filename and send that along with the request.
> > >
> > > See here for more info:
> > >
> > >
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > >
> > > You could provide literal.filename=blah/blah
> > >
> > > Upayavira
> > >
> > >
> > > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > > I'm not sure, it's a remote team but will get more info.  For now,
> > > > assuming
> > > > that a certain directory is specified, like "/user/andrew/", and a
> regex
> > > > is
> > > > applied to capture anything two directories below matching
> "*/*/*.pdf".
> > > >
> > > > Would there be a way to capture the wild-carded values and index
> them as
> > > > fields?
> > > >
> > > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> > > >
> > > > > Keeping to the user list (the right place for this question).
> > > > >
> > > > > More information is needed here - how are you getting these
> documents
> > > > > into Solr? Are you posting them to /update/extract? Or using DIH,
> or?
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > > Dear user and dev lists,
> > > > > >
> > > > > > We are loading files from a directory and would like to index a
> > > portion
> > > > > > of
> > > > > > each file path as a field as well as the text inside the file.
> > > > > >
> > > > > > E.g., on HDFS we have this file path:
> > > > > >
> > > > > > /user/andrew/1234/1234/file.pdf
> > > > > >
> > > > > > And we would like the "1234" token parsed from the file path and
> > > indexed
> > > > > > as
> > > > > > an additional field that can be searched on.
> > > > > >
> > > > > > From my initial searches I can't see how to do this easily, so
> would
> > > I
> > > > > > need
> > > > > > to write some custom code, or a plugin?
> > > > > >
> > > > > > Thanks!
> > > > >
> > >
>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
yes, unless it has been added consciously as a separate field.

On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote:
> Thanks, so by the time we would get to an Analyzer the file path is
> forgotten?
> 
> https://cwiki.apache.org/confluence/display/solr/Analyzers
> 
> On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:
> 
> > Solr generally does not interact with the file system in that way (with
> > the exception of the DIH).
> >
> > It is the job of the code that pushes a file to Solr to process the
> > filename and send that along with the request.
> >
> > See here for more info:
> >
> > https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >
> > You could provide literal.filename=blah/blah
> >
> > Upayavira
> >
> >
> > On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > > I'm not sure, it's a remote team but will get more info.  For now,
> > > assuming
> > > that a certain directory is specified, like "/user/andrew/", and a regex
> > > is
> > > applied to capture anything two directories below matching "*/*/*.pdf".
> > >
> > > Would there be a way to capture the wild-carded values and index them as
> > > fields?
> > >
> > > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> > >
> > > > Keeping to the user list (the right place for this question).
> > > >
> > > > More information is needed here - how are you getting these documents
> > > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > > >
> > > > Upayavira
> > > >
> > > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > > Dear user and dev lists,
> > > > >
> > > > > We are loading files from a directory and would like to index a
> > portion
> > > > > of
> > > > > each file path as a field as well as the text inside the file.
> > > > >
> > > > > E.g., on HDFS we have this file path:
> > > > >
> > > > > /user/andrew/1234/1234/file.pdf
> > > > >
> > > > > And we would like the "1234" token parsed from the file path and
> > indexed
> > > > > as
> > > > > an additional field that can be searched on.
> > > > >
> > > > > From my initial searches I can't see how to do this easily, so would
> > I
> > > > > need
> > > > > to write some custom code, or a plugin?
> > > > >
> > > > > Thanks!
> > > >
> >


Re: Tips for faster indexing

2015-07-21 Thread Fadi Mohsen
In Java: UUID.randomUUID();

That is what I'm using.
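For example, the generateID() helper discussed further down in this thread
could be reduced to the following sketch; UUID.randomUUID() is effectively
collision-free, unlike a millisecond timestamp shared by documents created in
the same instant:

import java.util.UUID;

public static String generateID() {
    // type-4 (random) UUID: no clock-precision issues, no duplicate ids
    return UUID.randomUUID().toString();
}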

Regards

> On 21 Jul 2015, at 22:38, Vineeth Dasaraju  wrote:
> 
> Hi Upayavira,
> 
> I guess that is the problem. I am currently using a function for generating
> an ID. It takes the current date and time to milliseconds and generates the
> id. This is the function.
> 
> public static String generateID(){
>Date dNow = new Date();
>SimpleDateFormat ft = new SimpleDateFormat("yyMMddhhmmssMs");
>String datetime = ft.format(dNow);
>return datetime;
>}
> 
> 
> I believe that despite having a millisecond precision in the id generation,
> multiple objects are being assigned the same ID. Can you suggest a better
> way to generate the ID?
> 
> Regards,
> Vineeth
> 
> 
>> On Tue, Jul 21, 2015 at 1:29 PM, Upayavira  wrote:
>> 
>> Are you making sure that every document has a unique ID? Index into an
>> empty Solr, then look at your maxdocs vs numdocs. If they are different
>> (maxdocs is higher) then some of your documents have been deleted,
>> meaning some were overwritten.
>> 
>> That might be a place to look.
>> 
>> Upayavira
>> 
>>> On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
>>> I can confirm this behavior, seen when sending json docs in batch, never
>>> happens when sending one by one, but sporadic when sending batches.
>>> 
>>> Like if sole/jetty drops couple of documents out of the batch.
>>> 
>>> Regards
>>> 
 On 21 Jul 2015, at 21:38, Vineeth Dasaraju 
>> wrote:
 
 Hi,
 
 Thank You Erick for your inputs. I tried creating batches of 1000
>> objects
 and indexing it to solr. The performance is way better than before but
>> I
 find that number of indexed documents that is shown in the dashboard is
 lesser than the number of documents that I had actually indexed through
 solrj. My code is as follows:
 
 private static String SOLR_SERVER_URL = "
>> http://localhost:8983/solr/newcore
 ";
 private static String JSON_FILE_PATH =
>> "/home/vineeth/week1_fixed.json";
 private static JSONParser parser = new JSONParser();
 private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
 
 public static void main(String[] args) throws IOException,
 SolrServerException, ParseException {
   File file = new File(JSON_FILE_PATH);
   Scanner scn=new Scanner(file,"UTF-8");
   JSONObject object;
   int i = 0;
   Collection batch = new
 ArrayList();
   while(scn.hasNext()){
   object= (JSONObject) parser.parse(scn.nextLine());
   SolrInputDocument doc = indexJSON(object);
   batch.add(doc);
   if(i%1000==0){
   System.out.println("Indexed " + (i+1) + " objects." );
   solr.add(batch);
   batch = new ArrayList();
   }
   i++;
   }
   solr.add(batch);
   solr.commit();
   System.out.println("Indexed " + (i+1) + " objects." );
 }
 
 public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
 ParseException, IOException, SolrServerException {
   Collection batch = new
 ArrayList();
 
   SolrInputDocument mainEvent = new SolrInputDocument();
   mainEvent.addField("id", generateID());
   mainEvent.addField("RawEventMessage",
>> jsonOBJ.get("RawEventMessage"));
   mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
   mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
   mainEvent.addField("EventMessageType",
>> jsonOBJ.get("EventMessageType"));
   mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
   mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
 
   Object obj = parser.parse(jsonOBJ.get("User").toString());
   JSONObject userObj = (JSONObject) obj;
 
   SolrInputDocument childUserEvent = new SolrInputDocument();
   childUserEvent.addField("id", generateID());
   childUserEvent.addField("User", userObj.get("User"));
 
   obj = parser.parse(jsonOBJ.get("EventDescription").toString());
   JSONObject eventdescriptionObj = (JSONObject) obj;
 
   SolrInputDocument childEventDescEvent = new SolrInputDocument();
   childEventDescEvent.addField("id", generateID());
   childEventDescEvent.addField("EventApplicationName",
 eventdescriptionObj.get("EventApplicationName"));
   childEventDescEvent.addField("Query",
>> eventdescriptionObj.get("Query"));
 
   obj=
>> JSONValue.parse(eventdescriptionObj.get("Information").toString());
   JSONArray informationArray = (JSONArray) obj;
 
   for(int i = 0; i < informationArray.size(); i++){
   JSONObject domain = (JSONObject) informationArray.get(i);
 
   SolrInputDocument domainDoc = new SolrInputDocument();
   domainDoc.addField("id", generateID());
   domainDoc.addField("domainName", domain.get("domainN

Re: Issue with using createNodeSet in Solr Cloud

2015-07-21 Thread Savvas Andreas Moysidis
Ah, nice tip, thanks! This could also make scripts more portable.

Cheers,
Savvas

On 21 July 2015 at 08:40, Upayavira  wrote:

> Note, when you start up the instances, you can pass in a hostname to use
> instead of the IP address. If you are using bin/solr (which you should
> be!!) then you can use bin/solr -h my-host-name and that'll be used in
> place of the IP.
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote:
> > Glad you found a solution
> >
> > Best,
> > Erick
> >
> > On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis
> >  wrote:
> > > Erick, spot on!
> > >
> > > The nodes had been registered in zookeeper under my network
> interface's IP
> > > address...after specifying those the command worked just fine.
> > >
> > > It was indeed the thing I thought was true that wasn't... :)
> > >
> > > Many thanks,
> > > Savvas
> > >
> > > On 18 July 2015 at 20:47, Erick Erickson 
> wrote:
> > >
> > >> P.S.
> > >>
> > >> "It ain't the things ya don't know that'll kill ya, it's the things ya
> > >> _do_ know that ain't so"...
> > >>
> > >> On Sat, Jul 18, 2015 at 12:46 PM, Erick Erickson
> > >>  wrote:
> > >> > Could you post your clusterstate.json? Or at least the "live nodes"
> > >> > section of your ZK config? (adminUI>>cloud>>tree>>live_nodes. The
> > >> > addresses of my nodes are things like 192.168.1.201:8983_solr. I'm
> > >> > wondering if you're taking your node names from the information ZK
> > >> > records or assuming it's 127.0.0.1
> > >> >
> > >> > On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis
> > >> >  wrote:
> > >> >> Thanks Eric,
> > >> >>
> > >> >> The strange thing is that although I have set the log level to
> "ALL" I
> > >> see
> > >> >> no error messages in the logs (apart from the line saying that the
> > >> response
> > >> >> is a 400 one).
> > >> >>
> > >> >> I'm quite confident the configset does exist as the collection gets
> > >> created
> > >> >> fine if I don't specify the createNodeSet param.
> > >> >>
> > >> >> Complete mystery..! I'll keep on troubleshooting and report back
> with my
> > >> >> findings.
> > >> >>
> > >> >> Cheers,
> > >> >> Savvas
> > >> >>
> > >> >> On 17 July 2015 at 02:14, Erick Erickson 
> > >> wrote:
> > >> >>
> > >> >>> There were a couple of cases where the "no live servers" was being
> > >> >>> returned when the error was something completely different. Does
> the
> > >> >>> Solr log show something more useful? And are you sure you have a
> > >> >>> configset named collection_A?
> > >> >>>
> > >> >>> 'cause this works (admittedly on 5.x) fine for me, and I'm quite
> sure
> > >> >>> there are bunches of automated tests that would be failing so I
> > >> >>> suspect it's just a misleading error being returned.
> > >> >>>
> > >> >>> Best,
> > >> >>> Erick
> > >> >>>
> > >> >>> On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis
> > >> >>>  wrote:
> > >> >>> > Hello There,
> > >> >>> >
> > >> >>> > I am trying to use the createNodeSet parameter when creating a
> new
> > >> >>> > collection but I'm getting an error when doing so.
> > >> >>> >
> > >> >>> > More specifically, I have four Solr instances running locally in
> > >> separate
> > >> >>> > JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985,
> 127.0.0.1:8986
> > >> )
> > >> >>> and a
> > >> >>> > standalone Zookeeper instance which all Solr instances point
> to. The
> > >> four
> > >> >>> > Solr instances have no collections added to them and are all up
> and
> > >> >>> running
> > >> >>> > (I can access the admin page in all of them).
> > >> >>> >
> > >> >>> > Now, I want to create a collections in only two of these four
> > >> instances (
> > >> >>> > 127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance
> with the
> > >> >>> > following URL:
> > >> >>> >
> > >> >>> >
> > >> >>>
> > >>
> http://localhost:8983/solr/admin/collections?action=CREATE&name=collection_A&numShards=1&replicationFactor=2&maxShardsPerNode=1&createNodeSet=127.0.0.1:8983_solr,127.0.0.1:8984_solr&collection.configName=collection_A
> > >> >>> >
> > >> >>> > I am getting the following response:
> > >> >>> >
> > >> >>> > 
> > >> >>> > 
> > >> >>> > 400
> > >> >>> > 3503
> > >> >>> > 
> > >> >>> > 
> > >> >>> >
> > >> >>>
> > >>
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> > >> >>> > Cannot create collection collection_A. No live Solr-instances
> among
> > >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
> > >> >>> 127.0.0.1:8984
> > >> >>> > _solr
> > >> >>> > 
> > >> >>> > 
> > >> >>> > 
> > >> >>> > Cannot create collection collection_A. No live Solr-instances
> among
> > >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
> > >> >>> 127.0.0.1:8984
> > >> >>> > _solr
> > >> >>> > 
> > >> >>> > 400
> > >> >>> > 
> > >> >>> > 
> > >> >>> > 
> > >> >>> > Cannot create collection collection_A. No live Solr-instances
> among
> > >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,

Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
Thanks, so by the time we would get to an Analyzer the file path is
forgotten?

https://cwiki.apache.org/confluence/display/solr/Analyzers

On Tue, Jul 21, 2015 at 1:27 PM, Upayavira  wrote:

> Solr generally does not interact with the file system in that way (with
> the exception of the DIH).
>
> It is the job of the code that pushes a file to Solr to process the
> filename and send that along with the request.
>
> See here for more info:
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
> You could provide literal.filename=blah/blah
>
> Upayavira
>
>
> On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> > I'm not sure, it's a remote team but will get more info.  For now,
> > assuming
> > that a certain directory is specified, like "/user/andrew/", and a regex
> > is
> > applied to capture anything two directories below matching "*/*/*.pdf".
> >
> > Would there be a way to capture the wild-carded values and index them as
> > fields?
> >
> > On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> >
> > > Keeping to the user list (the right place for this question).
> > >
> > > More information is needed here - how are you getting these documents
> > > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> > >
> > > Upayavira
> > >
> > > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > > Dear user and dev lists,
> > > >
> > > > We are loading files from a directory and would like to index a
> portion
> > > > of
> > > > each file path as a field as well as the text inside the file.
> > > >
> > > > E.g., on HDFS we have this file path:
> > > >
> > > > /user/andrew/1234/1234/file.pdf
> > > >
> > > > And we would like the "1234" token parsed from the file path and
> indexed
> > > > as
> > > > an additional field that can be searched on.
> > > >
> > > > From my initial searches I can't see how to do this easily, so would
> I
> > > > need
> > > > to write some custom code, or a plugin?
> > > >
> > > > Thanks!
> > >
>


Re: Tips for faster indexing

2015-07-21 Thread Vineeth Dasaraju
Hi Upayavira,

I guess that is the problem. I am currently using a function for generating
an ID. It takes the current date and time to milliseconds and generates the
id. This is the function.

public static String generateID(){
Date dNow = new Date();
SimpleDateFormat ft = new SimpleDateFormat("yyMMddhhmmssMs");
String datetime = ft.format(dNow);
return datetime;
}


I believe that despite having a millisecond precision in the id generation,
multiple objects are being assigned the same ID. Can you suggest a better
way to generate the ID?

Regards,
Vineeth


On Tue, Jul 21, 2015 at 1:29 PM, Upayavira  wrote:

> Are you making sure that every document has a unique ID? Index into an
> empty Solr, then look at your maxdocs vs numdocs. If they are different
> (maxdocs is higher) then some of your documents have been deleted,
> meaning some were overwritten.
>
> That might be a place to look.
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
> > I can confirm this behavior, seen when sending json docs in batch, never
> > happens when sending one by one, but sporadic when sending batches.
> >
> > Like if sole/jetty drops couple of documents out of the batch.
> >
> > Regards
> >
> > > On 21 Jul 2015, at 21:38, Vineeth Dasaraju 
> wrote:
> > >
> > > Hi,
> > >
> > > Thank You Erick for your inputs. I tried creating batches of 1000
> objects
> > > and indexing it to solr. The performance is way better than before but
> I
> > > find that number of indexed documents that is shown in the dashboard is
> > > lesser than the number of documents that I had actually indexed through
> > > solrj. My code is as follows:
> > >
> > > private static String SOLR_SERVER_URL = "
> http://localhost:8983/solr/newcore
> > > ";
> > > private static String JSON_FILE_PATH =
> "/home/vineeth/week1_fixed.json";
> > > private static JSONParser parser = new JSONParser();
> > > private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
> > >
> > > public static void main(String[] args) throws IOException,
> > > SolrServerException, ParseException {
> > >File file = new File(JSON_FILE_PATH);
> > >Scanner scn=new Scanner(file,"UTF-8");
> > >JSONObject object;
> > >int i = 0;
> > >Collection batch = new
> > > ArrayList();
> > >while(scn.hasNext()){
> > >object= (JSONObject) parser.parse(scn.nextLine());
> > >SolrInputDocument doc = indexJSON(object);
> > >batch.add(doc);
> > >if(i%1000==0){
> > >System.out.println("Indexed " + (i+1) + " objects." );
> > >solr.add(batch);
> > >batch = new ArrayList();
> > >}
> > >i++;
> > >}
> > >solr.add(batch);
> > >solr.commit();
> > >System.out.println("Indexed " + (i+1) + " objects." );
> > > }
> > >
> > > public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
> > > ParseException, IOException, SolrServerException {
> > >Collection batch = new
> > > ArrayList();
> > >
> > >SolrInputDocument mainEvent = new SolrInputDocument();
> > >mainEvent.addField("id", generateID());
> > >mainEvent.addField("RawEventMessage",
> jsonOBJ.get("RawEventMessage"));
> > >mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> > >mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> > >mainEvent.addField("EventMessageType",
> jsonOBJ.get("EventMessageType"));
> > >mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> > >mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> > >
> > >Object obj = parser.parse(jsonOBJ.get("User").toString());
> > >JSONObject userObj = (JSONObject) obj;
> > >
> > >SolrInputDocument childUserEvent = new SolrInputDocument();
> > >childUserEvent.addField("id", generateID());
> > >childUserEvent.addField("User", userObj.get("User"));
> > >
> > >obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> > >JSONObject eventdescriptionObj = (JSONObject) obj;
> > >
> > >SolrInputDocument childEventDescEvent = new SolrInputDocument();
> > >childEventDescEvent.addField("id", generateID());
> > >childEventDescEvent.addField("EventApplicationName",
> > > eventdescriptionObj.get("EventApplicationName"));
> > >childEventDescEvent.addField("Query",
> eventdescriptionObj.get("Query"));
> > >
> > >obj=
> JSONValue.parse(eventdescriptionObj.get("Information").toString());
> > >JSONArray informationArray = (JSONArray) obj;
> > >
> > >for(int i = 0; i < informationArray.size(); i++){
> > >JSONObject domain = (JSONObject) informationArray.get(i);
> > >
> > >SolrInputDocument domainDoc = new SolrInputDocument();
> > >domainDoc.addField("id", generateID());
> > >domainDoc.addField("domainName", domain.get("domainName"));
> > >
> > >String s = domain.get("columns").toString();
> > > 

Re: Tips for faster indexing

2015-07-21 Thread Upayavira
Are you making sure that every document has a unique ID? Index into an
empty Solr, then look at your maxdocs vs numdocs. If they are different
(maxdocs is higher) then some of your documents have been deleted,
meaning some were overwritten.

That might be a place to look.

Upayavira

On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote:
> I can confirm this behavior, seen when sending json docs in batch, never
> happens when sending one by one, but sporadic when sending batches.
> 
> Like if sole/jetty drops couple of documents out of the batch.
> 
> Regards
> 
> > On 21 Jul 2015, at 21:38, Vineeth Dasaraju  wrote:
> > 
> > Hi,
> > 
> > Thank You Erick for your inputs. I tried creating batches of 1000 objects
> > and indexing it to solr. The performance is way better than before but I
> > find that number of indexed documents that is shown in the dashboard is
> > lesser than the number of documents that I had actually indexed through
> > solrj. My code is as follows:
> > 
> > private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore
> > ";
> > private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
> > private static JSONParser parser = new JSONParser();
> > private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
> > 
> > public static void main(String[] args) throws IOException,
> > SolrServerException, ParseException {
> >File file = new File(JSON_FILE_PATH);
> >Scanner scn=new Scanner(file,"UTF-8");
> >JSONObject object;
> >int i = 0;
> >Collection batch = new
> > ArrayList();
> >while(scn.hasNext()){
> >object= (JSONObject) parser.parse(scn.nextLine());
> >SolrInputDocument doc = indexJSON(object);
> >batch.add(doc);
> >if(i%1000==0){
> >System.out.println("Indexed " + (i+1) + " objects." );
> >solr.add(batch);
> >batch = new ArrayList();
> >}
> >i++;
> >}
> >solr.add(batch);
> >solr.commit();
> >System.out.println("Indexed " + (i+1) + " objects." );
> > }
> > 
> > public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
> > ParseException, IOException, SolrServerException {
> >Collection batch = new
> > ArrayList();
> > 
> >SolrInputDocument mainEvent = new SolrInputDocument();
> >mainEvent.addField("id", generateID());
> >mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
> >mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
> >mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
> >mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
> >mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
> >mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> > 
> >Object obj = parser.parse(jsonOBJ.get("User").toString());
> >JSONObject userObj = (JSONObject) obj;
> > 
> >SolrInputDocument childUserEvent = new SolrInputDocument();
> >childUserEvent.addField("id", generateID());
> >childUserEvent.addField("User", userObj.get("User"));
> > 
> >obj = parser.parse(jsonOBJ.get("EventDescription").toString());
> >JSONObject eventdescriptionObj = (JSONObject) obj;
> > 
> >SolrInputDocument childEventDescEvent = new SolrInputDocument();
> >childEventDescEvent.addField("id", generateID());
> >childEventDescEvent.addField("EventApplicationName",
> > eventdescriptionObj.get("EventApplicationName"));
> >childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> > 
> >obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
> >JSONArray informationArray = (JSONArray) obj;
> > 
> >for(int i = 0; i < informationArray.size(); i++){
> >JSONObject domain = (JSONObject) informationArray.get(i);
> > 
> >SolrInputDocument domainDoc = new SolrInputDocument();
> >domainDoc.addField("id", generateID());
> >domainDoc.addField("domainName", domain.get("domainName"));
> > 
> >String s = domain.get("columns").toString();
> >obj= JSONValue.parse(s);
> >JSONArray ColumnsArray = (JSONArray) obj;
> > 
> >SolrInputDocument columnsDoc = new SolrInputDocument();
> >columnsDoc.addField("id", generateID());
> > 
> >for(int j = 0; j < ColumnsArray.size(); j++){
> >JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
> >SolrInputDocument columnDoc = new SolrInputDocument();
> >columnDoc.addField("id", generateID());
> >columnDoc.addField("movieName", ColumnsObj.get("movieName"));
> >columnsDoc.addChildDocument(columnDoc);
> >}
> >domainDoc.addChildDocument(columnsDoc);
> >childEventDescEvent.addChildDocument(domainDoc);
> >}
> > 
> >mainEvent.addChildDocument(childEventDescEvent);
> >mainEvent.addChildDocument(childUserEvent);
> >return mainEvent;
> > }
> > 

Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
Solr generally does not interact with the file system in that way (with
the exception of the DIH).

It is the job of the code that pushes a file to Solr to process the
filename and send that along with the request.

See here for more info:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

You could provide literal.filename=blah/blah
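For illustration, a minimal SolrJ sketch of that approach: parse the token out
of the path on the client and pass it as a literal.* field alongside
literal.filename. The field names, the regex, and the core URL are assumptions
for the example, and the target fields must of course exist in the schema:

import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PostPdfWithPathField {
    public static void main(String[] args) throws Exception {
        String path = "/user/andrew/1234/1234/file.pdf";

        // Capture the directory token two levels above the file name, e.g. "1234".
        Matcher m = Pattern.compile("^.*/([^/]+)/[^/]+/[^/]+\\.pdf$").matcher(path);
        String dirToken = m.matches() ? m.group(1) : "";

        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File(path), "application/pdf");
        req.setParam("literal.id", path);             // hypothetical: use the path as the id
        req.setParam("literal.filename", path);
        req.setParam("literal.dir_token", dirToken);  // hypothetical field holding the captured "1234"
        solr.request(req);
        solr.commit();
        solr.close();
    }
}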

Upayavira


On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote:
> I'm not sure, it's a remote team but will get more info.  For now,
> assuming
> that a certain directory is specified, like "/user/andrew/", and a regex
> is
> applied to capture anything two directories below matching "*/*/*.pdf".
> 
> Would there be a way to capture the wild-carded values and index them as
> fields?
> 
> On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:
> 
> > Keeping to the user list (the right place for this question).
> >
> > More information is needed here - how are you getting these documents
> > into Solr? Are you posting them to /update/extract? Or using DIH, or?
> >
> > Upayavira
> >
> > On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > > Dear user and dev lists,
> > >
> > > We are loading files from a directory and would like to index a portion
> > > of
> > > each file path as a field as well as the text inside the file.
> > >
> > > E.g., on HDFS we have this file path:
> > >
> > > /user/andrew/1234/1234/file.pdf
> > >
> > > And we would like the "1234" token parsed from the file path and indexed
> > > as
> > > an additional field that can be searched on.
> > >
> > > From my initial searches I can't see how to do this easily, so would I
> > > need
> > > to write some custom code, or a plugin?
> > >
> > > Thanks!
> >


Re: Tips for faster indexing

2015-07-21 Thread solr . user . 1507
I can confirm this behavior: it is seen when sending JSON docs in batches, never
happens when sending one by one, but sporadically when sending batches.

It's as if Solr/Jetty drops a couple of documents out of the batch.

Regards

> On 21 Jul 2015, at 21:38, Vineeth Dasaraju  wrote:
> 
> Hi,
> 
> Thank You Erick for your inputs. I tried creating batches of 1000 objects
> and indexing it to solr. The performance is way better than before but I
> find that number of indexed documents that is shown in the dashboard is
> lesser than the number of documents that I had actually indexed through
> solrj. My code is as follows:
> 
> private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore
> ";
> private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
> private static JSONParser parser = new JSONParser();
> private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);
> 
> public static void main(String[] args) throws IOException,
> SolrServerException, ParseException {
>File file = new File(JSON_FILE_PATH);
>Scanner scn=new Scanner(file,"UTF-8");
>JSONObject object;
>int i = 0;
>Collection batch = new
> ArrayList();
>while(scn.hasNext()){
>object= (JSONObject) parser.parse(scn.nextLine());
>SolrInputDocument doc = indexJSON(object);
>batch.add(doc);
>if(i%1000==0){
>System.out.println("Indexed " + (i+1) + " objects." );
>solr.add(batch);
>batch = new ArrayList();
>}
>i++;
>}
>solr.add(batch);
>solr.commit();
>System.out.println("Indexed " + (i+1) + " objects." );
> }
> 
> public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
> ParseException, IOException, SolrServerException {
>Collection batch = new
> ArrayList();
> 
>SolrInputDocument mainEvent = new SolrInputDocument();
>mainEvent.addField("id", generateID());
>mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
>mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
>mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
>mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
>mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
>mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));
> 
>Object obj = parser.parse(jsonOBJ.get("User").toString());
>JSONObject userObj = (JSONObject) obj;
> 
>SolrInputDocument childUserEvent = new SolrInputDocument();
>childUserEvent.addField("id", generateID());
>childUserEvent.addField("User", userObj.get("User"));
> 
>obj = parser.parse(jsonOBJ.get("EventDescription").toString());
>JSONObject eventdescriptionObj = (JSONObject) obj;
> 
>SolrInputDocument childEventDescEvent = new SolrInputDocument();
>childEventDescEvent.addField("id", generateID());
>childEventDescEvent.addField("EventApplicationName",
> eventdescriptionObj.get("EventApplicationName"));
>childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));
> 
>obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
>JSONArray informationArray = (JSONArray) obj;
> 
>for(int i = 0; i < informationArray.size(); i++){
>JSONObject domain = (JSONObject) informationArray.get(i);
> 
>SolrInputDocument domainDoc = new SolrInputDocument();
>domainDoc.addField("id", generateID());
>domainDoc.addField("domainName", domain.get("domainName"));
> 
>String s = domain.get("columns").toString();
>obj= JSONValue.parse(s);
>JSONArray ColumnsArray = (JSONArray) obj;
> 
>SolrInputDocument columnsDoc = new SolrInputDocument();
>columnsDoc.addField("id", generateID());
> 
>for(int j = 0; j < ColumnsArray.size(); j++){
>JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
>SolrInputDocument columnDoc = new SolrInputDocument();
>columnDoc.addField("id", generateID());
>columnDoc.addField("movieName", ColumnsObj.get("movieName"));
>columnsDoc.addChildDocument(columnDoc);
>}
>domainDoc.addChildDocument(columnsDoc);
>childEventDescEvent.addChildDocument(domainDoc);
>}
> 
>mainEvent.addChildDocument(childEventDescEvent);
>mainEvent.addChildDocument(childUserEvent);
>return mainEvent;
> }
> 
> I would be grateful if you could let me know what I am missing.
> 
> On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson 
> wrote:
> 
>> First thing is it looks like you're only sending one document at a
>> time, perhaps with child objects. This is not optimal at all. I
>> usually batch my docs up in groups of 1,000, and there is anecdotal
>> evidence that there may (depending on the docs) be some gains above
>> that number. Gotta balance the batch size off against how big the docs
>> are of course.
>> 
>> Assuming that you really are calling this method for one doc

Re: Tips for faster indexing

2015-07-21 Thread Vineeth Dasaraju
Hi,

Thank You Erick for your inputs. I tried creating batches of 1000 objects
and indexing it to solr. The performance is way better than before but I
find that number of indexed documents that is shown in the dashboard is
lesser than the number of documents that I had actually indexed through
solrj. My code is as follows:

private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore
";
private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException,
SolrServerException, ParseException {
File file = new File(JSON_FILE_PATH);
Scanner scn=new Scanner(file,"UTF-8");
JSONObject object;
int i = 0;
Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
while(scn.hasNext()){
object= (JSONObject) parser.parse(scn.nextLine());
SolrInputDocument doc = indexJSON(object);
batch.add(doc);
if(i%1000==0){
System.out.println("Indexed " + (i+1) + " objects." );
solr.add(batch);
batch = new ArrayList<SolrInputDocument>();
}
i++;
}
solr.add(batch);
solr.commit();
System.out.println("Indexed " + (i+1) + " objects." );
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws
ParseException, IOException, SolrServerException {
Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

SolrInputDocument mainEvent = new SolrInputDocument();
mainEvent.addField("id", generateID());
mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));

Object obj = parser.parse(jsonOBJ.get("User").toString());
JSONObject userObj = (JSONObject) obj;

SolrInputDocument childUserEvent = new SolrInputDocument();
childUserEvent.addField("id", generateID());
childUserEvent.addField("User", userObj.get("User"));

obj = parser.parse(jsonOBJ.get("EventDescription").toString());
JSONObject eventdescriptionObj = (JSONObject) obj;

SolrInputDocument childEventDescEvent = new SolrInputDocument();
childEventDescEvent.addField("id", generateID());
childEventDescEvent.addField("EventApplicationName",
eventdescriptionObj.get("EventApplicationName"));
childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));

obj= JSONValue.parse(eventdescriptionObj.get("Information").toString());
JSONArray informationArray = (JSONArray) obj;

for(int i = 0; i < informationArray.size(); i++){
JSONObject domain = (JSONObject) informationArray.get(i);
[...]

I would be grateful if you could let me know what I am missing.

On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson 
wrote:

> First thing is it looks like you're only sending one document at a
> time, perhaps with child objects. This is not optimal at all. I
> usually batch my docs up in groups of 1,000, and there is anecdotal
> evidence that there may (depending on the docs) be some gains above
> that number. Gotta balance the batch size off against how big the docs
> are of course.
>
> Assuming that you really are calling this method for one doc (and
> children) at a time, the far bigger problem other than calling
> server.add for each parent/children is that you're then calling
> solr.commit() every time. This is an anti-pattern. Generally, let the
> autoCommit setting in solrconfig.xml handle the intermediate commits
> while the indexing program is running and only issue a commit at the
> very end of the job if at all.
>
> Best,
> Erick
>
> On Sun, Jul 19, 2015 at 12:08 PM, Vineeth Dasaraju
>  wrote:
> > Hi,
> >
> > I am trying to index JSON objects (which contain nested JSON objects and
> > Arrays in them) into solr.
> >
> > My JSON Object looks like the following (This is fake data that I am
> using
> > for this example):
> >
> > {
> > "RawEventMessage": "Lorem ipsum dolor sit amet, consectetur
> adipiscing
> > elit. Aliquam dolor orci, placerat ac pretium a, tincidunt consectetur
> > mauris. Etiam sollicitudin sapien id odio tempus, non sodales odio
> iaculis.
> > Donec fringilla diam at placerat interdum. Proin vitae arcu non augue
> > facilisis auctor id non neque. Integer non nibh sit amet justo facilisis
> > semper a vel ligula. Pellentesque commodo vulputate consequat. ",
> > "EventUid": "1279706565",
> > "TimeOfEvent": "2015-05-01-08-07-13",
> > "TimeOfEventUTC": "2015-05-01-01-07-13",
> > "EventCollector": "kafka",
> > "EventMessageType": "kafka-@column",
> > "User": {
> > "User": "Lorem ipsum",
> > "UserGroup": "Manager",
> > "Location": "consectetur adipiscing",
> > "Department": "Legal"
> > },
> > "EventDescription": {
> > "EventApplicationName": ""

Re: Performance of facet contain search in 5.2.1

2015-07-21 Thread Erick Erickson
"contains" has to basically examine each and every term to see if it
matches. Say my
facet.contains=bbb. A matching term could be
aaabbbxyz
or
zzzbbbxyz

So there's no way to _know_ when you've found them all without
examining every last
one. So I'd try to redefine the problem to not require that. If it's
absolutely required,
you can do some interesting things but it's going to inflate your index.

For instance, "rotate" words (assuming word boundaries here). So, for
instance, you have
a text field with "my dog has fleas". Index things like
my dog has fleas|my dog has fleas
dog has fleas my|my dog has fleas
has fleas my dog|my dog has fleas
fleas my dog has|my dog has fleas

Literally with the pipe followed by the original text. Now all your
contains clauses are
simple prefix facets, and you can have the UI split the token on the
pipe and display the
original.
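A small sketch of that rotation idea (word-boundary rotations, joined back to
the original text with a pipe; purely illustrative):

import java.util.ArrayList;
import java.util.List;

public class Rotations {
    // "my dog has fleas" -> "my dog has fleas|my dog has fleas",
    //                       "dog has fleas my|my dog has fleas", ...
    public static List<String> rotations(String original) {
        String[] words = original.split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder rotated = new StringBuilder();
            for (int j = 0; j < words.length; j++) {
                if (j > 0) rotated.append(' ');
                rotated.append(words[(i + j) % words.length]);
            }
            out.add(rotated.append('|').append(original).toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // each rotation is indexed as a facet value; facet.prefix then answers "contains"
        for (String r : rotations("my dog has fleas")) {
            System.out.println(r);
        }
    }
}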

Best,
Erick


On Tue, Jul 21, 2015 at 1:16 AM, Lo Dave  wrote:
> I found that facet contain search take much longer time than facet prefix 
> search. Do anyone have idea how to make contain search faster?
> org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select 
> params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852}
>  hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] 
> webapp=/solr path=/select 
> params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true}
>  hits=1856 status=0 QTime=10951
> As shown above, the prefix search takes QTime=5 but the contains search takes QTime=10951.
> Thanks.
>


IntelliJ setup

2015-07-21 Thread Andrew Musselman
I followed the instructions here
https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant
idea`, but I'm still not getting the links in solr classes and methods; do
I need to add libraries, or am I missing something else?

Thanks!


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
I'm not sure, it's a remote team but will get more info.  For now, assuming
that a certain directory is specified, like "/user/andrew/", and a regex is
applied to capture anything two directories below matching "*/*/*.pdf".

Would there be a way to capture the wild-carded values and index them as
fields?

On Tue, Jul 21, 2015 at 11:20 AM, Upayavira  wrote:

> Keeping to the user list (the right place for this question).
>
> More information is needed here - how are you getting these documents
> into Solr? Are you posting them to /update/extract? Or using DIH, or?
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> > Dear user and dev lists,
> >
> > We are loading files from a directory and would like to index a portion
> > of
> > each file path as a field as well as the text inside the file.
> >
> > E.g., on HDFS we have this file path:
> >
> > /user/andrew/1234/1234/file.pdf
> >
> > And we would like the "1234" token parsed from the file path and indexed
> > as
> > an additional field that can be searched on.
> >
> > From my initial searches I can't see how to do this easily, so would I
> > need
> > to write some custom code, or a plugin?
> >
> > Thanks!
>


Re: Parsing and indexing parts of the input file paths

2015-07-21 Thread Upayavira
Keeping to the user list (the right place for this question).

More information is needed here - how are you getting these documents
into Solr? Are you posting them to /update/extract? Or using DIH, or?

Upayavira

On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote:
> Dear user and dev lists,
> 
> We are loading files from a directory and would like to index a portion
> of
> each file path as a field as well as the text inside the file.
> 
> E.g., on HDFS we have this file path:
> 
> /user/andrew/1234/1234/file.pdf
> 
> And we would like the "1234" token parsed from the file path and indexed
> as
> an additional field that can be searched on.
> 
> From my initial searches I can't see how to do this easily, so would I
> need
> to write some custom code, or a plugin?
> 
> Thanks!


Parsing and indexing parts of the input file paths

2015-07-21 Thread Andrew Musselman
Dear user and dev lists,

We are loading files from a directory and would like to index a portion of
each file path as a field as well as the text inside the file.

E.g., on HDFS we have this file path:

/user/andrew/1234/1234/file.pdf

And we would like the "1234" token parsed from the file path and indexed as
an additional field that can be searched on.

From my initial searches I can't see how to do this easily, so would I need
to write some custom code, or a plugin?

Thanks!


Re: solr blocking and client timeout issue

2015-07-21 Thread Jeremy Ashcraft
I did find a dark corner of our application where a dev had left some 
experimental code that snuck past QA because it was rarely used.  A 
client discovered it and had been using it heavily over the past week.  It 
was generating multiple consecutive update/commit requests.  It's been 
disabled and the long GC pauses have nearly stopped (so far).  We did 
see one at about 4am for about 5 minutes.


Is there a way to mitigate these longer GC pauses if/when they do 
happen? (FYI, we are upgrading to OpenJDK 1.8 tonight.  It's been working 
great in dev/QA, so hopefully it will make enough of a difference.)


On 07/20/2015 09:31 PM, Erick Erickson wrote:

bq: the config is set up per the NRT suggestions in the docs.
autoSoftCommit every 2 seconds and autoCommit every 10 minutes.

2 second soft commit is very aggressive, no matter what the NRT
suggestions are. My first question is whether that's really needed.
The soft commits should be as long as you can stand. And don't listen
to  your product manager who says "2 seconds is required", push back
and answer whether that's really necessary. Most people won't notice
the difference.

bq: ...we are noticing a lot higher number of hard commits than usual.

Is a client somewhere issuing a hard commit? This is rarely
recommended... And is openSearcher true or false? False is a
relatively cheap operation, true is quite expensive.

More than you want to know about hard and soft commits:

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick

Best,
Erick

On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft  wrote:

heap is already at 5GB

On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote:

no swapping that I'm seeing, although we are noticing a lot higher number
of hard commits than usual.

the config is set up per the NRT suggestions in the docs.  autoSoftCommit
every 2 seconds and autoCommit every 10 minutes.

there have been 463 updates in the past 2 hours, all followed by hard
commits

INFO  - 2015-07-20 12:26:20.979;
org.apache.solr.update.DirectUpdateHandler2; start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-07-20 12:26:21.021; org.apache.solr.core.SolrDeletionPolicy;
SolrDeletionPolicy.onCommit: commits: num=2

commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696}

commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index
lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697}
INFO  - 2015-07-20 12:26:21.022; org.apache.solr.core.SolrDeletionPolicy;
newest commit generation = 665697
INFO  - 2015-07-20 12:26:21.026;
org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-07-20 12:26:21.026;
org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
webapp=/solr path=/update params={omitHeader=false&wt=json}
{add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752),
5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328),
816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0 50

could that be causing some of the problems?


From: Shawn Heisey 
Sent: Monday, July 20, 2015 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: solr blocking and client timeout issue

On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote:

I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i
can get production upgraded tonight.

still getting the big GC pauses this morning, even after applying the
GC tuning options.  Everything was fine throughout the weekend.

My biggest concern is that this instance had been running with no
issues for almost 2 years, but these GC issues started just last week.

It's very possible that you're simply going to need a larger heap than
you have needed in the past, either because your index has grown, or
because your query patterns have changed and now your queries need more
memory.  It could even be both of these.

At your current index size, assuming that there's nothing else on this
machine, you should have enough memory to raise your heap to 5GB.

If there ARE other software pieces on this machine, then the long GC
pauses (along with other performance issues) could be explained by too
much memory allocation out of the 8GB total memory, resulting in
swapping at the OS level.

Thanks,
Shawn


--
*jeremy ashcraft*
development manager
EdGate Correlation Services 
/253.853.7133 x228/


--
*jeremy ashcraft*
development manager
EdGate Correlation Services 
/253.853.7133 x228/


upgrade clusterstate.json fom 4.10.4 to split state.json in 5.2.1

2015-07-21 Thread Yago Riveiro
Hi,


How can I upgrade the clusterstate.json to be split by collection?


I read this issue https://issues.apache.org/jira/browse/SOLR-5473.


In theory there is a param “stateFormat” that, when set to 2, tells Solr to use the 
/collections/<collection>/state.json format.


Where can I configure this?

—/Yago Riveiro

Re: SOLR nrt read writes

2015-07-21 Thread Alessandro Benedetti
>
> Could this be due to caching? I have tried to disable all in my solrconfig.


If you mean Solr caches? No.
Solr caches live for the life of the searcher.
So new searcher, new caches (possibly warmed with updated results).

If you mean your application caching or browser caching, you should verify,
i assume you have control on that.

Cheers

2015-07-21 6:02 GMT+01:00 Bhawna Asnani :

> Thanks, I tried turning off auto softCommits but that didn't help much.
> Still seeing stale results every now and then. Also load on the server very
> light. We are running this just on a test server with one or two users. I
> don't see any warning in logs whole doing softCommits and it says it
> successfully opened new searcher and registered it as main searcher. Could
> this be due to caching? I have tried to disable all in my solrconfig.
>
> Sent from my iPhone
>
> > On Jul 20, 2015, at 12:16 PM, Shawn Heisey  wrote:
> >
> >> On 7/20/2015 9:29 AM, Bhawna Asnani wrote:
> >> Thanks for your suggestions. The requirement is still the same , to be
> >> able to make a change to some solr documents and be able to see it on
> >> subsequent search/facet calls.
> >> I am using softCommit with waitSearcher=true.
> >>
> >> Also I am sending reads/writes to a single solr node only.
> >> I have tried disabling caches and warmup time in logs is '0' but every
> >> once in a while I do get the document just updated with stale data.
> >>
> >> I went through lucene documentation and it seems opening the
> >> IndexReader with the IndexWriter should make the changes visible to
> >> the reader.
> >>
> >> I checked solr logs no errors. I see this in logs each time
> >> 'Registered new searcher Searcher@x' even before searches that had
> >> the stale document.
> >>
> >> I have attached my solrconfig.xml for reference.
> >
> > Your attachment made it through the mailing list processing.  Most
> > don't, I'm surprised.  Some thoughts:
> >
> > maxBooleanClauses has been set to 40.  This is a lot.  If you
> > actually need a setting that high, then you are sending some MASSIVE
> > queries, which probably means that your Solr install is exceptionally
> > busy running those queries.
> >
> > If the server is fairly busy, then you should increase maxTime on
> > autoCommit.  I use a value of five minutes (30) ... and my server is
> > NOT very busy most of the time.  A commit with openSearcher set to false
> > is relatively fast, but it still has somewhat heavy CPU, memory, and
> > disk I/O resource requirements.
> >
> > You have autoSoftCommit set to happen after five seconds.  If updates
> > happen frequently or run for very long, this is potentially a LOT of
> > committing and opening new searchers.  I guess it's better than trying
> > for one second, but anything more frequent than once a minute is likely
> > to get you into trouble unless the system load is extremely light ...
> > but as already discussed, your system load is probably not light.
> >
> > For the kind of Near Real Time setup you have mentioned, where you want
> > to do one or more updates, commit, and then query for the changes, you
> > probably should completely remove autoSoftCommit from the config and
> > *only* open new searchers with explicit soft commits.  Let autoCommit
> > (with a maxTime of 1 to 5 minutes) handle durability concerns.
> >
> > A lot of pieces in your config file are set to depend on java system
> > properties just like the example does, but since we do not know what
> > system properties have been set, we can't tell for sure what those parts
> > of the config are doing.
> >
> > Thanks,
> > Shawn
> >
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Alessandro Benedetti
Hi Mese,

let me try to answer to your 2 questions :


1. What happens if a shard (both leader and replica) goes down? If the
> document on the "dead shard" is updated, will it forward the document to
> the
> new shard? If so, when the "dead shard" comes up again, will this not be
> considered for the same hash key range?
>

I see some confusion here.
First of all you need a smart client that will load balance the docs to
index.
Let's say the CloudSolrClient .
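
(As a reference, a minimal sketch of indexing through CloudSolrClient; the ZooKeeper
ensemble address and the collection name below are placeholders.)

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            client.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "some-unique-id");
            doc.addField("title", "example document");

            // The client reads the cluster state from ZooKeeper and routes the
            // document to the leader of the shard owning this id's hash range.
            client.add(doc);
            client.commit();
        }
    }
}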

A solr document update is always a deletion and a re-insertion.
This means that you get the document from the index ( the stored fields),
and you add the document again.

If the document is on a dead shard, you have lost it, you can not retrieve
it until you have that shard to go up again.
Possibly it's still in the transaction log.

In the case where you are re-indexing the doc, the doc will be re-indexed.
When the shard is up again, there will be 2 versions of the document,
with some different fields but the same id.
What do you mean by "will this not be
considered for the same hash key range"?



> 2. Is there a way to fix this[removing duplicates across shards]?


 I assume there is no easy way.
You could re-index the content applying a Deduplication Update Request
processor.
But it will be costly.

Cheers

2015-07-21 15:01 GMT+01:00 Reitzel, Charles :

> Also, the function used to generate hashes is
> org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a
> 32-bit value.   The range of the hash values assigned to each shard are
> resident in Zookeeper.   Since you are using only a single hash component,
> all 32-bits will be used by the entire ID field value.
>
> I.e. I see no routing delimiter (!) in your example ID value:
>
>
> "possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30"
>
> Which isn't required, but it means that documents (logs?) will be
> distributed in a round-robin fashion over the shards.  Not grouped by host
> or environment (if I am reading it right).
>
> You might consider the following:  <environment>!<host>!UUID
>
> E.g. "intl-staging!possting.mongo-v2.services.com
> !c2d2a376-5e4a-11e2-8963-0026b9414f30"
>
> This way documents from the same host will be grouped together, most
> likely on the same shard.  Further, within the same environment, documents
> will be grouped on the same subset of shards. This will allow client
> applications to set _route_=<environment>!  or
> _route_=<environment>!<host>! and limit queries to those shards
> containing relevant data when the corresponding filter queries are applied.
>
> If you were using route delimiters, then the default for a 2-part key (1
> delimiter) is to use 16 bits for each part.  The default for a 3-part key
> (2 delimiters) is to use 8-bits each for the 1st 2 parts and 16 bits for
> the 3rd part.   In any case, the high-order bytes of the hash dominate the
> distribution of data.
>
> -Original Message-
> From: Reitzel, Charles
> Sent: Tuesday, July 21, 2015 9:55 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr Cloud: Duplicate documents in multiple shards
>
> When are you generating the UUID exactly?   If you set the unique ID field
> on an "update", and it contains a new UUID, you have effectively created a
> new document.   Just a thought.
>
> -Original Message-
> From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com]
> Sent: Tuesday, July 21, 2015 4:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cloud: Duplicate documents in multiple shards
>
> Unable to delete by passing distrib=false as well. Also it is difficult to
> identify those duplicate documents among the 130 million.
>
> Is there a way we can see the generated hash key and mapping them to the
> specific shard?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
> *
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA-CREF
> *
>
>


-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Query Performance

2015-07-21 Thread Nagasharath
I tried using SolrMeter, but for some reason it does not detect my URL and 
throws a Solr server exception.

Sent from my iPhone

> On 21-Jul-2015, at 10:58 am, Alessandro Benedetti 
>  wrote:
> 
> SolrMeter mate,
> 
> http://code.google.com/p/solrmeter/
> 
> Take a look, it will help you a lot !
> 
> Cheers
> 
> 2015-07-21 16:49 GMT+01:00 Nagasharath :
> 
>> Any recommended tool to test the query performance would be of great help.
>> 
>> Thanks
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England


Re: Query Performance

2015-07-21 Thread Alessandro Benedetti
SolrMeter mate,

http://code.google.com/p/solrmeter/

Take a look, it will help you a lot !

Cheers

2015-07-21 16:49 GMT+01:00 Nagasharath :

> Any recommended tool to test the query performance would be of great help.
>
> Thanks
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Migrating junit tests from Solr 4.5.1 to Solr 5.2.1

2015-07-21 Thread Rich Hume
I am migrating from Solr 4.5.1 to Solr 5.2.1 on a Windows platform.  I am using 
multi-core, but not Solr cloud.  I am having issues with my suite of junit 
tests.  My tests currently use code I found in SOLR-4502.

I was wondering whether anyone could point me at best-practice examples of 
multi-core junit tests for Solr 5.2.1?

Thanks
Rich
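
(Not a verified best practice, but one common approach is an EmbeddedSolrServer over
a test solr home with one directory per core; a rough sketch follows. The solr home
path and core name are placeholders, and constructor variants differ a bit between
5.x releases, so treat this as a starting point rather than an authoritative answer.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class TwoCoreSmokeTest {
    private static CoreContainer container;

    @BeforeClass
    public static void startCores() {
        // solr home contains solr.xml plus one directory per core (core1, core2, ...)
        container = new CoreContainer("src/test/resources/solrhome");
        container.load();
    }

    @AfterClass
    public static void stopCores() {
        container.shutdown();
    }

    @Test
    public void indexAndQueryOneCore() throws Exception {
        EmbeddedSolrServer core1 = new EmbeddedSolrServer(container, "core1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        core1.add(doc);
        core1.commit();
        assertEquals(1, core1.query(new SolrQuery("id:1")).getResults().getNumFound());
    }
}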



Query Performance

2015-07-21 Thread Nagasharath
Any recommended tool to test the query performance would be of great help.

Thanks


Re: Use REST API URL to update field

2015-07-21 Thread Zheng Lin Edwin Yeo
Ok. Thanks for your advice.

Regards,
Edwin

On 21 July 2015 at 15:37, Upayavira  wrote:

> curl is just a command line HTTP client. You can use HTTP POST to send
> the JSON that you are mentioning below via any means that works for you
> - the file does not need to exist on disk - it just needs to be added to
> the body of the POST request.
>
> I'd say review how to do HTTP POST requests from your chosen programming
> language and you should see how to do this.
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 04:12 AM, Zheng Lin Edwin Yeo wrote:
> > Hi Shawn,
> >
> > So it means that if my following is in a text file called update.txt,
> >
> > {"id":"testing_0001",
> >
> > "popularity":{"inc":1}
> >
> > This text file must still exist if I use the URL? Or can this information
> > in the text file be put directly onto the URL?
> >
> > Regards,
> > Edwin
> >
> >
> > On 20 July 2015 at 22:04, Shawn Heisey  wrote:
> >
> > > On 7/20/2015 2:06 AM, Zheng Lin Edwin Yeo wrote:
> > > > I'm using Solr 5.2.1, and I would like to check, is there a way to
> update
> > > > certain field by using REST API URL directly instead of using curl?
> > > >
> > > > For example, I would like to increase the "popularity" field in my
> index
> > > > each time a user click on the record.
> > > >
> > > > Currently, it can work with the curl command by having this in my
> text
> > > file
> > > > to be read by curl (the "id" is hard-coded here for example purpose)
> > > >
> > > > {"id":"testing_0001",
> > > >
> > > > "popularity":{"inc":1}
> > > >
> > > >
> > > > Is there a REST API URL that I can call to achieve the same purpose?
> > >
> > > The URL that you would use with curl *IS* the URL that you would use
> for
> > > a REST-like call.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
>
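
(For illustration of the point above, a minimal Java sketch that POSTs the same
atomic-update JSON without curl; the host, the collection name "collection1", and
the commit=true parameter are assumptions made for the example.)

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PopularityUpdater {
    public static void main(String[] args) throws Exception {
        // Same JSON as in the curl example: an atomic "inc" on the popularity field
        String body = "[{\"id\":\"testing_0001\",\"popularity\":{\"inc\":1}}]";

        URL url = new URL("http://localhost:8983/solr/collection1/update?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // 200 means Solr accepted the update; no file on disk is involved
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}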


Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
Hey Shawn, when I use the -m 2g option in my script I get the error 'cannot
open [path]/server/logs/solr.log for reading: No such file or directory'. I
do not see how this would affect that. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218389.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
Okay. I'm going to run the index again with specifications that you
recommended. This could take a few hours but I will post the entire trace on
that error when it pops up again and I will let you guys know the results of
increasing the heap size. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218382.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-21 Thread Shawn Heisey
On 7/21/2015 8:17 AM, Paden wrote:
> There are some zip files inside the directory and have been addressed to in
> the database. I'm thinking those are the one's it's jumping right over. They
> are not the issue. At least I'm 95% sure. And Shawn if you're still watching
> I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr
5.x?  Tika can require a lot of memory.  I would have expected there to
be OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the "-m" option on the startup scripts to increase
the max heap.  Starting with "-m 2g" would be a good idea.

Also, seeing the entire multi-line IOException from the log (which may
be dozens of lines) could be important.

Thanks,
Shawn



Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
There are some zip files inside the directory that are referenced in
the database. I'm thinking those are the ones it's jumping right over. They
are not the issue. At least I'm 95% sure. And Shawn, if you're still watching,
I'm sorry: I'm using solr-5.1.0.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218371.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles
Also, the function used to generate hashes is 
org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit 
value.   The range of the hash values assigned to each shard is resident in 
Zookeeper.   Since you are using only a single hash component, all 32-bits will 
be used by the entire ID field value.   

I.e. I see no routing delimiter (!) in your example ID value:

"possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30"

Which isn't required, but it means that documents (logs?) will be distributed 
in a round-robin fashion over the shards.  Not grouped by host or environment 
(if I am reading it right).

You might consider the following:  <environment>!<host>!UUID

E.g. 
"intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30"

This way documents from the same host will be grouped together, most likely on 
the same shard.  Further, within the same environment, documents will be 
grouped on the same subset of shards. This will allow client applications to 
set _route_=<environment>!  or _route_=<environment>!<host>! and limit 
queries to those shards containing relevant data when the corresponding filter 
queries are applied.

If you were using route delimiters, then the default for a 2-part key (1 
delimiter) is to use 16 bits for each part.  The default for a 3-part key (2 
delimiters) is to use 8-bits each for the 1st 2 parts and 16 bits for the 3rd 
part.   In any case, the high-order bytes of the hash dominate the distribution 
of data.
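
(To make the mapping visible, here is a small sketch that computes the same 32-bit
hash for a given id with Solr's Hash utility; it assumes the default compositeId
router with no "!" delimiter and that solr-solrj is on the classpath. The resulting
hex value can be compared against the shard "range" entries in clusterstate.json.)

import org.apache.solr.common.util.Hash;

public class ShardHashCheck {
    public static void main(String[] args) {
        String id = "possting.mongo-v2.services.com-intl-staging-"
                  + "c2d2a376-5e4a-11e2-8963-0026b9414f30";

        // The same call the compositeId router makes for a plain id (no "!"):
        // 32-bit MurmurHash3 over the whole id with seed 0.
        int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);

        // Printed as hex so it can be compared with a shard's range,
        // e.g. "80000000-b332ffff", stored in ZooKeeper.
        System.out.println(id + " -> " + Integer.toHexString(hash));
    }
}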

-Original Message-
From: Reitzel, Charles 
Sent: Tuesday, July 21, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: Duplicate documents in multiple shards

When are you generating the UUID exactly?   If you set the unique ID field on 
an "update", and it contains a new UUID, you have effectively created a new 
document.   Just a thought.

-Original Message-
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] 
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

Unable to delete by passing distrib=false as well. Also it is difficult to 
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the 
specific shard?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*



RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles
When are you generating the UUID exactly?   If you set the unique ID field on 
an "update", and it contains a new UUID, you have effectively created a new 
document.   Just a thought.

-Original Message-
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] 
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

Unable to delete by passing distrib=false as well. Also it is difficult to 
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the 
specific shard?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*



Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content

2015-07-21 Thread Yonik Seeley
On Tue, Jul 21, 2015 at 3:09 AM, Ali Nazemian  wrote:
> Dear Erick,
> I found another thing, I did check the number of unique terms for this
> field using schema browser, It reported 1683404 number of terms! Does it
> exceed the maximum number of unique terms for "fcs" facet method?

The real limit is not simple since the data is not stored in a simple
way (it's compressed).

> I read
> somewhere it should be more than 16m does it true?!

More like 16MB of delta-coded terms per block of documents (the index
is split up into 256 blocks for this purpose)

See DocTermOrds.java if you want more details than that.

-Yonik


RE: Programmatically find out if node is overseer

2015-07-21 Thread Markus Jelsma
Hello - this approach not only solves the problem but also allows me to run 
different processing threads on other nodes.

Thanks!
Markus
 
-Original message-
> From:Chris Hostetter 
> Sent: Saturday 18th July 2015 1:00
> To: solr-user 
> Subject: Re: Programmatically find out if node is overseer
> 
> 
> : Hello - i need to run a thread on a single instance of a cloud so need 
> : to find out if current node is the overseer. I know we can already 
> : programmatically find out if this replica is the leader of a shard via 
> : isLeader(). I have looked everywhere but i cannot find an isOverseer. I 
> 
> At one point, i worked up a utility method to give internal plugins 
> access to an "isOverseer()" type utility method...
> 
>    https://issues.apache.org/jira/browse/SOLR-5823
> 
> ...but ultimately i abandoned this because i was completely forgetting 
> (until much much too late) that there's really no reason to assume that 
> any/all collections will have a single shard on the same node as the 
> overseer -- so having a plugin that only does stuff if it's running on the 
> overseer node is a really bad idea, because it might not run at all. (even 
> if it's configured in every collection)
> 
> 
> what i ultimately wound up doing (see SOLR-5795) is implementing a 
> solution where every core (of each collection configured to want this 
> functionality) has a thread running (a TimedExecutor) which would do 
> nothing unless...
>  * my slice is active? (ie: not in the process of being shut down)
>  * my slice is 'first' in a sorted list of slices?
>  * i am currently the leader of my slice?
> 
> ...that way when the timer goes off ever X minutes, at *most* one thread 
> fires (we might sporadically get no events triggered if/when there is 
> leader election in progress for the slice that matters)
> 
> the choice of "first" slice name alphabetically is purely because it's 
> something cheap to compute and guaranteed to be unique.
> 
> 
> If you truly want exactly one thread for the entire cluster, regardless of 
> collection, you could do the same basic idea by just adding a "my 
> collection is 'first' in a sorted list of collection names?"
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 
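
(A rough, untested Java sketch of the check described above: run the work only on
the core that is currently the leader of the alphabetically first active slice.
The accessor paths and method names are from memory of the 5.x APIs and may differ
slightly between releases.)

import java.util.TreeSet;
import org.apache.solr.cloud.CloudDescriptor;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.core.SolrCore;

public class SingletonWorkCheck {

    // True only on the leader of the alphabetically first active slice,
    // so at most one core per collection does the work when the timer fires.
    public static boolean shouldRunHere(SolrCore core) {
        CloudDescriptor cloud = core.getCoreDescriptor().getCloudDescriptor();
        if (cloud == null) {
            return true; // not running in SolrCloud, nothing to coordinate
        }
        ZkStateReader zkReader = core.getCoreDescriptor().getCoreContainer()
                .getZkController().getZkStateReader();
        DocCollection coll = zkReader.getClusterState()
                .getCollection(cloud.getCollectionName());

        // Alphabetically first active slice: cheap to compute and unique.
        TreeSet<String> activeSlices = new TreeSet<>();
        for (Slice slice : coll.getActiveSlices()) {
            activeSlices.add(slice.getName());
        }
        if (activeSlices.isEmpty() || !activeSlices.first().equals(cloud.getShardId())) {
            return false;
        }

        // Am I currently the leader of that slice?
        Replica leader = coll.getSlice(cloud.getShardId()).getLeader();
        return leader != null && leader.getName().equals(cloud.getCoreNodeName());
    }
}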


Re: Performance of facet contain search in 5.2.1

2015-07-21 Thread Alessandro Benedetti
Hi Dave,
generally giving terms in a dictionary, it's much more efficient to run
prefix queries than "contain" queries.
Talking about using docValues, if I remember well when they are loaded in
memory they are skipList, so you can use two operators on them :

- next() that simply gives you ht next field value for the field doc values
loaded
- advance ( ByteRef term) which jump to the term of the greatest term if
the one searched is missing.

Using the facet prefix we can jump to the point we want and basically
iterate the values that are matching.

To verify the contains, it is simply used on each term in the docValues,
term by term, using the StringUtil.contains() .
How many different unique terms do you have in the index for that field ?

So the difference in performance makes sense (to simplify, we are basically moving
from logarithmic to linear).
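
(To make the logarithmic-versus-linear point concrete, a tiny standalone sketch,
not Solr code, comparing a prefix lookup over a sorted term list with a contains
scan; the terms are made up.)

import java.util.Arrays;

public class PrefixVsContains {
    public static void main(String[] args) {
        // Stand-in for the sorted, unique terms of the facet field
        String[] terms = {"breach of contract", "duty of care", "duty of care owed", "negligence"};
        String target = "duty of care";

        // facet.prefix style: binary search to the first term >= target, then
        // walk forward while the prefix still matches (log n plus the matches).
        int start = Arrays.binarySearch(terms, target);
        if (start < 0) start = -start - 1;          // insertion point if not found
        for (int i = start; i < terms.length && terms[i].startsWith(target); i++) {
            System.out.println("prefix match: " + terms[i]);
        }

        // facet.contains style: no way to skip, every term is inspected (linear).
        for (String term : terms) {
            if (term.contains(target)) {
                System.out.println("contains match: " + term);
            }
        }
    }
}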

I read the name of the field as "facet.field=autocomplete"; is it fair to
ask whether you are using faceting to obtain infix autocompletion?
If that is the case, can you help us better identify the problem, so we can maybe
provide you with a better solution?

Cheers



2015-07-21 9:16 GMT+01:00 Lo Dave :

> I found that a facet contains search takes much longer than a facet prefix
> search. Does anyone have an idea how to make the contains search faster?
> org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select
> params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852}
> hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance]
> webapp=/solr path=/select
> params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true}
> hits=1856 status=0 QTime=10951
> As shown above, the prefix search takes 5 ms but the contains search takes 10951 ms.
> Thanks.
>




-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Performance of facet contain search in 5.2.1

2015-07-21 Thread Lo Dave
I found that a facet contains search takes much longer than a facet prefix 
search. Does anyone have an idea how to make the contains search faster?
org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select 
params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852}
 hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] 
webapp=/solr path=/select 
params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true}
 hits=1856 status=0 QTime=10951 
As shown above, the prefix search takes 5 ms but the contains search takes 10951 ms.
Thanks.
  

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread mesenthil1
Unable to delete by passing distrib=false as well. Also it is difficult to
identify those duplicate documents among the 130 million. 

Is there a way we can see the generated hash key and mapping them to the
specific shard?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr blocking and client timeout issue

2015-07-21 Thread Daniel Collins
We have a similar situation: production runs Java 7u10 (yes, we know its
old!), and has custom GC options (G1 works well for us), and a 40Gb heap.
We are a heavy user of NRT (sub-second soft-commits!), so that may be the
common factor here.

Every time we have tried a later Java 7 or Java 8, the heap blows up in no
time at all.  We are still investigating the root cause (we do need to
migrate to Java 8), but I'm thinking that very high commit rates seem to be
the common link here (and its not a common Solr use case I admit).

I don't have any silver bullet answers to offer yet, but my
suspicion/conjecture (no real evidence yet, I admit) is that the frequent
commits are leaving temporary objects around (which they are entitled to
do), and something has changed in the GC in later Java 7/8 which means they
are slower to get rid of those, hence the overall heap usage is higher
under this use case.

@Jeremy, you don't have a lot of head room, but try a higher heap size?
Could you go to 6Gb and see if that at least delays the issue?

Erick is correct though, if you can reduce the commit rate, I'm sure that
would alleviate the issue.

On 21 July 2015 at 05:31, Erick Erickson  wrote:

> bq: the config is set up per the NRT suggestions in the docs.
> autoSoftCommit every 2 seconds and autoCommit every 10 minutes.
>
> 2 second soft commit is very aggressive, no matter what the NRT
> suggestions are. My first question is whether that's really needed.
> The soft commits should be as long as you can stand. And don't listen
> to  your product manager who says "2 seconds is required", push back
> and answer whether that's really necessary. Most people won't notice
> the difference.
>
> bq: ...we are noticing a lot higher number of hard commits than usual.
>
> Is a client somewhere issuing a hard commit? This is rarely
> recommended... And is openSearcher true or false? False is a
> relatively cheap operation, true is quite expensive.
>
> More than you want to know about hard and soft commits:
>
>
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Best,
> Erick
>
> Best,
> Erick
>
> On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft 
> wrote:
> > heap is already at 5GB
> >
> > On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote:
> >>
> >> no swapping that I'm seeing, although we are noticing a lot higher
> number
> >> of hard commits than usual.
> >>
> >> the config is set up per the NRT suggestions in the docs.
> autoSoftCommit
> >> every 2 seconds and autoCommit every 10 minutes.
> >>
> >> there have been 463 updates in the past 2 hours, all followed by hard
> >> commits
> >>
> >> INFO  - 2015-07-20 12:26:20.979;
> >> org.apache.solr.update.DirectUpdateHandler2; start
> >>
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> >> INFO  - 2015-07-20 12:26:21.021;
> org.apache.solr.core.SolrDeletionPolicy;
> >> SolrDeletionPolicy.onCommit: commits: num=2
> >>
> >> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
> /opt/solr/solr/collection1/data/index
> >> lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
> >> maxCacheMB=48.0
> maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696}
> >>
> >> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
> /opt/solr/solr/collection1/data/index
> >> lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd;
> >> maxCacheMB=48.0
> maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697}
> >> INFO  - 2015-07-20 12:26:21.022;
> org.apache.solr.core.SolrDeletionPolicy;
> >> newest commit generation = 665697
> >> INFO  - 2015-07-20 12:26:21.026;
> >> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> >> INFO  - 2015-07-20 12:26:21.026;
> >> org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> >> webapp=/solr path=/update params={omitHeader=false&wt=json}
> >> {add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752),
> >> 5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328),
> >> 816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0
> 50
> >>
> >> could that be causing some of the problems?
> >>
> >> 
> >> From: Shawn Heisey 
> >> Sent: Monday, July 20, 2015 11:44 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: solr blocking and client timeout issue
> >>
> >> On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote:
> >>>
> >>> I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i
> >>> can get production upgraded tonight.
> >>>
> >>> still getting the big GC pauses this morning, even after applying the
> >>> GC tuning options.  Everything was fine throughout the weekend.
> >>>
> >>> My biggest concern is that this instance had been running with no
> >>> issues for almost 2 years, but these GC issues started just last week.
> >>
> >> It's very possible that you're simply going to need a larger heap than

Re: WordDelimiterFilter Leading & Trailing Special Character

2015-07-21 Thread Upayavira
Looking at the javadoc for the WordDelimiterFilterFactory, it suggests
this config:

 <fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" protected="protectedword.txt"
             preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="1"
             catenateWords="0" catenateNumbers="0" catenateAll="0"
             generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
             types="wdfftypes.txt"/>
   </analyzer>
 </fieldType>

Note the protected="x" attribute. I suspect if you put Yahoo! into a
file referenced by that attribute, it may survive analysis. I'd be
curious to hear whether it works.

Upayavira

On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> Question about WordDelimiterFilter. The search behavior that we
> experience
> with WordDelimiterFilter satisfies well, except for the case where there
> is
> a special character either at the leading or trailing end of the term.
> 
> For instance:
> 
> *‘d&b’ *  —>  Works as expected. Finds all docs with ‘d&b’.
> *‘p!nk’*  —>  Works fine as above.
> 
> But on cases when, there is a special character towards the trailing end
> of
> the term, like ‘Yahoo!’
> 
> *‘yahoo!’* —> Turns out to be a search for just *‘yahoo’* with the
> special
> character *‘!’* stripped out.  This WordDelimiterFilter behavior is
> documented
> http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> 
> What I would like to have is, the search performed without stripping out
> the leading & trailing special character. Is there a way to achieve this
> behavior with WordDelimiterFilter.
> 
> This is current config that we have for the field:
> 
>  positionIncrementGap="100">
> 
> 
>  splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> catenateWords="0" catenateNumbers="0" catenateAll="0"
> preserveOriginal="1"
> types="specialchartypes.txt"/>
> 
> 
> 
> 
>  splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> catenateWords="0" catenateNumbers="0" catenateAll="0"
> preserveOriginal="1"
> types="specialchartypes.txt"/>
> 
> 
> 
> 
> 
> thanks


Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Upayavira
I suspect you can delete a document from the wrong shard by using
update?distrib=false.

I also suspect there are people here who would like to help you debug
this, because it has been reported before, but we haven't yet been able
to see whether it occurred due to human or software error.

Upayavira

On Tue, Jul 21, 2015, at 05:51 AM, mesenthil1 wrote:
> Thanks Erick for clarifying ..
> We are not explicitly setting the compositeId. We are using numShards=5
> alone as part of the server start up. We are using uuid as unique field.
> 
> One sample id is :
> 
> possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30
> 
> 
> Not sure how it would have gone to multiple shards.  Do you have any
> suggestion for fixing this. Or we need to completely rebuild the index.
> When the routing key is compositeId, should we explicitly set "!" with
> shard
> key? 
> 
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218296.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR nrt read writes

2015-07-21 Thread Upayavira
Bhawna,

I think you need to reconcile yourself to the fact that what you want to
achieve is not going to be possible.

Solr (and Lucene underneath it) is HEAVILY optimised for high read/low
write situations, and that leads to some latency in content reaching the
index. If you wanted to change this, you'd have to get into some heavy
Java/Lucene coding, as I believe Twitter have done on Lucene itself.

I'd say, rather than attempting to change this, I'd say you need to work
out a way in your UI to handle this situation. E.g. have a "refresh on
stale results" button, or "not seeing your data, try here". Or, if a
user submits data, then wants to search for it in the same session, have
your UI enforce a minimum 10s delay before it sends a request to Solr,
or something like that. Efforts to solve this at the Solr end, without
spending substantial sums and effort on it, will be futile as it isn't
what Solr/Lucene are designed for.

Upayavira

On Tue, Jul 21, 2015, at 06:02 AM, Bhawna Asnani wrote:
> Thanks, I tried turning off auto softCommits but that didn't help much.
> Still seeing stale results every now and then. Also load on the server
> very light. We are running this just on a test server with one or two
> users. I don't see any warning in logs whole doing softCommits and it
> says it successfully opened new searcher and registered it as main
> searcher. Could this be due to caching? I have tried to disable all in my
> solrconfig.
> 
> Sent from my iPhone
> 
> > On Jul 20, 2015, at 12:16 PM, Shawn Heisey  wrote:
> > 
> >> On 7/20/2015 9:29 AM, Bhawna Asnani wrote:
> >> Thanks for your suggestions. The requirement is still the same , to be
> >> able to make a change to some solr documents and be able to see it on
> >> subsequent search/facet calls.
> >> I am using softCommit with waitSearcher=true.
> >> 
> >> Also I am sending reads/writes to a single solr node only.
> >> I have tried disabling caches and warmup time in logs is '0' but every
> >> once in a while I do get the document just updated with stale data.
> >> 
> >> I went through lucene documentation and it seems opening the
> >> IndexReader with the IndexWriter should make the changes visible to
> >> the reader.
> >> 
> >> I checked solr logs no errors. I see this in logs each time
> >> 'Registered new searcher Searcher@x' even before searches that had
> >> the stale document. 
> >> 
> >> I have attached my solrconfig.xml for reference.
> > 
> > Your attachment made it through the mailing list processing.  Most
> > don't, I'm surprised.  Some thoughts:
> > 
> > maxBooleanClauses has been set to 40.  This is a lot.  If you
> > actually need a setting that high, then you are sending some MASSIVE
> > queries, which probably means that your Solr install is exceptionally
> > busy running those queries.
> > 
> > If the server is fairly busy, then you should increase maxTime on
> > autoCommit.  I use a value of five minutes (30) ... and my server is
> > NOT very busy most of the time.  A commit with openSearcher set to false
> > is relatively fast, but it still has somewhat heavy CPU, memory, and
> > disk I/O resource requirements.
> > 
> > You have autoSoftCommit set to happen after five seconds.  If updates
> > happen frequently or run for very long, this is potentially a LOT of
> > committing and opening new searchers.  I guess it's better than trying
> > for one second, but anything more frequent than once a minute is likely
> > to get you into trouble unless the system load is extremely light ...
> > but as already discussed, your system load is probably not light.
> > 
> > For the kind of Near Real Time setup you have mentioned, where you want
> > to do one or more updates, commit, and then query for the changes, you
> > probably should completely remove autoSoftCommit from the config and
> > *only* open new searchers with explicit soft commits.  Let autoCommit
> > (with a maxTime of 1 to 5 minutes) handle durability concerns.
> > 
> > A lot of pieces in your config file are set to depend on java system
> > properties just like the example does, but since we do not know what
> > system properties have been set, we can't tell for sure what those parts
> > of the config are doing.
> > 
> > Thanks,
> > Shawn
> > 


Re: Issue with using createNodeSet in Solr Cloud

2015-07-21 Thread Upayavira
Note, when you start up the instances, you can pass in a hostname to use
instead of the IP address. If you are using bin/solr (which you should
be!!) then you can use bin/solr -h my-host-name and that'll be used in
place of the IP.

Upayavira

On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote:
> Glad you found a solution
> 
> Best,
> Erick
> 
> On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis
>  wrote:
> > Erick, spot on!
> >
> > The nodes had been registered in zookeeper under my network interface's IP
> > address...after specifying those the command worked just fine.
> >
> > It was indeed the thing I thought was true that wasn't... :)
> >
> > Many thanks,
> > Savvas
> >
> > On 18 July 2015 at 20:47, Erick Erickson  wrote:
> >
> >> P.S.
> >>
> >> "It ain't the things ya don't know that'll kill ya, it's the things ya
> >> _do_ know that ain't so"...
> >>
> >> On Sat, Jul 18, 2015 at 12:46 PM, Erick Erickson
> >>  wrote:
> >> > Could you post your clusterstate.json? Or at least the "live nodes"
> >> > section of your ZK config? (adminUI>>cloud>>tree>>live_nodes. The
> >> > addresses of my nodes are things like 192.168.1.201:8983_solr. I'm
> >> > wondering if you're taking your node names from the information ZK
> >> > records or assuming it's 127.0.0.1
> >> >
> >> > On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis
> >> >  wrote:
> >> >> Thanks Eric,
> >> >>
> >> >> The strange thing is that although I have set the log level to "ALL" I
> >> see
> >> >> no error messages in the logs (apart from the line saying that the
> >> response
> >> >> is a 400 one).
> >> >>
> >> >> I'm quite confident the configset does exist as the collection gets
> >> created
> >> >> fine if I don't specify the createNodeSet param.
> >> >>
> >> >> Complete mystery..! I'll keep on troubleshooting and report back with my
> >> >> findings.
> >> >>
> >> >> Cheers,
> >> >> Savvas
> >> >>
> >> >> On 17 July 2015 at 02:14, Erick Erickson 
> >> wrote:
> >> >>
> >> >>> There were a couple of cases where the "no live servers" was being
> >> >>> returned when the error was something completely different. Does the
> >> >>> Solr log show something more useful? And are you sure you have a
> >> >>> configset named collection_A?
> >> >>>
> >> >>> 'cause this works (admittedly on 5.x) fine for me, and I'm quite sure
> >> >>> there are bunches of automated tests that would be failing so I
> >> >>> suspect it's just a misleading error being returned.
> >> >>>
> >> >>> Best,
> >> >>> Erick
> >> >>>
> >> >>> On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis
> >> >>>  wrote:
> >> >>> > Hello There,
> >> >>> >
> >> >>> > I am trying to use the createNodeSet parameter when creating a new
> >> >>> > collection but I'm getting an error when doing so.
> >> >>> >
> >> >>> > More specifically, I have four Solr instances running locally in
> >> separate
> >> >>> > JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985, 127.0.0.1:8986
> >> )
> >> >>> and a
> >> >>> > standalone Zookeeper instance which all Solr instances point to. The
> >> four
> >> >>> > Solr instances have no collections added to them and are all up and
> >> >>> running
> >> >>> > (I can access the admin page in all of them).
> >> >>> >
> >> >>> > Now, I want to create a collections in only two of these four
> >> instances (
> >> >>> > 127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance with the
> >> >>> > following URL:
> >> >>> >
> >> >>> >
> >> >>>
> >> http://localhost:8983/solr/admin/collections?action=CREATE&name=collection_A&numShards=1&replicationFactor=2&maxShardsPerNode=1&createNodeSet=127.0.0.1:8983_solr,127.0.0.1:8984_solr&collection.configName=collection_A
> >> >>> >
> >> >>> > I am getting the following response:
> >> >>> >
> >> >>> > 
> >> >>> > 
> >> >>> > 400
> >> >>> > 3503
> >> >>> > 
> >> >>> > 
> >> >>> >
> >> >>>
> >> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
> >> >>> > Cannot create collection collection_A. No live Solr-instances among
> >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
> >> >>> 127.0.0.1:8984
> >> >>> > _solr
> >> >>> > 
> >> >>> > 
> >> >>> > 
> >> >>> > Cannot create collection collection_A. No live Solr-instances among
> >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
> >> >>> 127.0.0.1:8984
> >> >>> > _solr
> >> >>> > 
> >> >>> > 400
> >> >>> > 
> >> >>> > 
> >> >>> > 
> >> >>> > Cannot create collection collection_A. No live Solr-instances among
> >> >>> > Solr-instances specified in createNodeSet:127.0.0.1:8983_solr,
> >> >>> 127.0.0.1:8984
> >> >>> > _solr
> >> >>> > 
> >> >>> > 400
> >> >>> > 
> >> >>> > 
> >> >>> >
> >> >>> >
> >> >>> > The instances are definitely up and running (at least the admin
> >> console
> >> >>> can
> >> >>> > be accessed as mentioned) and if I remove the createNodeSet
> >> parameter the
> >> >>> > collection is created as expected.
> >> >>> >
> >> >>> > Am I missing something obvious or is this a bug?

Re: Use REST API URL to update field

2015-07-21 Thread Upayavira
curl is just a command line HTTP client. You can use HTTP POST to send
the JSON that you are mentioning below via any means that works for you
- the file does not need to exist on disk - it just needs to be added to
the body of the POST request. 

I'd say review how to do HTTP POST requests from your chosen programming
language and you should see how to do this.

Upayavira

On Tue, Jul 21, 2015, at 04:12 AM, Zheng Lin Edwin Yeo wrote:
> Hi Shawn,
> 
> So it means that if my following is in a text file called update.txt,
> 
> {"id":"testing_0001",
> 
> "popularity":{"inc":1}
> 
> This text file must still exist if I use the URL? Or can this information
> in the text file be put directly onto the URL?
> 
> Regards,
> Edwin
> 
> 
> On 20 July 2015 at 22:04, Shawn Heisey  wrote:
> 
> > On 7/20/2015 2:06 AM, Zheng Lin Edwin Yeo wrote:
> > > I'm using Solr 5.2.1, and I would like to check, is there a way to update
> > > certain field by using REST API URL directly instead of using curl?
> > >
> > > For example, I would like to increase the "popularity" field in my index
> > > each time a user click on the record.
> > >
> > > Currently, it can work with the curl command by having this in my text
> > file
> > > to be read by curl (the "id" is hard-coded here for example purpose)
> > >
> > > {"id":"testing_0001",
> > >
> > > "popularity":{"inc":1}
> > >
> > >
> > > Is there a REST API URL that I can call to achieve the same purpose?
> >
> > The URL that you would use with curl *IS* the URL that you would use for
> > a REST-like call.
> >
> > Thanks,
> > Shawn
> >
> >


Re: Installing Banana on Solr 5.2.1

2015-07-21 Thread Upayavira

On Tue, Jul 21, 2015, at 02:00 AM, Shawn Heisey wrote:
> On 7/20/2015 5:45 PM, Vineeth Dasaraju wrote:
> > I am trying to install Banana on top of solr but haven't been able to do
> > so. All the procedures that I get are for an earlier version of solr. Since
> > the directory structure has changed in the new version, inspite of me
> > placing the banana folder under the server/solr-webapp/webapp folder, I am
> > not able to access it using the url
> > localhost:8983/banana/src/index.html#/dashboard. I would appreciate it if
> > someone can throw some more light into how I can do it.
> 
> I think you would also need an xml file in server/contexts that tells
> Jetty how to load the application.
> 
> I cloned the git repository for banana, and I see
> jetty-contexts/banana-context.xml there.  I would imagine that copying
> this xml file into server/contexts and copying the banana.war generated
> by "ant build-war" into server/webapps would be enough to install it.
> 
> If what I have said here is not enough to help you, then your best bet
> for help with this is to talk to Lucidworks.  They know Solr REALLY well.

I just tried it with the latest Solr. I downloaded v1.5.0.tgz and
unpacked it. I moved the contents of the src directory into
server/solr-webapp/webapp/banana then visited
http://localhost:8983/solr/banana/index.html and it loaded up. I then
needed to click the cog in the top right and change the collection it
was accessing from collection1 to something that was actually there.

From there, I assume the rest of it will work fine - my test system
didn't have any data in it for me to confirm that.

Upayavira


Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content

2015-07-21 Thread Ali Nazemian
Dear Erick,
I found another thing: I checked the number of unique terms for this
field using the schema browser, and it reported 1683404 terms! Does that
exceed the maximum number of unique terms for the "fcs" facet method? I read
somewhere it should be more than 16m; is that true?!

Best regards.
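
(For reference, a minimal SolrJ sketch of the facet.method=enum request mentioned in
this thread; the core URL, facet limit, and min count are placeholders chosen for the
example.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("content");
            q.setFacetLimit(20);
            q.setFacetMinCount(1);
            // enum iterates the terms and uses the filterCache per term instead of
            // un-inverting the whole field in memory.
            q.set("facet.method", "enum");

            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getFacetField("content").getValues());
        }
    }
}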


On Tue, Jul 21, 2015 at 10:00 AM, Ali Nazemian 
wrote:

> Dear Erick,
>
> Actually faceting on this field is not a user wanted application. I did
> that for the purpose of testing the customized normalizer and charfilter
> which I used. Therefore it just used for the purpose of testing. Anyway I
> did some googling on this error and It seems that changing facet method to
> enum works in other similar cases too. I dont know the differences between
> fcs and enum methods on calculating facet behind the scene, but it seems
> that enum works better in my case.
>
> Best regards.
>
> On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson 
> wrote:
>
>> This really seems like an XY problem. _Why_ are you faceting on a
>> tokenized field?
>> What are you really trying to accomplish? Because faceting on a
>> generalized
>> content field that's an analyzed field is often A Bad Thing. Try going
>> into the
>> admin UI>> Schema Browser for that field, and you'll see how many unique
>> terms
>> you have in that field. Faceting on that many unique terms is rarely
>> useful to the
>> end user, so my suspicion is that you're not doing what you think you
>> are. Or you
>> have an unusual use-case. Either way, we need to understand what use-case
>> you're trying to support in order to respond helpfully.
>>
>> You say that using facet.enum works, this is very surprising. That method
>> uses
>> the filterCache to create a bitset for each unique term. Which is totally
>> incompatible with the uninverted field error you're reporting, so I
>> clearly don't
>> understand something about your setup. Are you _sure_?
>>
>> Best,
>> Erick
>>
>> On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian 
>> wrote:
>> > Dear Toke and Davidphilip,
>> > Hi,
>> > The fieldtype text_fa has some custom language specific normalizer and
>> > charfilter, here is the schema.xml value related for this field:
>> > > positionIncrementGap="100">
>> >   
>> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/>
>> > 
>> > 
>> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/>
>> > > > words="lang/stopwords_fa.txt" />
>> >   
>> >   
>> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/>
>> > 
>> > 
>> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/>
>> > > > words="lang/stopwords_fa.txt" />
>> >   
>> > 
>> >
>> > I did try the facet.method=enum and it works fine. Did you mean that
>> > actually applying facet on analyzed field is wrong?
>> >
>> > Best regards.
>> >
>> > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen > >
>> > wrote:
>> >
>> >> Ali Nazemian  wrote:
>> >> > I have a collection of 1.6m documents in Solr 5.2.1.
>> >> > [...]
>> >> > Caused by: java.lang.IllegalStateException: Too many values for
>> >> > UnInvertedField faceting on field content
>> >> > [...]
>> >> > > >> > default="noval" termVectors="true" termPositions="true"
>> >> > termOffsets="true"/>
>> >>
>> >> You are hitting an internal limit in Solr. As davidphilip tells you,
>> the
>> >> solution is docValues, but they cannot be enabled for text fields. You
>> need
>> >> String fields, but the name of your field suggests that you need
>> >> analyzation & tokenization, which cannot be done on String fields.
>> >>
>> >> > Would you please help me to solve this problem?
>> >>
>> >> With the information we have, it does not seem to be easy to solve: It
>> >> seems like you want to facet on all terms in your index. As they need
>> to be
>> >> String (to use docValues), you would have to do all the splitting on
>> white
>> >> space, normalization etc. outside of Solr.
>> >>
>> >> - Toke Eskildsen
>> >>
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian