DIH - LastModifiedDate - Format

2014-02-16 Thread PeriS
Hi,

I am using MySQL as the datastore and, for the last_modified_date, I use
java.util.Date. I'm seeing that DIH doesn't seem to pick up records. Is there
a date format that I should use for DIH to compare properly and pick up the
records for indexing?

Thanks
-Peri.S
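For what it's worth, DIH's delta import records its last_index_time in conf/dataimport.properties using the pattern "yyyy-MM-dd HH:mm:ss" by default, so the MySQL timestamp column compared in deltaQuery needs to be comparable in that format. A minimal sketch of producing that format from a java.util.Date (the default pattern, with no custom propertyWriter, is the assumption to verify against your own dataimport.properties):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DihDateFormat {
    // DIH's dataimport.properties convention for last_index_time
    // (assumption: the default "yyyy-MM-dd HH:mm:ss" pattern is in use)
    static String format(Date d) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(d);
    }

    public static void main(String[] args) {
        // Render "now" the way DIH writes and compares it
        System.out.println(format(new Date()));
    }
}
```

If the column in MySQL is a DATETIME, comparing against a string in this format in the deltaQuery WHERE clause generally works; a mismatch here is a common reason delta imports pick up nothing.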





Re: update in SolrCloud through C++ client

2014-02-16 Thread Ramkumar R. Aiyengar
If only availability is your concern, you can always keep a list of servers
to which your C++ clients will send requests, and round-robin amongst them.
If one of the servers goes down, you will either not be able to reach it or
will get a 5xx error in the HTTP response; you can then take it out of
circulation (and probably retry in the background, with some kind of ping
every minute or so to the down servers, to ascertain whether they have come
back, and then add them back to the list). This is what SolrJ currently does.
It doesn't technically need any ZooKeeper interaction.
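The eviction-plus-ping scheme described above can be sketched as follows (class and method names are mine, not SolrJ's, though SolrJ's LBHttpSolrServer does something similar internally):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robins over a list of server base URLs. Failed servers are moved
// to a "down" list; a background ping task can move them back once they
// answer again.
public class RoundRobinServers {
    private final List<String> alive = new CopyOnWriteArrayList<>();
    private final List<String> down = new CopyOnWriteArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    public RoundRobinServers(List<String> urls) {
        alive.addAll(urls);
    }

    // Next server to try, cycling through the live list.
    public String next() {
        if (alive.isEmpty()) {
            throw new IllegalStateException("no servers available");
        }
        int i = Math.floorMod(cursor.getAndIncrement(), alive.size());
        return alive.get(i);
    }

    // Call when a request fails (unreachable, or a 5xx response).
    public void markDown(String url) {
        if (alive.remove(url)) down.add(url);
    }

    // Call from the periodic ping task when a down server answers again.
    public void markUp(String url) {
        if (down.remove(url)) alive.add(url);
    }
}
```

The same structure translates directly to C++; the point is only that the client needs no ZooKeeper awareness for this level of availability.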

The biggest benefit that SolrJ provides (since 4.6, I think) is that
it uses ZK to find the shard leader to send an update to, which saves a hop.
You could technically do the same by retrieving and watching cluster state
with a C++ ZK client (one is available) and doing what SolrJ currently does.
That would work well; the only drawback, apart from the effort, is that
improvements are still happening in how clusters are managed and how their
state is saved in ZK. These changes might not break your code, but at
the same time you might not be able to take advantage of them without
additional effort.

An alternative approach is to link SolrJ into your C++ client using JNI.
This has the added benefit of using the Javabin format for requests which
would have some performance benefits.

In short, it comes down to what your performance requirements are. If indexing
speed and throughput are not that big a deal, just go with a list of
servers and load-balance amongst the active ones. I would suggest you try
this anyway before assuming that you need the optimization.

If not, I would probably try the JNI route, and if that fails, use a C
ZK client to read the cluster state and use that knowledge to decide
where to send requests.
On 14 Feb 2014 10:58, neerajp neeraj_star2...@yahoo.com wrote:

 Hello All,
 I am using Solr for indexing my data. My client is in C++, so I make curl
 requests to the Solr server for indexing.
 Now I want to do indexing in SolrCloud mode, using ZooKeeper for HA. I
 read the wiki page on SolrCloud (http://wiki.apache.org/solr/SolrCloud).

 What I understand from the wiki is that we should always check a Solr
 instance's status (up & running) in SolrCloud before making an update
 request. Can I not send the update request to ZooKeeper and let ZooKeeper
 forward it to the appropriate replica/leader? In the latter case I need not
 worry about which servers are up and running before making an indexing
 request.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/update-in-SolrCloud-through-C-client-tp4117340.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Luke 4.6.1 released

2014-02-16 Thread Dmitry Kan
Hello!

Luke 4.6.1 has just been released. Grab it here:

https://github.com/DmitryKey/luke/releases/tag/4.6.1

Fixes:
loading the jar from the command line now works fine.

-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan


Re: Solr Hot Cpu and high load

2014-02-16 Thread Nitin Sharma
Thanks Tri


*a. Are you docs distributed evenly across shards: number of docs and size
of the shards*
 Yes, the size of all the shards is equal (an ignorable delta on the order
of KB), and so is the # of docs.

*b. Is your test client querying all nodes, or all the queries go to those
2 busy nodes?*
 Yes, all nodes are receiving exactly the same number of queries.


I have one more question. Do stored fields have a significant impact on
the performance of Solr queries? Is having 50% of the fields stored (out of
100 fields) significantly worse than having 20% of the fields stored?
(significantly == on the order of 100s of milliseconds, assuming all fields
are of the same size and type)

How are stored fields retrieved in general: always from disk, or loaded into
memory on the first query and read from memory thereafter?

Thanks
Nitin



On Fri, Feb 14, 2014 at 11:45 AM, Tri Cao tm...@me.com wrote:

 1. Yes, that's the right way to go, well, in theory at least :)
 2. Yes, queries are always fanned out to all shards and will be as slow as
 the slowest shard. When I looked into the Solr distributed querying
 implementation a few months back, the support for graceful degradation for
 things like network failures and slow shards was not there yet.
 3. I doubt mmap settings would impact your read-only load, and it seems
 you can easily fit your index in RAM. You could try to warm the file cache
 to make sure, with cat $solr_dir > /dev/null.

 It's odd that only 2 nodes are at 100% in your set up. I would check a
 couple of things:
 a. Are you docs distributed evenly across shards: number of docs and size
 of the shards
 b. Is your test client querying all nodes, or all the queries go to those
 2 busy nodes?

 Regards,
 Tri

 On Feb 14, 2014, at 10:52 AM, Nitin Sharma nitin.sha...@bloomreach.com
 wrote:

 Hello folks

 We are currently using SolrCloud 4.3.1. We have an 8-node SolrCloud cluster
 with 32 cores, 60 GB of RAM, and SSDs. We are using ZK to manage the
 solrconfig used by our collections.

 We have many collections, and some of them are relatively very large
 compared to the others. The shards of these big collections are on the
 order of gigabytes each. We decided to split the bigger collections evenly
 across all nodes (8 shards and 2 replicas) with maxNumShards > 1.

 We did a test with a read-only load on one big collection, and we still see
 only 2 nodes running at 100% CPU while the rest are blazing through the
 queries way faster (under 30% CPU), despite all of them being sharded
 across all nodes.

 I checked the JVM usage and found that none of the pools have high
 utilization (except Survivor space, which is at 100%). The GC cycles are on
 the order of ms and mostly doing scavenge. Mark-and-sweep occurs once every
 30 minutes.

 A few questions:

 1. Sharding all collections (small and large) across all nodes evenly
 distributes the load and makes the system characteristics of all machines
 similar. Is this a recommended way to do this?
 2. SolrCloud does a distributed query by default. So if a node is at
 100% CPU, does it slow down the response time for the other nodes waiting
 for this query? (Or does it have a timeout if it cannot get a response from
 a node within x seconds?)
 3. Our collections use the Mmap directory, but I specifically haven't
 enabled anything related to mmap (locked pages under ulimit). Does that
 adversely affect performance? Or can pages be locked even without this?

 Thanks a lot in advance.
 Nitin




-- 
- N


Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-16 Thread lboutros
Thanks a lot for your answer.

Is there a web page, on the wiki for instance, where we could find JVM
settings or recommendations that we should use for Solr with particular
index configurations?

Ludovic.





-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Zookeeper-disconnection-reconnection-tp4117101p4117653.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-16 Thread Ramkumar R. Aiyengar
Start with http://wiki.apache.org/solr/SolrPerformanceProblems. It has a
section on GC tuning and a link to some example settings.
On 16 Feb 2014 21:19, lboutros boutr...@gmail.com wrote:




Re: Luke 4.6.1 released

2014-02-16 Thread Alexandre Rafalovitch
Does it work with Solr? I couldn't tell from the repo description what its
Solr relevance is.

I am sure all the long-timers know, but for more recent Solr people
the additional information would be useful.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- "Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working." (Anonymous, via the GTD
book)


On Mon, Feb 17, 2014 at 3:02 AM, Dmitry Kan solrexp...@gmail.com wrote:


Re: Luke 4.6.1 released

2014-02-16 Thread Bill Bell
Yes it works with Solr 

Bill Bell
Sent from mobile


 On Feb 16, 2014, at 3:38 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
 


Re: Solr Hot Cpu and high load

2014-02-16 Thread Erick Erickson
Stored fields are what the Solr DocumentCache in solrconfig.xml
is all about.

My general feeling is that stored fields are mostly irrelevant to
search speed, especially if lazy loading is enabled. The only time
stored fields come into play is when assembling the final result
list, i.e. the 10 or 20 documents that you return. That does imply
disk I/O, and if you have massive fields there's also decompression
to add to the CPU load.

So, as usual, it depends. Try measuring: restrict the returned
fields to whatever your uniqueKey field is for one set of tests, then
try returning _everything_ for another.

Best,
Erick


On Sun, Feb 16, 2014 at 12:18 PM, Nitin Sharma
nitin.sha...@bloomreach.com wrote:



Solr index filenames don't match the Solr version

2014-02-16 Thread Nguyen Manh Tien
Hello,

I recently upgraded from Solr 4.0 to Solr 4.6.
I checked the Solr index folder and found these files:

_aars_Lucene41_0.doc
_aars_Lucene41_0.pos
_aars_Lucene41_0.tim
_aars_Lucene41_0.tip

I don't know why they don't have Lucene46 in the file names.

Is there something wrong?

Thanks,
Tien


query parameters

2014-02-16 Thread Andreas Owen

In the solrconfig of my Solr 4.3 I have a user-defined requestHandler. I
would like to use fq to force the following conditions:
   1: organisations is empty and roles is empty
   2: organisations contains one of the comma-delimited list in variable $org
   3: roles contains one of the comma-delimited list in variable $r
   4: rules 2 and 3 combined

Snippet of what I have (I haven't checked whether there is an "in" operator,
like in SQL, for the list values):

<lst name="defaults">
  <str name="echoParams">explicit</str>
  <int name="rows">10</int>
  <str name="defType">edismax</str>
  <str name="synonyms">true</str>
  <str name="qf">plain_text^10 editorschoice^200
    title^20 h_*^14
    tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
    contentmanager^5 links^5
    last_modified^5 url^5
  </str>
  <str name="fq">(organisations='' roles='') or (organisations=$org roles=$r)
    or (organisations='' roles=$r) or (organisations=$org roles='')</str>
  <!-- tested: now or newer or empty gets a small boost -->
  <str name="bq">(expiration:[NOW TO *] OR (*:* -expiration:*))^6</str>
  <!-- tested -->
  <str name="bf">div(clicks,max(displays,1))^8</str>
</lst>
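As an aside, Lucene/Solr query syntax has no "=" or SQL-style "in" operator: emptiness is usually expressed by negating an open-ended range query (anchored on *:*), and "contains one of" by a field query over a list of values. A hedged sketch of what such an fq might look like, assuming $org and $r expand to space-separated value lists (the clause shapes are mine and untested against this schema):

```xml
<!-- rule 1: both fields empty; rules 2-4: a non-empty field matches its list -->
<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *])
    OR (organisations:($org) AND roles:($r))
    OR ((*:* -organisations:[* TO *]) AND roles:($r))
    OR (organisations:($org) AND (*:* -roles:[* TO *]))</str>
```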






Increasing number of SolrIndexSearcher (Leakage)?

2014-02-16 Thread Nguyen Manh Tien
Hello,

My Solr hit an OOM recently after I upgraded from Solr 4.0 to 4.6.1.
I checked a heap dump and found that it has many SolrIndexSearcher (SIS)
objects (24); I expected only 1 SIS because we have 1 core.

I did some experiments:
- Right after starting Solr, there is only 1 SolrIndexSearcher.
- But after I index some docs and run a softCommit, or a hardCommit with
openSearcher=false, the number of SolrIndexSearchers increases by 1.
- On a hard commit with openSearcher=true, the number of SolrIndexSearchers
(SIS) doesn't increase, but I found in the log that it opens a new searcher;
I guess the old SIS is closed.

I don't know why the number of SIS increases like this and finally causes
an OutOfMemory. Can SolrIndexSearcher leak?

Regards,
Tien
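For reference, the commit behavior described above is controlled by the updateHandler section of solrconfig.xml; a typical shape looks like this (the values are illustrative, not a recommendation):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>
    <!-- flush to disk without opening a new searcher -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commits always open a new searcher for visibility -->
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```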


Re: Solr index filenames don't match the Solr version

2014-02-16 Thread Tri Cao
Lucene's main file formats actually don't change a lot in 4.x (or even 5.x),
and the newer codecs just delegate to previous versions for most file types.
The newer file types don't typically include Lucene's version in the file
names.

For example, the Lucene 4.6 codec basically delegates the stored fields and
term vector file formats to 4.1, the doc format to 4.0, etc., and only
implements the new segment info/field infos formats (the .si and .fnm files).

https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/lucene/core/src/java/org/apache/lucene/codecs/lucene46/Lucene46Codec.java#L50

Hope this helps,
Tri

On Feb 16, 2014, at 08:52 PM, Shawn Heisey s...@elyograg.org wrote:

 On 2/16/2014 7:25 PM, Nguyen Manh Tien wrote:
  I recently upgraded from Solr 4.0 to Solr 4.6.
  I checked the Solr index folder and found these files:
  _aars_Lucene41_0.doc
  _aars_Lucene41_0.pos
  _aars_Lucene41_0.tim
  _aars_Lucene41_0.tip
  I don't know why they don't have Lucene46 in the file names.

 This is an indication that this part of the index is using a file format
 introduced in Lucene 4.1. Here's what I have for one of my index segments
 on a Solr 4.6.1 server:

 _5s7_2h.del
 _5s7.fdt
 _5s7.fdx
 _5s7.fnm
 _5s7_Lucene41_0.doc
 _5s7_Lucene41_0.pos
 _5s7_Lucene41_0.tim
 _5s7_Lucene41_0.tip
 _5s7_Lucene45_0.dvd
 _5s7_Lucene45_0.dvm
 _5s7.nvd
 _5s7.nvm
 _5s7.si
 _5s7.tvd
 _5s7.tvx

 It shows the same pieces as your list, but I am also using docValues in my
 index, and those files indicate that they are using the format from Lucene
 4.5. I'm not sure why there are not version numbers in *all* of the file
 extensions -- that happens in the Lucene layer, which is a bit of a mystery
 to me.

 Thanks,
 Shawn

Re: Increasing number of SolrIndexSearcher (Leakage)?

2014-02-16 Thread Shawn Heisey
On 2/16/2014 11:34 PM, Nguyen Manh Tien wrote:
 My Solr hit an OOM recently after I upgraded from Solr 4.0 to 4.6.1.
 I checked a heap dump and found that it has many SolrIndexSearcher (SIS)
 objects (24); I expected only 1 SIS because we have 1 core.
 
 I did some experiments:
 - Right after starting Solr, there is only 1 SolrIndexSearcher.
 - But after I index some docs and run a softCommit, or a hardCommit with
 openSearcher=false, the number of SolrIndexSearchers increases by 1.
 - On a hard commit with openSearcher=true, the number of SolrIndexSearchers
 (SIS) doesn't increase, but I found in the log that it opens a new searcher;
 I guess the old SIS is closed.
 
 I don't know why the number of SIS increases like this and finally causes
 an OutOfMemory. Can SolrIndexSearcher leak?

It's always possible that you've hit a bug that results in a memory
leak, but it is not likely.  I'm running version 4.6.1 in production
without any problems.  A lot of other people are doing so as well.  I
suspect that there's a misconfiguration, a buggy JVM, or something else
that's out of the ordinary.

We'll need answers to a bunch of questions: What filesystem and
operating system are you running on?  What vendor and version is your
JVM?  Can you use a file sharing site or a paste website to share your
full solrconfig.xml file?  What servlet container are you using to run
Solr?  Depending on what we learn from these answers, more questions
might be coming.

Are there any messages at WARN or ERROR in your Solr logfile?  Note that
I am not referring to the logging tab in the admin UI here - you'll need
to look at the actual logfile.

Thanks,
Shawn