Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
In case this helps someone, here is a solution (it is probably efficient enough
already, but I didn't profile it); it can deal with DocValues and
with FieldCache (the old 'stored' values).



private void unInvertedTheDamnThing(
    SolrIndexSearcher searcher,
    List<String> fields,
    KVSetter setter) throws IOException {

  LeafReader reader = searcher.getLeafReader();
  IndexSchema schema = searcher.getCore().getLatestSchema();
  List<LeafReaderContext> leaves = reader.getContext().leaves();

  Bits liveDocs;
  LeafReader lr;
  Transformer transformer;
  for (LeafReaderContext leave : leaves) {
    int docBase = leave.docBase;
    liveDocs = leave.reader().getLiveDocs();
    lr = leave.reader();
    FieldInfos fInfo = lr.getFieldInfos();

    for (String field : fields) {

      FieldInfo fi = fInfo.fieldInfo(field);
      SchemaField fSchema = schema.getField(field);
      DocValuesType fType = fi.getDocValuesType();
      Map<String, Type> mapping = new HashMap<>();
      final LeafReader unReader;

      if (fType.equals(DocValuesType.NONE)) {
        // no docvalues stored; pick an un-inverting strategy based on the schema field type
        Class<? extends FieldType> c = fSchema.getType().getClass();
        if (c.isAssignableFrom(TextField.class)
            || c.isAssignableFrom(StrField.class)) {
          if (fSchema.multiValued()) {
            mapping.put(field, Type.SORTED);
          } else {
            mapping.put(field, Type.BINARY);
          }
        } else if (c.isAssignableFrom(TrieIntField.class)) {
          if (fSchema.multiValued()) {
            mapping.put(field, Type.SORTED_SET_INTEGER);
          } else {
            mapping.put(field, Type.INTEGER_POINT);
          }
        } else {
          continue; // a field type we don't know how to un-invert
        }
        unReader = new UninvertingReader(lr, mapping);
      } else {
        unReader = lr;
      }

      // ask the (possibly un-inverting) reader which docvalues type it exposes now
      fType = unReader.getFieldInfos().fieldInfo(field).getDocValuesType();

      switch (fType) {
        case NUMERIC:
          transformer = new Transformer() {
            NumericDocValues dv = unReader.getNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              int v = (int) dv.get(docId);
              setter.set(docBase, docId, v);
            }
          };
          break;
        case SORTED_NUMERIC:
          transformer = new Transformer() {
            SortedNumericDocValues dv = unReader.getSortedNumericDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              dv.setDocument(docId);
              int max = dv.count();
              for (int i = 0; i < max; i++) {
                setter.set(docBase, docId, (int) dv.valueAt(i));
              }
            }
          };
          break;
        case SORTED_SET:
          transformer = new Transformer() {
            SortedSetDocValues dv = unReader.getSortedSetDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              dv.setDocument(docId);
              for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS;
                  ord = dv.nextOrd()) {
                final BytesRef value = dv.lookupOrd(ord);
                setter.set(docBase, docId, value.utf8ToString());
              }
            }
          };
          break;
        case SORTED:
          transformer = new Transformer() {
            SortedDocValues dv = unReader.getSortedDocValues(field);
            @Override
            public void process(int docBase, int docId) {
              BytesRef v = dv.get(docId);
              if (v.length == 0)
                return;
              setter.set(docBase, docId, v.utf8ToString());
            }
          };
          break;
        default:
          throw new IllegalArgumentException("The field " + field
              + " is of a type that cannot be un-inverted");
      }

      int i = 0;
      while (i < lr.maxDoc()) {
        // skip deleted documents
        if (liveDocs != null && !(i < liveDocs.length() && liveDocs.get(i))) {
          i++;
          continue;
        }
        transformer.process(docBase, i);
        i++;
      }
    }
  }
}
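
The snippet refers to two helper types that are not part of Lucene or Solr and
were not included in the post; a minimal sketch of what they could look like,
with signatures inferred from how they are called above:

// Not from the original post; inferred from the usage in the method above.
interface KVSetter {
  // receives the segment's docBase, the segment-local docId and one un-inverted value
  void set(int docBase, int docId, Object value);
}

interface Transformer {
  // called once per live document in a segment
  void process(int docBase, int docId);
}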

On Wed, Aug 17, 2016 at 1:22 PM, Roman Chyla  wrote:
> Joel, thanks, but which of them? I've counted at least 4, if not more,
> different ways of how to get DocValues. Are there many functionally
> equal approaches just because devs can't agree on using one api? Or is
> there a deeper reason?
>
> Btw, the FieldCache is still there - both in lucene (to be deprecated)
> and in solr; but became package accessible only
>
> This is what removed the FieldCache:
> https://issues.apache.org/jira/browse/LUCENE-5666
> This is what followed: https://issues.apache.org/jira/browse/SOLR-8096
>
> And there is still code which un-inverts data from an index if no
> doc-values are available.
>
> --roman
>
> On Tue, Aug 16, 2016 at 9:54 PM, Joel Bernstein 

Re: Using Solr invariants to set facet method?

2016-08-17 Thread ruby
Thanks for your reply. I was not seeing the param being added in the returned
results, but after adding echoParams=true I see that the facet method is being
added.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Solr-invariants-to-set-facet-method-tp4292142p4292149.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-17 Thread Erick Erickson
From my testing program, there's nothing standard here.

As the blog points out, since I was indexing fairly
simple documents you should _not_ be expecting to
see those indexing rates. The point of the article was
just to show the _relative_ changes when I sent
batches.

Best,
Erick

On Wed, Aug 17, 2016 at 1:59 PM, Jaspal Sawhney  wrote:
> Erick
> Going through the article which you shared. Where are you getting the
> Docs/second value?
> Thanks
>
> On 8/17/16, 4:37 PM, "Jaspal Sawhney"  wrote:
>
>>Erick
>>Thanks - My batch size was 30 and thread size also 30.
>>Thanks
>>
>>On 8/17/16, 3:48 PM, "Erick Erickson"  wrote:
>>
>>>What this probably indicates is that the size of the packets you send
>>>to Solr is large enough that it exceeds the transport protocol's
>>>limit. This is reinforced by your statement that reducing the batch
>>>size fixes the problem even though it increases indexing time.
>>>
>>>So the place I'd be looking is the jetty configurations for any limits
>>>there.
>>>
>>>That said, what is your batch size? In my testing I pretty quickly get
>>>into diminishing returns, here's a writeup from some time ago:
>>>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>>>
>>>Best,
>>>Erick
>>>
>>>On Wed, Aug 17, 2016 at 12:03 PM, Jaspal Sawhney 
>>>wrote:
 Bump !

 On 8/16/16, 10:53 PM, "Jaspal Sawhney"  wrote:

>Hello
>We are running solr 4.6 in master-slave configuration where in our
>master
>is used entirely for indexing. No search traffic comes to master ever.
>Off late we have started to get the early EOF error on the solr Master
>which results in a Broken Pipe error on the commerce application from
>where Indexing was kicked off from.
>
>Things to mention
>
>  1.  We have a couple of sites ­ each of which has the same document
>size but diff document count.
>  2.  This error is being observed in the site which has the most
>number
>of document count I.e. 2204743
>  3.  The way I have understood solr to work is that irrespective of
>number of document ­ the throughput is controlled by the ŒNumber of
>Threads¹ and ŒBatch size¹ - Am I correct?
> *   In our case we have not touched the batch size and Number of
>Threads when this error started coming
> *   However when I do touch these parameters (specifically reduce
>them) the error does not come ­ however indexing time increases a lot.
>  4.  We have to index overnight daily because we put product prices in
>the Index which get updated nightly
>  5.  Solr master is running with a 20 GB Heap
>
>What we have tried
>
>  1.  I disabled autoCommit (I.e. Hard commit) and put the
>autoSoftCommit
>as 5 mins
> *   I realized afterwards that this was a wrong test because my
>understanding of soft commit was incorrect, My understanding now is
>that
>hard commit just truncate the Tlog do hardCommit should be better
>indexing performance.
> *   This test failed for lack of space reason however because
>disable autoCommit did not make sense ­ I did not retry this test yet.
>  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
> *   This test did not yield anything favorable ­ the master gave
>the
>early EOF exception
>  3.  Increased the merge factor from 20 ‹> 100
> *   This test did not yield anything favorable ­ the master gave
>the
>early EOF exception
>  4.  Flipped the autoCommit to 15 secs and disabled auto commit
> *   This test did not yield anything favorable ­ the master gave
>the
>early EOF exception
> *   I got the input for this from
>https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-s
>o
>ft
>commit-and-commit-in-sorlcloud/ - Heavy (Bulk) Indexing section
>  5.  Tried to bypass transaction log all together ­ This test is
>underway currently
>
>Questions
>
>  1.  Since we are not using solrCloud ­ I want to understand the
>impact
>of bypassing transaction log
>  2.  How does solr take documents which are sent to it to storage as
>in
>what is the journey of a document from segment to tlog to storage.
>
>It would be great If there are any pointers which you can share.
>
>Thanks
>J./
>
>The actual Error Log
>ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException;
>org.apache.solr.common.SolrException: early EOF
>at
>org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
>at
>org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandle
>r
>.j
>ava:92)
>at
>org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Cont
>e
>nt

Re: Using Solr invariants to set facet method?

2016-08-17 Thread Erick Erickson
Setting the facet method to enum will have
consequences for the filterCache, especially
if you allow faceting on high-cardinality fields
so for that specific example I'd be cautious.

Best,
Erick

On Wed, Aug 17, 2016 at 3:01 PM, Alexandre Rafalovitch
 wrote:
> That's what it is there for. Are you seeing any issues?
>
> You can confirm whether it works or not by adding echoParams=all to
> the query (or in the defaults/invariants).
>
> Regards,
>Alex
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 18 August 2016 at 07:43, ruby  wrote:
>> Is it possible to use the invariants in Solr config to set facet.method to
>> override what user is sending?
>>
>> <lst name="invariants">
>>   <str name="facet.method">enum</str>
>> </lst>
>>
>>
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Using-Solr-invariants-to-set-facet-method-tp4292142.html
>> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr invariants to set facet method?

2016-08-17 Thread Alexandre Rafalovitch
That's what it is there for. Are you seeing any issues?

You can confirm whether it works or not by adding echoParams=all to
the query (or in the defaults/invariants).

Regards,
   Alex

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 18 August 2016 at 07:43, ruby  wrote:
> Is it possible to use the invariants in Solr config to set facet.method to
> override what user is sending?
>
> <lst name="invariants">
>   <str name="facet.method">enum</str>
> </lst>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Using-Solr-invariants-to-set-facet-method-tp4292142.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Using Solr invariants to set facet method?

2016-08-17 Thread ruby
Is it possible to use the invariants in Solr config to set facet.method to
override what user is sending?


<lst name="invariants">
  <str name="facet.method">enum</str>
</lst>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Solr-invariants-to-set-facet-method-tp4292142.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-17 Thread Jaspal Sawhney
Erick
Going through the article which you shared. Where are you getting the
Docs/second value?
Thanks

On 8/17/16, 4:37 PM, "Jaspal Sawhney"  wrote:

>Erick
>Thanks - My batch size was 30 and thread size also 30.
>Thanks
>
>On 8/17/16, 3:48 PM, "Erick Erickson"  wrote:
>
>>What this probably indicates is that the size of the packets you send
>>to Solr is large enough that it exceeds the transport protocol's
>>limit. This is reinforced by your statement that reducing the batch
>>size fixes the problem even though it increases indexing time.
>>
>>So the place I'd be looking is the jetty configurations for any limits
>>there.
>>
>>That said, what is your batch size? In my testing I pretty quickly get
>>into diminishing returns, here's a writeup from some time ago:
>>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>>
>>Best,
>>Erick
>>
>>On Wed, Aug 17, 2016 at 12:03 PM, Jaspal Sawhney 
>>wrote:
>>> Bump !
>>>
>>> On 8/16/16, 10:53 PM, "Jaspal Sawhney"  wrote:
>>>
Hello
We are running solr 4.6 in master-slave configuration where in our
master
is used entirely for indexing. No search traffic comes to master ever.
Off late we have started to get the early EOF error on the solr Master
which results in a Broken Pipe error on the commerce application from
where Indexing was kicked off from.

Things to mention

  1.  We have a couple of sites ­ each of which has the same document
size but diff document count.
  2.  This error is being observed in the site which has the most
number
of document count I.e. 2204743
  3.  The way I have understood solr to work is that irrespective of
number of document ­ the throughput is controlled by the ŒNumber of
Threads¹ and ŒBatch size¹ - Am I correct?
 *   In our case we have not touched the batch size and Number of
Threads when this error started coming
 *   However when I do touch these parameters (specifically reduce
them) the error does not come ­ however indexing time increases a lot.
  4.  We have to index overnight daily because we put product prices in
the Index which get updated nightly
  5.  Solr master is running with a 20 GB Heap

What we have tried

  1.  I disabled autoCommit (I.e. Hard commit) and put the
autoSoftCommit
as 5 mins
 *   I realized afterwards that this was a wrong test because my
understanding of soft commit was incorrect, My understanding now is
that
hard commit just truncate the Tlog do hardCommit should be better
indexing performance.
 *   This test failed for lack of space reason however because
disable autoCommit did not make sense ­ I did not retry this test yet.
  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
 *   This test did not yield anything favorable ­ the master gave
the
early EOF exception
  3.  Increased the merge factor from 20 ‹> 100
 *   This test did not yield anything favorable ­ the master gave
the
early EOF exception
  4.  Flipped the autoCommit to 15 secs and disabled auto commit
 *   This test did not yield anything favorable ­ the master gave
the
early EOF exception
 *   I got the input for this from
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-s
o
ft
commit-and-commit-in-sorlcloud/ - Heavy (Bulk) Indexing section
  5.  Tried to bypass transaction log all together ­ This test is
underway currently

Questions

  1.  Since we are not using solrCloud ­ I want to understand the
impact
of bypassing transaction log
  2.  How does solr take documents which are sent to it to storage as
in
what is the journey of a document from segment to tlog to storage.

It would be great If there are any pointers which you can share.

Thanks
J./

The actual Error Log
ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException;
org.apache.solr.common.SolrException: early EOF
at
org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandle
r
.j
ava:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Cont
e
nt
StreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandler
B
as
e.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.j
a
va
:721)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.
j
av
a:417)
at

Re: Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-17 Thread Jaspal Sawhney
Erick
Thanks - My batch size was 30 and thread size also 30.
Thanks

On 8/17/16, 3:48 PM, "Erick Erickson"  wrote:

>What this probably indicates is that the size of the packets you send
>to Solr is large enough that it exceeds the transport protocol's
>limit. This is reinforced by your statement that reducing the batch
>size fixes the problem even though it increases indexing time.
>
>So the place I'd be looking is the jetty configurations for any limits
>there.
>
>That said, what is your batch size? In my testing I pretty quickly get
>into diminishing returns, here's a writeup from some time ago:
>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>
>Best,
>Erick
>
>On Wed, Aug 17, 2016 at 12:03 PM, Jaspal Sawhney 
>wrote:
>> Bump !
>>
>> On 8/16/16, 10:53 PM, "Jaspal Sawhney"  wrote:
>>
>>>Hello
>>>We are running solr 4.6 in master-slave configuration where in our
>>>master
>>>is used entirely for indexing. No search traffic comes to master ever.
>>>Off late we have started to get the early EOF error on the solr Master
>>>which results in a Broken Pipe error on the commerce application from
>>>where Indexing was kicked off from.
>>>
>>>Things to mention
>>>
>>>  1.  We have a couple of sites ­ each of which has the same document
>>>size but diff document count.
>>>  2.  This error is being observed in the site which has the most number
>>>of document count I.e. 2204743
>>>  3.  The way I have understood solr to work is that irrespective of
>>>number of document ­ the throughput is controlled by the ŒNumber of
>>>Threads¹ and ŒBatch size¹ - Am I correct?
>>> *   In our case we have not touched the batch size and Number of
>>>Threads when this error started coming
>>> *   However when I do touch these parameters (specifically reduce
>>>them) the error does not come ­ however indexing time increases a lot.
>>>  4.  We have to index overnight daily because we put product prices in
>>>the Index which get updated nightly
>>>  5.  Solr master is running with a 20 GB Heap
>>>
>>>What we have tried
>>>
>>>  1.  I disabled autoCommit (I.e. Hard commit) and put the
>>>autoSoftCommit
>>>as 5 mins
>>> *   I realized afterwards that this was a wrong test because my
>>>understanding of soft commit was incorrect, My understanding now is that
>>>hard commit just truncate the Tlog do hardCommit should be better
>>>indexing performance.
>>> *   This test failed for lack of space reason however because
>>>disable autoCommit did not make sense ­ I did not retry this test yet.
>>>  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
>>> *   This test did not yield anything favorable ­ the master gave
>>>the
>>>early EOF exception
>>>  3.  Increased the merge factor from 20 ‹> 100
>>> *   This test did not yield anything favorable ­ the master gave
>>>the
>>>early EOF exception
>>>  4.  Flipped the autoCommit to 15 secs and disabled auto commit
>>> *   This test did not yield anything favorable ­ the master gave
>>>the
>>>early EOF exception
>>> *   I got the input for this from
>>>https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-so
>>>ft
>>>commit-and-commit-in-sorlcloud/ - Heavy (Bulk) Indexing section
>>>  5.  Tried to bypass transaction log all together ­ This test is
>>>underway currently
>>>
>>>Questions
>>>
>>>  1.  Since we are not using solrCloud ­ I want to understand the impact
>>>of bypassing transaction log
>>>  2.  How does solr take documents which are sent to it to storage as in
>>>what is the journey of a document from segment to tlog to storage.
>>>
>>>It would be great If there are any pointers which you can share.
>>>
>>>Thanks
>>>J./
>>>
>>>The actual Error Log
>>>ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException;
>>>org.apache.solr.common.SolrException: early EOF
>>>at
>>>org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
>>>at
>>>org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler
>>>.j
>>>ava:92)
>>>at
>>>org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conte
>>>nt
>>>StreamHandlerBase.java:74)
>>>at
>>>org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerB
>>>as
>>>e.java:135)
>>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>>>at
>>>org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.ja
>>>va
>>>:721)
>>>at
>>>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
>>>av
>>>a:417)
>>>at
>>>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.j
>>>av
>>>a:201)
>>>at
>>>org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHan
>>>dl
>>>er.java:1419)
>>>at
>>>org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:45
>>>5)
>>>at

Unit testing HttpPost With an Embedded Solr Server

2016-08-17 Thread Jennifer Coston


Hello,

I have written a data service that sends an HttpPost command to post JSON to
Solr. The code is working, but now I want to switch to using an embedded
Solr server for just the unit tests. The problem is that the embedded Solr
server doesn't seem to start a server on a port, so I'm at a loss on how to
test this. I guess I have two questions. (1) How do I unit test my post
command with an embedded Solr server? (2) If it isn't possible to use the
embedded Solr server, I believe I read somewhere that Solr uses a Jetty
server. Is it possible to convert an embedded Jetty server (with a port I can
access) to a Solr server?

Here is the class I am trying to test:

public class SolrDataServiceClient {

private String urlString;
private HttpClient httpClient;
private final Logger LOGGER = LoggerFactory.getLogger
(SolrDataServiceClient.class);

/**
 * Constructor for connecting to the Solr Server
 * @param solrCore
 * @param serverName
 * @param portNumber
 */
public SolrDataServiceClient(String solrCore, String serverName,
String portNumber){
LOGGER.info("Initializing new Http Client to Connect To Solr");
urlString = serverName + ":" + portNumber + "/solr/" + solrCore
;

if(httpClient == null){
httpClient = new HttpClient();
}
}

/**
* Post the provided JSON to Solr
*/
public CloseableHttpResponse postJSON(String jsonToAdd) {
CloseableHttpResponse response = null;
try {
CloseableHttpClient client = 
HttpClients.createDefault();
HttpPost httpPost = new HttpPost(urlString +
"/update/json/docs");
HttpEntity entity = new ByteArrayEntity(jsonToAdd
.getBytes("UTF-8"));
httpPost.setEntity(entity);
httpPost.setHeader("Content-type", "application/json");
LOGGER.debug("httpPost = " + httpPost.toString());
response = client.execute(httpPost);
String result = EntityUtils.toString(response.getEntity
());
LOGGER.debug("result = " + result);
client.close();
} catch (IOException e) {
LOGGER.error("IOException", e);
}

return response;
}


Here is my JUnit test:

public class SolrDataServiceClientTest {

private static EmbeddedSolrServer embeddedServer;
private static SolrDataServiceClient solrDataServiceClient;

@BeforeClass
public static void setUpBeforeClass() throws Exception {
System.setProperty("solr.solr.home", "solr/conf");
System.setProperty("solr.data.dir", new File(
"target/solr-embedded-data").getAbsolutePath());
CoreContainer coreContainer = new CoreContainer("solr/conf");
coreContainer.load();

CoreDescriptor cd = new CoreDescriptor(coreContainer, "myCoreName",
new File("solr").getAbsolutePath());
coreContainer.create(cd);

embeddedServer = new EmbeddedSolrServer(coreContainer, "myCoreName");

solrDataServiceClient = new SolrDataServiceClient("myCoreName",
"http://localhost;, "8983"); //I'm not sure what should go here
}

@Test
public void testPostJson() {
 String testJson = " { " +
 "\"observationId\": \"12345c\"," +
 "\"observationType\": \"image\"," +
"\"locationLat\": 38.9215," +
 "\"locationLon\": -77.235" +
"}";
 CloseableHttpResponse response = solrDataServiceClient.postJSON(
testJson);
 assertEquals(response.getStatusLine().getStatusCode(), 200);
 }

Thank you!

Jennifer
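
For what it is worth, one way to get a real HTTP port for this kind of test is
the JettySolrRunner class from the solr-test-framework artifact, which starts
Solr inside Jetty on a port you choose. The sketch below assumes that
dependency and a solr home under src/test/resources; constructor arguments
differ between Solr versions, so check the signature in the version you use:

// requires the org.apache.solr:solr-test-framework test dependency
import org.apache.solr.client.solrj.embedded.JettySolrRunner;

private static JettySolrRunner jetty;

@BeforeClass
public static void setUpBeforeClass() throws Exception {
    // solr home containing solr.xml and the core's conf/ directory
    jetty = new JettySolrRunner("src/test/resources/solr", "/solr", 8983);
    jetty.start();
    solrDataServiceClient = new SolrDataServiceClient("myCoreName",
            "http://localhost", "8983");
}

@AfterClass
public static void tearDownAfterClass() throws Exception {
    jetty.stop();
}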

Re: Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-17 Thread Erick Erickson
What this probably indicates is that the size of the packets you send
to Solr is large enough that it exceeds the transport protocol's
limit. This is reinforced by your statement that reducing the batch
size fixes the problem even though it increases indexing time.

So the place I'd be looking is the jetty configurations for any limits there.

That said, what is your batch size? In my testing I pretty quickly get
into diminishing returns, here's a writeup from some time ago:
https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
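
(For illustration only, not part of the original reply: a minimal SolrJ sketch
of sending documents in batches. Class names are from SolrJ 5.x/6.x, and the
URL, field names and batch size of 1000 are assumptions to show the shape of
the loop.)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
List<SolrInputDocument> batch = new ArrayList<>();
for (int i = 0; i < 100000; i++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", Integer.toString(i));
    doc.addField("title", "document " + i);
    batch.add(doc);
    if (batch.size() == 1000) {     // the batch size is the knob to experiment with
        client.add(batch);          // one HTTP request per 1000 documents
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    client.add(batch);
}
client.commit();
client.close();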

Best,
Erick

On Wed, Aug 17, 2016 at 12:03 PM, Jaspal Sawhney  wrote:
> Bump !
>
> On 8/16/16, 10:53 PM, "Jaspal Sawhney"  wrote:
>
>>Hello
>>We are running solr 4.6 in master-slave configuration where in our master
>>is used entirely for indexing. No search traffic comes to master ever.
>>Off late we have started to get the early EOF error on the solr Master
>>which results in a Broken Pipe error on the commerce application from
>>where Indexing was kicked off from.
>>
>>Things to mention
>>
>>  1.  We have a couple of sites ­ each of which has the same document
>>size but diff document count.
>>  2.  This error is being observed in the site which has the most number
>>of document count I.e. 2204743
>>  3.  The way I have understood solr to work is that irrespective of
>>number of document ­ the throughput is controlled by the ŒNumber of
>>Threads¹ and ŒBatch size¹ - Am I correct?
>> *   In our case we have not touched the batch size and Number of
>>Threads when this error started coming
>> *   However when I do touch these parameters (specifically reduce
>>them) the error does not come ­ however indexing time increases a lot.
>>  4.  We have to index overnight daily because we put product prices in
>>the Index which get updated nightly
>>  5.  Solr master is running with a 20 GB Heap
>>
>>What we have tried
>>
>>  1.  I disabled autoCommit (I.e. Hard commit) and put the autoSoftCommit
>>as 5 mins
>> *   I realized afterwards that this was a wrong test because my
>>understanding of soft commit was incorrect, My understanding now is that
>>hard commit just truncate the Tlog do hardCommit should be better
>>indexing performance.
>> *   This test failed for lack of space reason however because
>>disable autoCommit did not make sense ­ I did not retry this test yet.
>>  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
>> *   This test did not yield anything favorable ­ the master gave the
>>early EOF exception
>>  3.  Increased the merge factor from 20 ‹> 100
>> *   This test did not yield anything favorable ­ the master gave the
>>early EOF exception
>>  4.  Flipped the autoCommit to 15 secs and disabled auto commit
>> *   This test did not yield anything favorable ­ the master gave the
>>early EOF exception
>> *   I got the input for this from
>>https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-soft
>>commit-and-commit-in-sorlcloud/ - Heavy (Bulk) Indexing section
>>  5.  Tried to bypass transaction log all together ­ This test is
>>underway currently
>>
>>Questions
>>
>>  1.  Since we are not using solrCloud ­ I want to understand the impact
>>of bypassing transaction log
>>  2.  How does solr take documents which are sent to it to storage as in
>>what is the journey of a document from segment to tlog to storage.
>>
>>It would be great If there are any pointers which you can share.
>>
>>Thanks
>>J./
>>
>>The actual Error Log
>>ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException;
>>org.apache.solr.common.SolrException: early EOF
>>at
>>org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
>>at
>>org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.j
>>ava:92)
>>at
>>org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Content
>>StreamHandlerBase.java:74)
>>at
>>org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
>>e.java:135)
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>>at
>>org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java
>>:721)
>>at
>>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>>a:417)
>>at
>>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>>a:201)
>>at
>>org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandl
>>er.java:1419)
>>at
>>org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>>at
>>org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
>>37)
>>at
>>org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557
>>)
>>at
>>org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.ja
>>va:231)
>>at
>>org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.ja

Re: Error During Indexing - org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: early EOF

2016-08-17 Thread Jaspal Sawhney
Bump !

On 8/16/16, 10:53 PM, "Jaspal Sawhney"  wrote:

>Hello
>We are running solr 4.6 in master-slave configuration where in our master
>is used entirely for indexing. No search traffic comes to master ever.
>Off late we have started to get the early EOF error on the solr Master
>which results in a Broken Pipe error on the commerce application from
>where Indexing was kicked off from.
>
>Things to mention
>
>  1.  We have a couple of sites - each of which has the same document
>size but diff document count.
>  2.  This error is being observed in the site which has the most number
>of document count i.e. 2204743
>  3.  The way I have understood solr to work is that irrespective of
>number of documents - the throughput is controlled by the 'Number of
>Threads' and 'Batch size' - Am I correct?
> *   In our case we have not touched the batch size and Number of
>Threads when this error started coming
> *   However when I do touch these parameters (specifically reduce
>them) the error does not come - however indexing time increases a lot.
>  4.  We have to index overnight daily because we put product prices in
>the Index which get updated nightly
>  5.  Solr master is running with a 20 GB Heap
>
>What we have tried
>
>  1.  I disabled autoCommit (i.e. Hard commit) and put the autoSoftCommit
>as 5 mins
> *   I realized afterwards that this was a wrong test because my
>understanding of soft commit was incorrect, My understanding now is that
>hard commit just truncates the Tlog so hardCommit should be better for
>indexing performance.
> *   This test failed for lack of space reason however because
>disabling autoCommit did not make sense - I did not retry this test yet.
>  2.  Increased the RAMBufferSizeMB from 100MB to 1000MB
> *   This test did not yield anything favorable - the master gave the
>early EOF exception
>  3.  Increased the merge factor from 20 -> 100
> *   This test did not yield anything favorable - the master gave the
>early EOF exception
>  4.  Flipped the autoCommit to 15 secs and disabled auto commit
> *   This test did not yield anything favorable - the master gave the
>early EOF exception
> *   I got the input for this from
>https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-soft
>commit-and-commit-in-sorlcloud/ - Heavy (Bulk) Indexing section
>  5.  Tried to bypass transaction log all together - This test is
>underway currently
>
>Questions
>
>  1.  Since we are not using solrCloud - I want to understand the impact
>of bypassing transaction log
>  2.  How does solr take documents which are sent to it to storage as in
>what is the journey of a document from segment to tlog to storage.
>
>It would be great If there are any pointers which you can share.
>
>Thanks
>J./
>
>The actual Error Log
>ERROR - 2016-08-16 22:59:55.988; org.apache.solr.common.SolrException;
>org.apache.solr.common.SolrException: early EOF
>at 
>org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:176)
>at 
>org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.j
>ava:92)
>at 
>org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Content
>StreamHandlerBase.java:74)
>at 
>org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas
>e.java:135)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>at 
>org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java
>:721)
>at 
>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>a:417)
>at 
>org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.jav
>a:201)
>at 
>org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandl
>er.java:1419)
>at 
>org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>at 
>org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
>37)
>at 
>org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557
>)
>at 
>org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.ja
>va:231)
>at 
>org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.ja
>va:1075)
>at 
>org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>at 
>org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.jav
>a:193)
>at 
>org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.jav
>a:1009)
>at 
>org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:1
>35)
>at 
>org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHa
>ndlerCollection.java:255)
>at 
>org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollectio
>n.java:154)
>at 
>org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java
>:116)
>at 

Re: The most efficient way to get un-inverted view of the index?

2016-08-17 Thread Roman Chyla
Joel, thanks, but which of them? I've counted at least 4, if not more,
different ways of how to get DocValues. Are there many functionally
equal approaches just because devs can't agree on using one api? Or is
there a deeper reason?

Btw, the FieldCache is still there - both in lucene (to be deprecated)
and in solr; but became package accessible only

This is what removed the FieldCache:
https://issues.apache.org/jira/browse/LUCENE-5666
This is what followed: https://issues.apache.org/jira/browse/SOLR-8096

And there is still code which un-inverts data from an index if no
doc-values are available.

--roman

On Tue, Aug 16, 2016 at 9:54 PM, Joel Bernstein  wrote:
> You'll want to use org.apache.lucene.index.DocValues. The DocValues api has
> replaced the field cache.
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 8:18 PM, Roman Chyla  wrote:
>
>> I need to read data from the index in order to build a special cache.
>> Previously, in SOLR4, this was accomplished with FieldCache or
>> DocTermOrds
>>
>> Now, I'm struggling to see what API to use, there is many of them:
>>
>> on lucene level:
>>
>> UninvertingReader.getNumericDocValues (and others)
>> .getNumericValues()
>> MultiDocValues.getNumericValues()
>> MultiFields.getTerms()
>>
>> on solr level:
>>
>> reader.getNumericValues()
>> UninvertingReader.getNumericDocValues()
>> and extensions to FilterLeafReader - eg. very intersting, but
>> undocumented facet accumulators (ex: NumericAcc)
>>
>>
>> I need this for solr, and ideally re-use the existing cache [ie. the
>> special cache is using another fields so those get loaded only once
>> and reused in the old solr; which is a win-win situation]
>>
>> If I use reader.getValues() or FilterLeafReader will I be reading data
>> every time the object is created? What would be the best way to read
>> data only once?
>>
>> Thanks,
>>
>> --roman
>>
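
For reference, a minimal per-segment sketch of the DocValues API Joel points
to (method names are Lucene 6.x; the reader, the field name and what you do
with the values are assumptions, not from this thread):

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;

static void readNumericColumn(IndexReader reader, String field) throws IOException {
  for (LeafReaderContext ctx : reader.leaves()) {
    // returns an "empty" instance rather than null if the segment has no values
    NumericDocValues dv = DocValues.getNumeric(ctx.reader(), field);
    for (int docId = 0; docId < ctx.reader().maxDoc(); docId++) {
      long value = dv.get(docId);            // Lucene 6.x; 7.x+ uses advanceExact()/longValue()
      int globalDocId = ctx.docBase + docId; // global id, e.g. for building a cache
      // ... store (globalDocId, value) in the cache here
    }
  }
}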


Re: [Ext] Influence ranking based on document committed date

2016-08-17 Thread Stefan Matheis
Erick already gave you the solution, additional to that there’s a wiki
page that might contain a few more things about relevancy:

https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29

-Stefan


On August 17, 2016 at 5:35:10 PM, Erick Erickson
(erickerick...@gmail.com) wrote:
> Try:
> recip(rord(creationDate),1,1000,1000)
>
> See:
> https://wiki.apache.org/solr/FunctionQuery
>
> You can play with the magic numbers to influence how this scales your docs.
>
> Best,
> Erick
>
> On Wed, Aug 17, 2016 at 7:11 AM, Jay Parashar wrote:
> > This is correct: " I index it and feed it the timestamp at index time".
> > You can sort desc on that field (can be a TrieDateField)
> >
> >
> > -Original Message-
> > From: Steven White [mailto:swhite4...@gmail.com]
> > Sent: Wednesday, August 17, 2016 9:01 AM
> > To: solr-user@lucene.apache.org
> > Subject: [Ext] Influence ranking based on document committed date
> >
> > Hi everyone
> >
> > Let's say I search for the word "Olympic" and I get a hit on 10 documents 
> > that have similar
> content (let us assume the content is at least 80%
> > identical) how can I have Solr rank them so that the ones with most 
> > recently updated doc
> gets ranked higher? Is this something I have to do at index time or search 
> time?
> >
> > Is the trick to have a field that holds the committed timestamp and boost 
> > on that field
> during search? If so, is this field something I can configure in Solr's 
> schema.xml or
> must I index it and feed it the timestamp at index time? If I'm on the right 
> track, does this
> mean I have to always append this field base boost to each query a user 
> issues?
> >
> > If there is a wiki or article written on this topic, that would be a good 
> > start.
> >
> > In case it matters, I'm using Solr 5.2 and my searches are utilizing 
> > edismax.
> >
> > Thanks in advanced!
> >
> > Steve
>


Re: Increasing filterCache size and Java Heap size

2016-08-17 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thanks for the explanation.
I will prefer the memory-based limit too. At first I got confused with that
too, thinking that the setting of 2000 means 2GB.

Regards,
Edwin


On 17 August 2016 at 17:40, Toke Eskildsen  wrote:

> On Wed, 2016-08-17 at 11:02 +0800, Zheng Lin Edwin Yeo wrote:
> > Would like to check, do I need to increase my Java Heap size for
> > Solr, if I plan to increase my filterCache size in solrconfig.xml?
> >
> > I'm using Solr 6.1.0
>
> It _seems_ that you can specify a limit in megabytes when using
> LRUCache in Solr 5.2+: https://issues.apache.org/jira/browse/SOLR-7372
>
> The documentation only mentions it for queryResultCache, but I do not
> know if that is intentional (i.e. it does not work for filterCache) or
> a shortcoming of the documentation:
> https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+Solr
> Config
>
> If it does work for filterCache too (using LRUCache, I guess), then
> that would be a much better way of limiting cache size than the highly
> insufficient count-based limiter.
>
>
> I say "highly insufficient" because filter cache entries are not of
> equal size. With small sets they are stored as sparse, using a
> relatively small amount of memory. For larger sets they are stored as
> bitmaps, taking up ~1K + maxdoc/8 bytes as Erick describes.
>
> So a fixed upper limit measured in counts needs to be adjusted to worst
> case, meaning maxdoc/8, to ensure stability. In reality most of the
> filter cache entries are small, meaning that there is plenty of heap
> not being used. This leads people to over-allocate the max size for the
> filterCache (very understandable) , resulting in setups that are only
> stable as long as there are not too many large filter sets stores.
> Leaving it to chance really.
>
> I would prefer the count-based limit to be deprecated for the
> filterCache, or at least warned against, in favour of memory-based.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>


Re: [Ext] Influence ranking based on document committed date

2016-08-17 Thread Erick Erickson
Try:
recip(rord(creationDate),1,1000,1000)

See:
https://wiki.apache.org/solr/FunctionQuery

You can play with the magic numbers to influence how this scales your docs.
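
For example (an illustrative request, not from the thread; creationDate is
assumed to be an indexed date field), the function can be applied as a
multiplicative edismax boost so that it scales relevance instead of replacing it:

q=Olympic&defType=edismax&boost=recip(rord(creationDate),1,1000,1000)

The SolrRelevancyFAQ variant, recip(ms(NOW,creationDate),3.16e-11,1,1),
achieves the same with date math and avoids loading the field's ordinals into
memory the way rord() does.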

Best,
Erick

On Wed, Aug 17, 2016 at 7:11 AM, Jay Parashar  wrote:
> This is correct: " I index it and feed it the timestamp at index time".
> You can sort desc on that field (can be a TrieDateField)
>
>
> -Original Message-
> From: Steven White [mailto:swhite4...@gmail.com]
> Sent: Wednesday, August 17, 2016 9:01 AM
> To: solr-user@lucene.apache.org
> Subject: [Ext] Influence ranking based on document committed date
>
> Hi everyone
>
> Let's say I search for the word "Olympic" and I get a hit on 10 documents 
> that have similar content (let us assume the content is at least 80%
> identical) how can I have Solr rank them so that the ones with most recently 
> updated doc gets ranked higher?  Is this something I have to do at index time 
> or search time?
>
> Is the trick to have a field that holds the committed timestamp and boost on 
> that field during search?  If so, is this field something I can configure in 
> Solr's schema.xml or must I index it and feed it the timestamp at index time? 
>  If I'm on the right track, does this mean I have to always append this field 
> base boost to each query a user issues?
>
> If there is a wiki or article written on this topic, that would be a good 
> start.
>
> In case it matters, I'm using Solr 5.2 and my searches are utilizing edismax.
>
> Thanks in advanced!
>
> Steve


Re: Modified stat of index

2016-08-17 Thread Scott Derrick

thanks that works perfectly!

Scott

 Original Message 
Subject: Re: Modified stat of index
From: Alexandre Rafalovitch 
To: solr-user 
Date: 08/16/2016 04:17 PM


I believe you can get that via Luke REST API:
http://localhost:8983/solr/<core>/admin/luke
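
(For reference, the "index" section of the Luke response contains a
lastModified entry; a typical request would be
http://localhost:8983/solr/<core>/admin/luke?numTerms=0&wt=json
where <core> stands for the actual core name.)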

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 17 August 2016 at 07:18, Scott Derrick  wrote:

I need to retrieve the last modified timestamp of my search index.

Is there a query I can use or is it stored in a particular file?

thansk,

Scott

--
One man's "magic" is another man's engineering. "Supernatural" is a null
word.”
Robert A. Heinlein





--
It is with our passions, as it is with fire and water, they are good servants 
but bad masters.
Aesop



Re: Use function in condition

2016-08-17 Thread Emir Arnautovic

Hi Nabil,

You can use frange queries, e.g. you can use fq={!frange 
l=100}sum(field1,field2) to filter doc with sum greater than 100.
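
For illustration (the field and value names below are assumptions), the same
mechanism covers the if() example from the original question, since frange can
filter on the result of any function:

fq={!frange l=3 u=3}if(exists(test),value1,value2)

keeps only documents for which the function evaluates to 3; l and u are the
inclusive lower and upper bounds.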


Regards,
Emir


On 17.08.2016 16:26, nabil Kouici wrote:

Hi,
Is it possible to use functions (function query 
https://cwiki.apache.org/confluence/display/solr/Function+Queries) in q or fq 
parameters to build a complex search expression.
For example, take only documents where sum(field1,field2) > 100. Another example:
if(test,value1,value2):value3
Regards, Nabil.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: index size increases dramatically

2016-08-17 Thread Jan Høydahl
Hi

It is quite normal that index size can be close to double during background 
merge of segments. If you have a lot of deletions and/or reindexed docs then 
the same document may also exist in multiple segments, taking up space 
temporarily until a merge or optimize.

If this slows down your system then it sounds like your system is not sized 
properly wrt memory.

But you need to provide more details for anyone to be able to tell you exactly 
what is going on in your situation.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 17. aug. 2016 kl. 15.59 skrev kshitij tyagi :
> 
> Hi,
> 
> 
> Suddenly my index size just doubles and indexing just slows down poorly.
> 
> After sometime it reduces back to normal and indexing starts working.
> 
> Can someone help me out in finding why index size doubles abnormally??



Re: Tagging and excluding Filters with BlockJoin Queries and BlockJoin Faceting

2016-08-17 Thread Stefan Moises

Hi Mikhail,

thanks for the info ... what is the advantage of using the JSON FACET 
API compared to the standard BlockJoinQuery features?


Is there already anybody working on the tagging/exclusion feature or is 
there any timeframe for it? There wasn't any discussion yet in SOLR-8998 
about exclusions, was there?


Thank you very much,

best,

Stefan


Am 17.08.16 um 15:26 schrieb Mikhail Khludnev:

Stefan,
child.facet.field never intend to support exclusions. My preference is to
implement it under json.facet that's discussed under
https://issues.apache.org/jira/browse/SOLR-8998.

On Wed, Aug 17, 2016 at 3:52 PM, Stefan Moises  wrote:


Hey girls and guys,

for a long time we have been using our own BlockJoin Implementation,
because for our Shop Systems a lot of requirements that we had were not
implemented in solr.

As we now had a deeper look into how far the standard has come, we saw
that BlockJoin and faceting on children is now part of the standard, which
is pretty cool.
When I tried to refactor our external code to use that now, I stumbled
upon one non-working feature with BlockJoins that still keeps us from using
it:

It seems that tagging and excluding Filters with BlockJoin Faceting simply
does not work yet.

Simple query:

=products
q={!parent which='isparent:true'}shirt AND isparent:false
facet=true
fq={!parent which='isparent:true'}{!tag=myTag}color:grey
child.facet.field={!ex=myTag}color


Gives us:
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
undefined field: "{!ex=myTag}color"
 at org.apache.solr.schema.IndexSchema.getField(IndexSchema.
java:1231)


Does somebody have an idea?


Best,
Stefan

--
--

Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de

Geschäftsführung: Friedrich Schreieck
Ust.-IdNr.: DE 814340642
Amtsgericht Nürnberg HRB 21703
   






--
--

Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de

Geschäftsführung: Friedrich Schreieck
Ust.-IdNr.: DE 814340642
Amtsgericht Nürnberg HRB 21703
  





Re: What does refCount denote in solr admin

2016-08-17 Thread kshitij tyagi
any update??

On Wed, Aug 17, 2016 at 12:47 PM, kshitij tyagi  wrote:

> Hi,
>
> I need to understand what is refcount in stats section of solr admin.
>
> I am seeing refcount: 2 on my solr cores and on one of the core i am
> seeing refcount:171.
>
> The core with refcount  with higher number   is having very slow indexing
> speed?
>
>
>


Use function in condition

2016-08-17 Thread nabil Kouici
Hi,
Is it possible to use functions (function query 
https://cwiki.apache.org/confluence/display/solr/Function+Queries) in q or fq 
parameters to build a complex search expression. 
For example, take only documents where sum(field1,field2) > 100. Another example:
if(test,value1,value2):value3
Regards, Nabil.

RE: [Ext] Influence ranking based on document committed date

2016-08-17 Thread Jay Parashar
This is correct: " I index it and feed it the timestamp at index time".
You can sort desc on that field (can be a TrieDateField)
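
For illustration (the field and type names are assumptions, not from the
thread), such an index-time timestamp can be captured with a default of NOW in
schema.xml and then used for sorting:

<field name="timestamp" type="tdate" indexed="true" stored="true"
       default="NOW" multiValued="false"/>

q=Olympic&sort=timestamp desc

Note that a hard sort replaces relevance ordering entirely; the recip()
function boost suggested elsewhere in this thread keeps relevance and only
nudges newer documents higher.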


-Original Message-
From: Steven White [mailto:swhite4...@gmail.com] 
Sent: Wednesday, August 17, 2016 9:01 AM
To: solr-user@lucene.apache.org
Subject: [Ext] Influence ranking based on document committed date

Hi everyone

Let's say I search for the word "Olympic" and I get a hit on 10 documents that 
have similar content (let us assume the content is at least 80%
identical) how can I have Solr rank them so that the ones with most recently 
updated doc gets ranked higher?  Is this something I have to do at index time 
or search time?

Is the trick to have a field that holds the committed timestamp and boost on 
that field during search?  If so, is this field something I can configure in 
Solr's schema.xml or must I index it and feed it the timestamp at index time?  
If I'm on the right track, does this mean I have to always append this field 
base boost to each query a user issues?

If there is a wiki or article written on this topic, that would be a good start.

In case it matters, I'm using Solr 5.2 and my searches are utilizing edismax.

Thanks in advanced!

Steve


Influence ranking based on document committed date

2016-08-17 Thread Steven White
Hi everyone

Let's say I search for the word "Olympic" and I get a hit on 10 documents
that have similar content (let us assume the content is at least 80%
identical) how can I have Solr rank them so that the ones with most
recently updated doc gets ranked higher?  Is this something I have to do at
index time or search time?

Is the trick to have a field that holds the committed timestamp and boost
on that field during search?  If so, is this field something I can
configure in Solr's schema.xml or must I index it and feed it the timestamp
at index time?  If I'm on the right track, does this mean I have to always
append this field base boost to each query a user issues?

If there is a wiki or article written on this topic, that would be a good
start.

In case it matters, I'm using Solr 5.2 and my searches are utilizing
edismax.

Thanks in advanced!

Steve


index size increases dramatically

2016-08-17 Thread kshitij tyagi
Hi,


Suddenly my index size just doubles and indexing just slows down poorly.

After sometime it reduces back to normal and indexing starts working.

Can someone help me out in finding why index size doubles abnormally??


Re: Tagging and excluding Filters with BlockJoin Queries and BlockJoin Faceting

2016-08-17 Thread Mikhail Khludnev
Stefan,
child.facet.field never intend to support exclusions. My preference is to
implement it under json.facet that's discussed under
https://issues.apache.org/jira/browse/SOLR-8998.

On Wed, Aug 17, 2016 at 3:52 PM, Stefan Moises  wrote:

> Hey girls and guys,
>
> for a long time we have been using our own BlockJoin Implementation,
> because for our Shop Systems a lot of requirements that we had were not
> implemented in solr.
>
> As we now had a deeper look into how far the standard has come, we saw
> that BlockJoin and faceting on children is now part of the standard, which
> is pretty cool.
> When I tried to refactor our external code to use that now, I stumbled
> upon one non-working feature with BlockJoins that still keeps us from using
> it:
>
> It seems that tagging and excluding Filters with BlockJoin Faceting simply
> does not work yet.
>
> Simple query:
>
> =products
> q={!parent which='isparent:true'}shirt AND isparent:false
> facet=true
> fq={!parent which='isparent:true'}{!tag=myTag}color:grey
> child.facet.field={!ex=myTag}color
>
>
> Gives us:
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
> undefined field: "{!ex=myTag}color"
> at org.apache.solr.schema.IndexSchema.getField(IndexSchema.
> java:1231)
>
>
> Does somebody have an idea?
>
>
> Best,
> Stefan
>
> --
> --
> 
> Stefan Moises
> Manager Research & Development
> shoptimax GmbH
> Ulmenstraße 52 H
> 90443 Nürnberg
> Tel.: 0911/25566-0
> Fax: 0911/25566-29
> moi...@shoptimax.de
> http://www.shoptimax.de
>
> Geschäftsführung: Friedrich Schreieck
> Ust.-IdNr.: DE 814340642
> Amtsgericht Nürnberg HRB 21703
>   
>
>


-- 
Sincerely yours
Mikhail Khludnev


Tagging and excluding Filters with BlockJoin Queries and BlockJoin Faceting

2016-08-17 Thread Stefan Moises

Hey girls and guys,

for a long time we have been using our own BlockJoin Implementation, because 
for our Shop Systems a lot of requirements that we had were not implemented in 
solr.

As we now had a deeper look into how far the standard has come, we saw that 
BlockJoin and faceting on children is now part of the standard, which is pretty 
cool.
When I tried to refactor our external code to use that now, I stumbled upon one 
non-working feature with BlockJoins that still keeps us from using it:

It seems that tagging and excluding Filters with BlockJoin Faceting simply does 
not work yet.

Simple query:

=products
q={!parent which='isparent:true'}shirt AND isparent:false
facet=true
fq={!parent which='isparent:true'}{!tag=myTag}color:grey
child.facet.field={!ex=myTag}color


Gives us:
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: undefined field: 
"{!ex=myTag}color"
at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1231)


Does somebody have an idea?


Best,
Stefan

--
--

Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de

Geschäftsführung: Friedrich Schreieck
Ust.-IdNr.: DE 814340642
Amtsgericht Nürnberg HRB 21703
  





Re: Creating a SolrJ Data Service to send JSON to Solr

2016-08-17 Thread Jennifer Coston
Thank you Alex and Anshum! I will look into both of these.

Jennifer



From:   Anshum Gupta 
To: solr-user@lucene.apache.org
Date:   08/16/2016 08:17 PM
Subject:Re: Creating a SolrJ Data Service to send JSON to Solr



I would also suggest sending the JSON directly to the JSON end point, with
the mapping :
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index
+Handlers#UploadingDatawithIndexHandlers-JSONUpdateConveniencePaths

On Tue, Aug 16, 2016 at 4:43 PM Alexandre Rafalovitch 
wrote:

> Why do you need a POJO? For Solr purposes, you could just get the
> field names from schema and use those to map directly from JSON to the
> 'addField' calls in SolrDocument.
>
> Do you need it for non-Solr purposes? Then you can search for generic
> Java dynamic POJO generation solution.
>
> Also, you could look at creating a superset rather than common-subset
> POJO and then ignore all unknown fields on Solr side by adding a
> dynamicField that matches '*' with everything (index, store,
> docValues) set to false.
>
> Regards,
>Alex.
>
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 17 August 2016 at 02:49, Jennifer Coston
>  wrote:
> >
> > Hello,
> > I am trying to write a data service using SolrJ that will allow me to
> > accept JSON through a REST API, create a Solr document ,and write it to
> > multiple different Solr cores (depending on the core name specified).
The
> > problem I am running into is that each core is going to have a
different
> > schema. My current code has the common fields between all the schemas
in
> a
> > data POJO which I then walk and set the values specified in the JSON to
> the
> > Solr Document. However, I don’t want to create a different class for
each
> > schema to process the JSON and convert it to a Solr Document. Is there
a
> > way to process the extra JSON fields that are not common between the
> > schemas and add them to the Solr Document, without knowing what they
are
> > ahead of time? Is there a way to convert JSON to a Solr Document
without
> > having to use a POJO?  An alternative I was looking into is to use the
> > SolrClient to get the schema fields, create a POJO, walk that POJO to
> > create a Solr Document and then add it to Solr but, it doesn’t seem to
be
> > possible to obtain the fields this way.
> >
> > I know that the easiest way to add JSON to Solr would be to use a curl
> > command and send the JSON directly to Solr but this doesn’t match our
> > requirements, so I need to figure out a way to perform the same
operation
> > using SolrJ. Any other ideas or suggestions would be greatly
appreciated!
> >
> > Thank you,
> >
> > -Jennifer
>


Re: Increasing filterCache size and Java Heap size

2016-08-17 Thread Toke Eskildsen
On Wed, 2016-08-17 at 11:02 +0800, Zheng Lin Edwin Yeo wrote:
> Would like to check, do I need to increase my Java Heap size for
> Solr, if I plan to increase my filterCache size in solrconfig.xml?
> 
> I'm using Solr 6.1.0

It _seems_ that you can specify a limit in megabytes when using
LRUCache in Solr 5.2+: https://issues.apache.org/jira/browse/SOLR-7372

The documentation only mentions it for queryResultCache, but I do not
know if that is intentional (i.e. it does not work for filterCache) or
a shortcoming of the documentation:
https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+Solr
Config

If it does work for filterCache too (using LRUCache, I guess), then
that would be a much better way of limiting cache size than the highly
insufficient count-based limiter.
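
If the megabyte limit does apply to the filterCache as well, the
solrconfig.xml entry would look something like this (the 300 MB figure is only
an assumption to show the syntax):

<filterCache class="solr.LRUCache" maxRamMB="300" autowarmCount="0"/>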


I say "highly insufficient" because filter cache entries are not of
equal size. With small sets they are stored as sparse, using a
relatively small amount of memory. For larger sets they are stored as
bitmaps, taking up ~1K + maxdoc/8 bytes as Erick describes.

So a fixed upper limit measured in counts needs to be adjusted to worst
case, meaning maxdoc/8, to ensure stability. In reality most of the
filter cache entries are small, meaning that there is plenty of heap
not being used. This leads people to over-allocate the max size for the
filterCache (very understandable) , resulting in setups that are only
stable as long as there are not too many large filter sets stores.
Leaving it to chance really.
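
As a concrete illustration (the index size is an assumption): with
maxDoc = 20,000,000, one bitmap entry is about 20,000,000 / 8 = 2.5 MB, so a
filterCache configured for 2000 entries can in the worst case approach
2000 x 2.5 MB = 5 GB of heap, even though a typical mix of mostly-sparse
entries may use only a small fraction of that.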

I would prefer the count-based limit to be deprecated for the
filterCache, or at least warned against, in favour of memory-based.

- Toke Eskildsen, State and University Library, Denmark



What does refCount denotes in solr admin

2016-08-17 Thread kshitij tyagi
Hi,

I need to understand what refCount means in the stats section of the Solr admin.

I am seeing refCount: 2 on my Solr cores, and on one of the cores I am seeing
refCount: 171.

The core with the higher refCount is the one with very slow indexing speed.
Could the two be related?


RE: solr-6.1.0 - Using different client and server certificates for authentication doesn't work

2016-08-17 Thread Kostas
This is what helped me:
https://gist.github.com/jankronquist/6412839
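
In case the link goes stale: a common reason a separate client certificate is
rejected is that the server's truststore does not contain it. A rough keytool
sketch of generating a client keystore and trusting its certificate on the
server side (aliases, file names and passwords are placeholders):

keytool -genkeypair -alias solr-client -keyalg RSA -keysize 2048 ^
  -keystore solr-ssl-client.keystore.jks -storepass password

keytool -exportcert -alias solr-client -file solr-client.crt ^
  -keystore solr-ssl-client.keystore.jks -storepass password

keytool -importcert -alias solr-client -file solr-client.crt -noprompt ^
  -keystore solr-ssl.keystore.jks -storepass password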




-Original Message-
From: Kostas [mailto:k...@dataverse.gr] 
Sent: Tuesday, July 26, 2016 3:22 PM
To: solr-user@lucene.apache.org
Subject: solr-6.1.0 - Using different client and server certificates for
authentication doesn't work

Hello.

 

I have setup Solr 6.1.0 to use SSL (on Windows) and to do client
authentication based on the client certificate.

When I use the same certificate for both the server and the client
authentication, everything works OK :

 



== solr.in.cmd

set SOLR_SSL_KEY_STORE=%ROO%/server/etc/solr-ssl.keystore.jks

set SOLR_SSL_KEY_STORE_PASSWORD=password

set SOLR_SSL_TRUST_STORE=%ROO%/server/etc/solr-ssl.keystore.jks

set SOLR_SSL_TRUST_STORE_PASSWORD=password

set SOLR_SSL_NEED_CLIENT_AUTH=true

set SOLR_SSL_WANT_CLIENT_AUTH=false

REM (Client settings residing below are commented out.)

 

== server\etc\jetty-ssl.xml

  [keystore/truststore <Set> elements were stripped by the list archive]

==  This works :

curl ^

--cert "solr-ssl.keystore.pem" ^

--cacert "solr-ssl.keystore.pem" ^

"https://localhost:8898/solr/admin/collections?action=CLUSTERSTATUS=json;
indent=on"



 

However, when I try to use different server and client certificates, it
doesn't work (it seems that it still uses the server certificate for client
authentication):

 



== solr.in.cmd

set SOLR_SSL_KEY_STORE=%ROO%/server/etc/solr-ssl.keystore.jks

set SOLR_SSL_KEY_STORE_PASSWORD=password

set SOLR_SSL_TRUST_STORE=%ROO%/server/etc/solr-ssl.keystore.jks

set SOLR_SSL_TRUST_STORE_PASSWORD=password

set SOLR_SSL_NEED_CLIENT_AUTH=true

set SOLR_SSL_WANT_CLIENT_AUTH=false

 

set SOLR_SSL_CLIENT_KEY_STORE=%ROO%/server/etc/solr-ssl-client.keystore.jks

set SOLR_SSL_CLIENT_KEY_STORE_PASSWORD=password

set
SOLR_SSL_CLIENT_TRUST_STORE=%ROO%/server/etc/solr-ssl-client.keystore.jks

set SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD=password

 

 

== server\etc\jetty-ssl.xml

  [keystore/truststore <Set> elements were stripped by the list archive]

== This fails (!!!):

curl ^

--cert "solr-ssl-client.keystore.pem" ^

--cacert "solr-ssl.keystore.pem" ^

"https://localhost:8898/solr/admin/collections?action=CLUSTERSTATUS=json;
indent=on"

 

== This STILL works (!!!):

curl ^

--cert "solr-ssl.keystore.pem" ^

--cacert "solr-ssl.keystore.pem" ^

"https://localhost:8898/solr/admin/collections?action=CLUSTERSTATUS=json;
indent=on"



 

I run Solr like this:

 

"%ROO%\bin\solr" start -c -V -f -p 8898^

-Dsolr.ssl.checkPeerName=false

 

From what I can tell, Solr uses the values from `server\etc\jetty-ssl.xml`
and totally discards the ones from `solr.in.cmd`.

Naturally, I would try to set the client certificate inside there
(jetty-ssl.xml), but I don't see any setting available for that.

Is what I am trying to do (use different certificates for server and client
authentication) supported, or am I wasting my time?

Also, why don't the docs say that jetty-ssl.xml overrides the settings in
`solr.in.cmd`? Am I missing something?

 

Thanks,
Kostas

 




Re: Increasing filterCache size and Java Heap size

2016-08-17 Thread Zheng Lin Edwin Yeo
Hi Erick,

Thanks for your reply.

But do we have to set the Java Heap size based on all the collections
available (if I were to increase the filterCache size for all my
collections)?

I came across this on StackOverflow,
http://stackoverflow.com/questions/2004/solr-filter-cache-fastlrucache-takes-too-much-memory-and-results-in-out-of-mem
it says that if we want to have a filterCache size of 2000, we will need
12GB of memory.

Let's say we have 3 collections, all with the filterCache size set to 2000.
Do we need 36GB of Java heap space memory, or will just 12GB be sufficient?

Regards,
Edwin


On 17 August 2016 at 14:09, Erick Erickson  wrote:

> Yes. Each entry is roughly 1K + maxdoc/8 bytes. The maxdoc/8 is the
> bitmap that holds the result set and the 1K is just overhead for the
> text of the query itself and cache overhead. Usually it's safe to
> ignore since the maxdoc/8 usually dominates by a wide margin.
>
> Best,
> Erick
>
> On Tue, Aug 16, 2016 at 8:02 PM, Zheng Lin Edwin Yeo
>  wrote:
> > Hi,
> >
> > Would like to check, do I need to increase my Java Heap size for Solr,
> if I
> > plan to increase my filterCache size in solrconfig.xml?
> >
> > I'm using Solr 6.1.0
> >
> > Regards,
> > Edwin
>
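
A rough way to reason about this, assuming Erick's formula and that each core
keeps its own filterCache: the worst cases add up per core, so 3 collections
each allowing 2,000 entries would, under the same assumptions as the
StackOverflow post, need roughly 3 * 12GB = 36GB in the worst case, while the
typical footprint is much lower because most entries are sparse.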


Re: Indexing (posting document) taking a lot of time

2016-08-17 Thread kshitij tyagi
I am posting json using curl.

On Wed, Aug 17, 2016 at 4:41 AM, Alexandre Rafalovitch 
wrote:

> What format are those documents? Solr XML? Custom JSON?
>
> Or are you sending PDF/binary documents to Solr's extract handler and
> asking it to do the extraction of the useful stuff? If later, you
> could take that step out of Solr with a custom client using Tika (what
> Solr has under the hood) and only send to Solr the processed output.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 16 August 2016 at 22:49, kshitij tyagi 
> wrote:
> > 400KB is the size of a single document and I am sending 100 documents per
> > request. Solr heap size is 16GB and it is running multithreaded.
> >
> > On Tue, Aug 16, 2016 at 5:10 PM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi,
> >>
> >> 400KB/doc * 100doc = 40MB. If you are running it single threaded, Solr
> >> will be idle while accepting a relatively large request. Or is 400KB the
> >> 100-doc bulk that you are sending?
> >>
> >> What is Solr's heap size? I would try increasing number of threads and
> >> monitor Solr's heap/CPU/IO to see where is the bottleneck.
> >>
> >> How complex is fields' analysis?
> >>
> >> Regards,
> >> Emir
> >>
> >>
> >> On 16.08.2016 13:25, kshitij tyagi wrote:
> >>
> >>> hi,
> >>>
> >>> we are sending about 100 documents per request for indexing. We have
> >>> autocommit set to false and commit only when 1 documents are
> >>> present. Solr and the machine sending requests are in the same pool.
> >>>
> >>>
> >>>
> >>> On Tue, Aug 16, 2016 at 4:51 PM, Emir Arnautovic <
> >>> emir.arnauto...@sematext.com> wrote:
> >>>
> >>> Hi,
> 
>  Do you send one doc per request? How frequently do you commit? Where is
>  Solr running? What is the network connection between your machine and Solr?
>  What are the JVM settings? Is 10-30s for the entire indexing or a single doc?
> 
>  Regards,
>  Emir
> 
> 
>  On 16.08.2016 11:34, kshitij tyagi wrote:
> 
>  Hi alexandre,
> >
> > 1 document of 400kb size is taking approx 10-30 sec and this is
> > varying. I
> > am posting document using curl
> >
> > On Tue, Aug 16, 2016 at 2:11 PM, Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > wrote:
> >
> >> How many records is that and what is 'slow'? Also is this standalone or
> >> cluster setup?
> >>
> >> On 16 Aug 2016 6:33 PM, "kshitij tyagi" <kshitij.shopcl...@gmail.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >>> I am indexing a lot of data, about 8GB, but it is taking a lot of time. I
> >>> have read about maxBufferedDocs, ramBufferSizeMB, merge policy, etc. in the
> >>> solrconfig file.
> >>>
> >>> It would be helpful if someone could help me tune the settings for
> >>> faster indexing speeds.
> >>>
> >>> *I have read the docs but am not able to get what exactly changing
> >>> these configs means.*
> >>>
> >>>
> >>> *Regards,*
> >>> *Kshitij*
> >>>
> >>>
> >>> --
>  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>  Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> 
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
>
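
Since the suggestion above is to send bulks from several threads, here is a
minimal SolrJ sketch of that idea using ConcurrentUpdateSolrClient (URL, queue
size, thread count and field names are made-up values to adapt):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // Buffers documents and sends them to Solr from 4 background threads.
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 10000, 4);
    try {
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title_t", "document " + i);
        client.add(doc);             // queued and sent asynchronously
      }
      client.blockUntilFinished();   // wait for the internal queues to drain
      client.commit();               // one commit at the end, not per batch
    } finally {
      client.close();
    }
  }
}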


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-17 Thread Radu Gheorghe
Thanks a lot, Joel, for your very fast and informative reply!

We'll chew on this and add a Jira if we're going on this route.
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein  wrote:
> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein  wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet been
>> implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
> >> stream. Pseudo code for a metric expression syntax is below:
>>
> >> metrics(metrics(search()))
>>
> >> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering then we'll
>> also need to support the merging of the metrics. Something like this:
>>
> >> metrics(parallel(metrics(join())))
>>
>> Where the metrics wrapping the parallel function would need to collect the
> >> EOF tuples from each worker and then merge the metrics and emit the
> >> merged metrics in an EOF Tuple.
>>
> >> If you think this meets your needs, feel free to create a jira and
> >> begin a patch, and I can help get it committed.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to rollup on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use-case, when we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch[1] which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'll potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>>>
>>
>>
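
For anyone curious what the Count-min sketch in [1] looks like in code, a tiny
stand-alone Java version (depth, width and hashing are arbitrary illustration,
not anything from Solr):

import java.util.Random;

// Minimal Count-min sketch: estimates per-term counts in sub-linear memory.
// Estimates can be too high but never too low.
public class CountMinSketch {
  private final int depth;      // number of hash rows
  private final int width;      // counters per row
  private final long[][] table;
  private final int[] seeds;

  public CountMinSketch(int depth, int width) {
    this.depth = depth;
    this.width = width;
    this.table = new long[depth][width];
    this.seeds = new int[depth];
    Random rnd = new Random(42);
    for (int i = 0; i < depth; i++) {
      seeds[i] = rnd.nextInt();
    }
  }

  private int bucket(String item, int row) {
    int h = item.hashCode() ^ seeds[row];
    h ^= (h >>> 16);
    return Math.floorMod(h, width);
  }

  public void add(String item) {
    for (int row = 0; row < depth; row++) {
      table[row][bucket(item, row)]++;
    }
  }

  public long estimate(String item) {
    long min = Long.MAX_VALUE;
    for (int row = 0; row < depth; row++) {
      min = Math.min(min, table[row][bucket(item, row)]);
    }
    return min;
  }
}

Feeding each tuple's facet field through add() and keeping a small heap of the
current top-N estimates would give the approximate, sort-free facets described
above.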


Re: Increasing filterCache size and Java Heap size

2016-08-17 Thread Erick Erickson
Yes. Each entry is roughly 1K + maxdoc/8 bytes. The maxdoc/8 is the
bitmap that holds the result set and the 1K is just overhead for the
text of the query itself and cache overhead. Usually it's safe to
ignore since the maxdoc/8 usually dominates by a wide margin.

Best,
Erick

On Tue, Aug 16, 2016 at 8:02 PM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> Would like to check, do I need to increase my Java Heap size for Solr, if I
> plan to increase my filterCache size in solrconfig.xml?
>
> I'm using Solr 6.1.0
>
> Regards,
> Edwin