Re: Need Help in migrating Solr version 1.4 to 4.3

2013-06-26 Thread Shawn Heisey
On 6/26/2013 11:25 PM, Sandeep Gupta wrote:
> To have a singleton design pattern for SolrServer object creation,
> I found that there are many ways described in
> http://en.wikipedia.org/wiki/Singleton_pattern
> So which is the best one, out of the 5 examples mentioned in the above URL, for a web
> application in general practice?
> 
> I am sure lots of people (on this mailing list) will have practical
> experience
> as to which type of singleton pattern needs to be implemented for creation of
> the SolrServer object.

I will admit that when I used the word "singleton" I honestly hadn't
looked it up to see what it really meant.  If you do use the full
meaning of singleton, you can do this in any way you want.

Perhaps a better thing to say is that you only need one SolrServer
object for each base URL (host/port/core combination).  Things are a
little bit different when it comes to SolrCloud - you can use one
CloudSolrServer object for the entire cloud, even if there are many
collections and many servers.

In my own SolrJ code, I create two HttpSolrServer objects within each of
my homegrown "Core" objects.  One of them is for operations against that
specific Solr core, the other is for CoreAdmin operations.

Because the URL for CoreAdmin operations is common to multiple cores, I
create a static Map with those server objects so that my "Core" objects
can share the SolrServer object used for CoreAdmin when they are on the
same server machine.

For the query side, if you're in a situation where you have one access
point to your Solr installation (a load balancer in front of replicating
Solr servers) and you only have one index, then you could create a
single static SolrServer object for your entire application.
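(For illustration, a minimal untested SolrJ 4.x sketch of that last idea: one
HttpSolrServer shared through a static field. The base URL and core name are
placeholders.)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public final class SolrClientHolder {
    // One HttpSolrServer per base URL; the object is thread-safe,
    // so the whole web application can share this single instance.
    private static final SolrServer QUERY_SERVER =
        new HttpSolrServer("http://localhost:8983/solr/collection1");

    private SolrClientHolder() {}

    public static SolrServer getQueryServer() {
        return QUERY_SERVER;
    }
}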

Thanks,
Shawn



Re: Need Help in migrating Solr version 1.4 to 4.3

2013-06-26 Thread Sandeep Gupta
Thanks Shawn.
To have a singleton design pattern for SolrServer object creation,
I found that there are many ways described in
http://en.wikipedia.org/wiki/Singleton_pattern
So which is the best one, out of the 5 examples mentioned in the above URL, for a web
application in general practice?

I am sure lots of people (on this mailing list) will have practical
experience
as to which type of singleton pattern needs to be implemented for creation of
the SolrServer object.

Waiting for some comments on this front.

Regards
Sandeep




On Wed, Jun 26, 2013 at 9:20 PM, Shawn Heisey  wrote:

> On 6/25/2013 11:52 PM, Sandeep Gupta wrote:
> > Also, on the application development side,
> > as I said, I am going to use the HTTPSolrServer API and I found that we
> > shouldn't create this object multiple times
> > (as per the wiki document
> http://wiki.apache.org/solr/Solrj#HttpSolrServer)
> > So I am planning to have my Server class as a singleton.
> >  Please advise a little bit on this front also.
>
> This is always the way that SolrServer objects are intended to be used,
> including CommonsHttpSolrServer in version 1.4.  The only major
> difference between the two objects is that the new one uses
> HttpComponents 4.x and the old one uses HttpClient 3.x.  There are other
> differences, but they are just the result of incremental improvements
> from version to version.
>
> Thanks,
> Shawn
>
>


Re: Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Otis Gospodnetic
If hibernate search is like regular hibernate ORM I'm not sure I'd
trust it to pick the most optimal solutions...

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jun 26, 2013 4:44 PM, "Guido Medina"  wrote:

> Never heard of embedded Solr server; isn't it better to just use Lucene alone
> for that purpose? Using a helper like Hibernate? Since most applications
> that require indexes will have a relational DB behind the scenes, it would
> not be a bad idea to use an ORM combined with Lucene annotations (aka
> hibernate-search).
>
> Guido.
>
> On 26/06/13 20:30, Alexandre Rafalovitch wrote:
>
>> Yes, it is possible by running an embedded Solr inside SolrJ process.
>> The nice thing is that the index is portable, so you can then access
>> it from the standalone Solr server later.
>>
>> I have an example here:
>> https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj
>> , which shows SolrJ running both as a client and with an embedded
>> container. Notice that you will probably need more jars than you
>> expect for the standalone Solr to work, including a number of servlet
>> jars.
>>
>> Regards,
>>Alex.
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Wed, Jun 26, 2013 at 2:59 PM, Learner  wrote:
>>
>>> I currently have a SOLRJ program which I am using for indexing the data in
>>> SOLR. I am trying to figure out a way to build an index without depending on
>>> a running instance of SOLR. I should be able to supply the solrconfig and
>>> schema.xml to the indexing program, which in turn creates index files that I
>>> can use with any SOLR instance. Is it possible to implement this?
>>>
>>>
>>>
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>


Replicating files containing external file fields

2013-06-26 Thread Arun Rangarajan
>From https://wiki.apache.org/solr/SolrReplication I understand that index
dir and any files under the conf dir can be replicated to slaves. I want to
know if there is any way the files under the data dir containing external
file fields can be replicated. These are not replicated by default.
Currently we are running the ext file field reload script on both the
master and the slave and then running reloadCache on each server once they
are loaded.


Re: OOM killer script woes

2013-06-26 Thread Timothy Potter
Thanks for the feedback Daniel ... For now, I've opted to just kill
the JVM with System.exit(1) in the SolrDispatchFilter code and will
restart it with a Linux supervisor. Not elegant but the alternative of
having a zombie Solr instance walking around my cluster is much worse
;-) Will try to dig into the code that is trapping this error but for
now I've lost too many hours on this problem.
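(For anyone hitting the same wall, a rough untested sketch of the workaround Tim
describes: walk the cause chain in the servlet filter's error path and exit the
JVM so an external supervisor can restart it. The class and method names here
are hypothetical, not actual Solr source.)

public final class OomExitHelper {
    // If Jetty wraps the OutOfMemoryError in a RuntimeException,
    // -XX:OnOutOfMemoryError never fires, so exit explicitly instead.
    public static void exitIfOutOfMemory(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof OutOfMemoryError) {
                System.err.println("OutOfMemoryError detected; exiting JVM");
                System.exit(1);
            }
        }
    }
}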

Cheers,
Tim

On Wed, Jun 26, 2013 at 2:43 PM, Daniel Collins  wrote:
> Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and
> throwing it/packaging it as a java.lang.RuntimeException.  The -XX option
> assumes that the application doesn't handle the Errors and so they would
> reach the JVM and thus invoke the handler.
> Since Jetty has an exception handler that deals with anything
> (including Errors), they never reach the JVM, hence no handler.
>
> Not much we can do short of not using Jetty?
>
> That's a pain, I'd just written a nice OOM handler too!
>
>
> On 26 June 2013 20:37, Timothy Potter  wrote:
>
>> A little more to this ...
>>
>> Just on chance this was a weird Jetty issue or something, I tried with
>> the latest 9 and the problem still occurs :-(
>>
>> This is on Java 7 on debian:
>>
>> java version "1.7.0_21"
>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>
>> Here is an example stack trace from the log
>>
>> 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR
>> solr.servlet.SolrDispatchFilter Q:22 -
>> null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
>> space
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>> at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>> at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>> at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>> at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>> at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>> at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>> at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>> at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>> at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>> at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>> at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>> at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>> at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>> at
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>> at
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>> at
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>> at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>> at java.lang.Thread.run(Thread.java:722)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>
>> On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter 
>> wrote:
>> > Recently upgraded to 4.3.1 but this problem has persisted for a while
>> now ...
>> >
>> > I'm using the following configuration when starting Jetty:
>> >
>> > -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
>> >
>> > If an OOM is triggered during Solr web app initialization (such as by
>> > me lowering -Xmx to a value that is too low to initialize Solr with),
>> > then the script gets called and does what I expect!
>> >
>> > However, once the Solr webapp initializes and Solr is happily
>> > responding to updates and queries, the script doesn't actually get
>> > invoked when an OOM occurs! All I see is
>> > the following in the stdout/stderr log of my process:
>> >
>> > #
>> > # java.lang.OutOfMemoryError: Java heap space
>> > # -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
>> > #   Executing /bin/sh -c "/home/solr/oom_killer.sh 83 21358"...
>> >
>> > The oom_killer.sh script doesn't actually get called!
>> >
>> > So to recap, it works if an OOM occurs during initialization but once
>> > Solr is running, the OOM killer doesn't fire correctly. This leads me
>> > to believe my script is fine and there's something else going wrong.
>> > Here's the oom_killer.sh script (pretty basic):
>> >
>> > #!/bin/bash
>> > SOLR_PORT=$1

Re: Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Alexandre Rafalovitch
On Wed, Jun 26, 2013 at 4:43 PM, Guido Medina wrote:

> Never heard of embedded Solr server,


I guess that's the exciting part about Solr. Always more nuances to learn:
https://wiki.apache.org/solr/EmbeddedSolr :-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Configuring Solr to retrieve documents?

2013-06-26 Thread aspielman
Is it possible to configure Solr to automatically grab documents in a
specified directory, without having to use the post command?

I've not found any way to do this, though admittedly, I'm not terribly
experienced with config files of this type.

Thanks!



-
<| A.Spielman |>
"In theory there is no difference between theory and practice. In practice 
there is." - Chuck Reid
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuring-Solr-to-retrieve-documents-tp4073372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to get values of external file field(s) in Solr query?

2013-06-26 Thread Arun Rangarajan
Yonik,
Thanks, your answer works!


On Wed, Jun 26, 2013 at 2:07 PM, Yonik Seeley  wrote:

> On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan
>  wrote:
> >
> http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
> > says
> > this about external file fields:
> > "They can be used only for function queries or display".
> > I understand how to use them in function queries, but how do I retrieve
> the
> > values for display?
> >
> > If I want to fetch only the values of a single external file field for a
> > set of primary keys, I can do:
> > q=_val_:"EXT_FILE_FIELD"&fq=id:(doc1 doc2 doc3)&fl=id,score
> > For this query, the score is the value of the external file field.
> >
> > But how to get the values for docs that match some arbitrary query?
>
> Pseudo-fields allow you to retrieve the value for any arbitrary
> function per returned document.
> Should work here, but I haven't tried it.
>
> fl=id, score, field(EXT_FILE_FIELD)
>
> or you can alias it:
>
> fl=id, score, myfield:field(EXT_FILE_FIELD)
>
> -Yonik
> http://lucidworks.com
>


Re: How to get values of external file field(s) in Solr query?

2013-06-26 Thread Yonik Seeley
On Wed, Jun 26, 2013 at 4:02 PM, Arun Rangarajan
 wrote:
> http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
> says
> this about external file fields:
> "They can be used only for function queries or display".
> I understand how to use them in function queries, but how do I retrieve the
> values for display?
>
> If I want to fetch only the values of a single external file field for a
> set of primary keys, I can do:
> q=_val_:"EXT_FILE_FIELD"&fq=id:(doc1 doc2 doc3)&fl=id,score
> For this query, the score is the value of the external file field.
>
> But how to get the values for docs that match some arbitrary query?

Pseudo-fields allow you to retrieve the value for any arbitrary
function per returned document.
Should work here, but I haven't tried it.

fl=id, score, field(EXT_FILE_FIELD)

or you can alias it:

fl=id, score, myfield:field(EXT_FILE_FIELD)
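(A SolrJ version of the same request, as an untested sketch; the base URL, core
name, and query string are placeholders, and EXT_FILE_FIELD is whatever your
external file field is called.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExternalFieldQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        // Pseudo-field: return the external file field value per document, aliased as myfield
        query.setFields("id", "score", "myfield:field(EXT_FILE_FIELD)");
        QueryResponse response = server.query(query);
        System.out.println(response.getResults());
    }
}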

-Yonik
http://lucidworks.com


Re: How to get values of external file field(s) in Solr query?

2013-06-26 Thread Upayavira
The only way is using a frange (function range) query:

q={!frange l=0 u=10}my_external_field

Will pull out documents that have your external field with a value
between zero and 10.

Upayavira 

On Wed, Jun 26, 2013, at 09:02 PM, Arun Rangarajan wrote:
> http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
> says
> this about external file fields:
> "They can be used only for function queries or display".
> I understand how to use them in function queries, but how do I retrieve
> the
> values for display?
> 
> If I want to fetch only the values of a single external file field for a
> set of primary keys, I can do:
> q=_val_:"EXT_FILE_FIELD"&fq=id:(doc1 doc2 doc3)&fl=id,score
> For this query, the score is the value of the external file field.
> 
> But how to get the values for docs that match some arbitrary query? Is
> there a syntax trick that will work where the value of the ext file field
> does not affect the score of the main query, but I can still retrieve its
> value?
> 
> Also is it possible to retrieve the values of more than one external file
> field in a single query?


Re: Is it possible to search Solr with a longer query string?

2013-06-26 Thread Gary Young
Oh this is good!


On Wed, Jun 26, 2013 at 12:05 PM, Shawn Heisey  wrote:

> On 6/25/2013 6:15 PM, Jack Krupansky wrote:
> > Are you using Tomcat?
> >
> > See:
> > http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests
> >
> > Enabling Longer Query Requests
> >
> > If you try to submit too long a GET query to Solr, then Tomcat will
> > reject your HTTP request on the grounds that the HTTP header is too
> > large; symptoms may include an HTTP 400 Bad Request error or (if you
> > execute the query in a web browser) a blank browser window.
> >
> > If you need to enable longer queries, you can set the maxHttpHeaderSize
> > attribute on the HTTP Connector element in your server.xml file. The
> > default value is 4K. (See
> > http://tomcat.apache.org/tomcat-5.5-doc/config/http.html)
>
> Even better would be to force SolrJ to use a POST request.  In newer
> versions (4.1 and later) Solr sets the servlet container's POST buffer
> size and defaults it to 2MB.  In older versions, you'd have to adjust
> this in your servlet container config, but the default should be
> considerably larger than the header buffer used for GET requests.
>
> I thought that SolrJ used POST by default, but after looking at the
> code, it seems that I was wrong.  Here's how to send a POST query:
>
> response = server.query(query, METHOD.POST);
>
> The import required for this is:
>
> import org.apache.solr.client.solrj.SolrRequest.METHOD;
>
> Gary, if you can avoid it, you should not be creating a new
> HttpSolrServer object every time you make a query.  It is completely
> thread-safe, so create a singleton and use it for all queries against
> the medline core.
>
> Thanks,
> Shawn
>
>


Re: Parallel Import Process on same core. Solr 3.5

2013-06-26 Thread Shawn Heisey

On 6/26/2013 1:36 PM, Mike L. wrote:

Here's the scrubbed version of my DIH: http://apaste.info/6uGH

It contains everything I'm more or less doing...pretty straight forward.. One thing to note and I 
don't know if this is a bug or not, but the batchSize="-1" streaming feature doesn't seem 
to work, at least with informix jdbc drivers. I set the batchsize to "500", but have 
tested it with various numbers including 5000, 1. I'm aware that behind the scenes this should 
be just setting the fetchsize, but it's a bit puzzling why I don't see a difference regardless of 
what value I actually use. I was told by one of our DBA's that our value is set as a global DB 
param and can't be modified (which I haven't looked into afterward.)


Setting the batchSize to -1 causes DIH to set fetchSize to 
Integer.MIN_VALUE (around negative two billion), which seems to be a 
MySQL-specific hack to enable result streaming.  I've never heard of it 
working on any other JDBC driver.


Assuming that the Informix JDBC driver is actually honoring the 
fetchSize, setting batchSize in the DIH config should be enough.  If 
it's not, then it's a bug in the JDBC driver or possibly a server 
misconfiguration.
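(For reference, a rough untested sketch of what DIH does under the covers with
plain JDBC: batchSize in the config roughly becomes the statement's fetchSize,
and -1 becomes Integer.MIN_VALUE, the MySQL-specific streaming hint. The JDBC
URL, credentials, and query below are made up.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        // batchSize="500" in the DIH config corresponds roughly to:
        stmt.setFetchSize(500);
        // batchSize="-1" corresponds to the MySQL streaming hack:
        // stmt.setFetchSize(Integer.MIN_VALUE);
        ResultSet rs = stmt.executeQuery("SELECT id, title FROM docs");
        while (rs.next()) {
            System.out.println(rs.getString("id"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}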



As far as HEAP patterns go, I watch the process via WILY and notice GC occurs 
every 15 minutes or so, but it becomes infrequent and not as significant as the 
previous one. It's almost as if some memory is never released until it 
eventually catches up to the max heap size.

I did assume that perhaps there could have been some locking issues, which is 
why I made the following modifications:

readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"


I can't really comment here.  It does appear that the Informix JDBC 
driver is not something you can download from IBM's website without 
paying them money.  I would suggest going to IBM (or an informix-related 
support avenue) for some help, ESPECIALLY if you've paid money for it.



What do you recommend for the mergeFactor, ramBufferSize and autoCommit options? 
My general understanding is the higher the mergeFactor, the less frequent the 
merges, which should improve indexing time but slow down query response time. I 
also read somewhere that an increase in the ramBufferSize should help prevent 
frequent merges... but I'm confused why I didn't really see an improvement; perhaps 
my combination of these values wasn't right in relation to my total fetch size.


Of these, ramBufferSizeMB is the only one that should have a 
*significant* effect on RAM usage, and at a value of 100, I would not 
expect there to be a major issue unless you are doing a lot of imports 
at the same time.


Because you are using Solr 3.5, if you do not need your import results 
to be visible until the end, I wouldn't worry about using autoCommit. 
If you were using Solr 4.x, I would recommend that you turn autoCommit 
on, but with openSearcher set to false.



Also- my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e 
the defaults) the better on memory management, but cost on index time as you 
pay for the overhead of committing. That is a number I've been experimenting 
with as well and have seen some variations in heap trends but unfortunately, 
have not completed the job quite yet with any config... I did get very close.. 
I'd hate to throw additional memory at the problem if there is something else I 
can tweak..


General impressions:  Unless the amount of data involved in each Solr 
document is absolutely enormous, this is very likely bugs (memory leaks 
or fetchSize problems) in the Informix JDBC driver.  I did find the 
following page, but it's REALLY REALLY old, which hopefully means that 
it doesn't apply.


http://www-01.ibm.com/support/docview.wss?uid=swg21260832

If your documents ARE huge, then you probably need to give more memory 
to the java heap ... but you might still have memory leak bugs in the 
JDBC driver.


When it comes to Java and Lucene/Solr, IBM has a *terrible* track 
record, especially for people using the IBM Java VM.  I would not be 
surprised if their JDBC driver is plagued by similar problems.  If you 
do find a support resource and they tell you that you should change your 
JDBC code to work differently, then you need to tell them that you can't 
change the JDBC code and that they need to give you a configuration URL 
workaround.


Here's another possibility of a bug that causes memory leaks:

http://www-01.ibm.com/support/docview.wss?uid=swg1IC58469

You might ask whether the problem could be a memory leak in Solr.  It's 
always possible, but I've had a lot of experience with DIH from MySQL on 
Solr 1.4.0, 1.4.1, 3.2.0, 3.5.0, and 4.2.1.  I've never seen any signs 
of a leak.


Thanks,
Shawn



Re: Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Guido Medina
Never heard of embedded Solr server; isn't it better to just use Lucene 
alone for that purpose? Using a helper like Hibernate? Since most 
applications that require indexes will have a relational DB behind the 
scenes, it would not be a bad idea to use an ORM combined with Lucene 
annotations (aka hibernate-search).


Guido.

On 26/06/13 20:30, Alexandre Rafalovitch wrote:

Yes, it is possible by running an embedded Solr inside SolrJ process.
The nice thing is that the index is portable, so you can then access
it from the standalone Solr server later.

I have an example here:
https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj
, which shows SolrJ running both as a client and with an embedded
container. Notice that you will probably need more jars than you
expect for the standalone Solr to work, including a number of servlet
jars.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Jun 26, 2013 at 2:59 PM, Learner  wrote:

I currently have a SOLRJ program which I am using for indexing the data in
SOLR. I am trying to figure out a way to build an index without depending on
a running instance of SOLR. I should be able to supply the solrconfig and
schema.xml to the indexing program, which in turn creates index files that I
can use with any SOLR instance. Is it possible to implement this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: OOM killer script woes

2013-06-26 Thread Daniel Collins
Ooh, I guess Jetty is trapping that java.lang.OutOfMemoryError, and
throwing it/packaging it as a java.lang.RuntimeException.  The -XX option
assumes that the application doesn't handle the Errors and so they would
reach the JVM and thus invoke the handler.
Since Jetty has an exception handler that deals with anything
(including Errors), they never reach the JVM, hence no handler.

Not much we can do short of not using Jetty?

That's a pain, I'd just written a nice OOM handler too!


On 26 June 2013 20:37, Timothy Potter  wrote:

> A little more to this ...
>
> Just on chance this was a weird Jetty issue or something, I tried with
> the latest 9 and the problem still occurs :-(
>
> This is on Java 7 on debian:
>
> java version "1.7.0_21"
> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>
> Here is an example stack trace from the log
>
> 2013-06-26 19:31:33,801 [qtp632640515-62] ERROR
> solr.servlet.SolrDispatchFilter Q:22 -
> null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
> space
> at
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>
> On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter 
> wrote:
> > Recently upgraded to 4.3.1 but this problem has persisted for a while
> now ...
> >
> > I'm using the following configuration when starting Jetty:
> >
> > -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
> >
> > If an OOM is triggered during Solr web app initialization (such as by
> > me lowering -Xmx to a value that is too low to initialize Solr with),
> > then the script gets called and does what I expect!
> >
> > However, once the Solr webapp initializes and Solr is happily
> > responding to updates and queries, the script doesn't actually get
> > invoked when an OOM occurs! All I see is
> > the following in the stdout/stderr log of my process:
> >
> > #
> > # java.lang.OutOfMemoryError: Java heap space
> > # -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
> > #   Executing /bin/sh -c "/home/solr/oom_killer.sh 83 21358"...
> >
> > The oom_killer.sh script doesn't actually get called!
> >
> > So to recap, it works if an OOM occurs during initialization but once
> > Solr is running, the OOM killer doesn't fire correctly. This leads me
> > to believe my script is fine and there's something else going wrong.
> > Here's the oom_killer.sh script (pretty basic):
> >
> > #!/bin/bash
> > SOLR_PORT=$1
> > SOLR_PID=$2
> > NOW=$(date +"%Y%m%d_%H%M")
> > (
> > echo "Running OOM killer script for process $SOLR_PID for Solr on port
> > 89$SOLR_PORT"
> > kill -9 $SOLR_PID
> > echo "Killed process $SOLR_PID"
> > exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT &
> > echo "Restarted Solr on 89$SOLR_PORT after OOM"
> > ) | tee oom_killer-89$SOLR_PORT-$NOW.log
> >
> > Anyone see anything like this before? Suggestions on where to begin
> > tracking down this issue?
> >
> > Cheers,
> > Tim
>


Re: Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Guido Medina
AFAIK SolrJ is just the network client that connects to a Solr server 
using Java. Now, if you just need to index your data on your local HDD, 
you might want to step back to Lucene. I'm assuming you are using Java, 
so you could also annotate your POJOs with Lucene annotations; google 
hibernate-search, maybe that's what you are looking for.


HTH,

Guido.

On 26/06/13 19:59, Learner wrote:

I currently have a SOLRJ program which I am using for indexing the data in
SOLR. I am trying to figure out a way to build an index without depending on
a running instance of SOLR. I should be able to supply the solrconfig and
schema.xml to the indexing program, which in turn creates index files that I
can use with any SOLR instance. Is it possible to implement this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solr 4.2.1 - master taking long time to respond after tomcat restart

2013-06-26 Thread Arun Rangarajan
Thanks, Shawn & Jack. I will go with the wiki and use autoCommit with
openSearcher set to false.


On Wed, Jun 26, 2013 at 10:23 AM, Jack Krupansky wrote:

> You need to do occasional hard commits, otherwise the update log just
> grows and grows and gets replayed on each server start.
>
> -- Jack Krupansky
>
> -Original Message- From: Arun Rangarajan
> Sent: Wednesday, June 26, 2013 1:18 PM
> To: solr-user@lucene.apache.org
> Subject: Solr 4.2.1 - master taking long time to respond after tomcat
> restart
>
>
> Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates,
> we enabled updateLog and made the few unstored int and boolean fields as
> "stored". We have a single master and a single slave and all the queries go
> only to the slave. We make only max. 50 atomic update requests/hour to the
> master.
>
> Noticing that on restarting tomcat, the master Solr server takes several
> minutes to respond. This was not happening in 3.6.1. The slave is
> responding as quickly as before after restarting tomcat. Any ideas why only
> master would take this long?
>


How to get values of external file field(s) in Solr query?

2013-06-26 Thread Arun Rangarajan
http://docs.lucidworks.com/display/solr/Working+with+External+Files+and+Processes
says
this about external file fields:
"They can be used only for function queries or display".
I understand how to use them in function queries, but how do I retrieve the
values for display?

If I want to fetch only the values of a single external file field for a
set of primary keys, I can do:
q=_val_:"EXT_FILE_FIELD"&fq=id:(doc1 doc2 doc3)&fl=id,score
For this query, the score is the value of the external file field.

But how to get the values for docs that match some arbitrary query? Is
there a syntax trick that will work where the value of the ext file field
does not affect the score of the main query, but I can still retrieve its
value?

Also is it possible to retrieve the values of more than one external file
field in a single query?


Re: OOM killer script woes

2013-06-26 Thread Timothy Potter
A little more to this ...

Just on chance this was a weird Jetty issue or something, I tried with
the latest 9 and the problem still occurs :-(

This is on Java 7 on debian:

java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

Here is an example stack trace from the log

2013-06-26 19:31:33,801 [qtp632640515-62] ERROR
solr.servlet.SolrDispatchFilter Q:22 -
null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
space
at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:445)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.OutOfMemoryError: Java heap space

On Wed, Jun 26, 2013 at 12:27 PM, Timothy Potter  wrote:
> Recently upgraded to 4.3.1 but this problem has persisted for a while now ...
>
> I'm using the following configuration when starting Jetty:
>
> -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
>
> If an OOM is triggered during Solr web app initialization (such as by
> me lowering -Xmx to a value that is too low to initialize Solr with),
> then the script gets called and does what I expect!
>
> However, once the Solr webapp initializes and Solr is happily
> responding to updates and queries, the script doesn't actually get
> invoked when an OOM occurs! All I see is
> the following in the stdout/stderr log of my process:
>
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
> #   Executing /bin/sh -c "/home/solr/oom_killer.sh 83 21358"...
>
> The oom_killer.sh script doesn't actually get called!
>
> So to recap, it works if an OOM occurs during initialization but once
> Solr is running, the OOM killer doesn't fire correctly. This leads me
> to believe my script is fine and there's something else going wrong.
> Here's the oom_killer.sh script (pretty basic):
>
> #!/bin/bash
> SOLR_PORT=$1
> SOLR_PID=$2
> NOW=$(date +"%Y%m%d_%H%M")
> (
> echo "Running OOM killer script for process $SOLR_PID for Solr on port
> 89$SOLR_PORT"
> kill -9 $SOLR_PID
> echo "Killed process $SOLR_PID"
> exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT &
> echo "Restarted Solr on 89$SOLR_PORT after OOM"
> ) | tee oom_killer-89$SOLR_PORT-$NOW.log
>
> Anyone see anything like this before? Suggestions on where to begin
> tracking down this issue?
>
> Cheers,
> Tim


Re: Parallel Import Process on same core. Solr 3.5

2013-06-26 Thread Mike L.
Thanks for the response.
 
Here's the scrubbed version of my DIH: http://apaste.info/6uGH 
 
It contains everything I'm more or less doing...pretty straight forward.. One 
thing to note and I don't know if this is a bug or not, but the batchSize="-1" 
streaming feature doesn't seem to work, at least with informix jdbc drivers. I 
set the batchsize to "500", but have tested it with various numbers including 
5000, 1. I'm aware that behind the scenes this should be just setting the 
fetchsize, but it's a bit puzzling why I don't see a difference regardless of 
what value I actually use. I was told by one of our DBA's that our value is set 
as a global DB param and can't be modified (which I haven't looked into 
afterward.)
 
As far as HEAP patterns go, I watch the process via WILY and notice GC occurs 
every 15 minutes or so, but it becomes infrequent and not as significant as the 
previous one. It's almost as if some memory is never released until it 
eventually catches up to the max heap size.
 
I did assume that perhaps there could have been some locking issues, which is 
why I made the following modifications:
 
readOnly="true" transactionIsolation="TRANSACTION_READ_UNCOMMITTED"
 
What do you recommend for the mergeFactor, ramBufferSize and autoCommit options? 
My general understanding is the higher the mergeFactor, the less frequent the 
merges, which should improve indexing time but slow down query response time. I 
also read somewhere that an increase in the ramBufferSize should help prevent 
frequent merges... but I'm confused why I didn't really see an improvement; perhaps 
my combination of these values wasn't right in relation to my total fetch size.
 
Also- my impression is the lower the autoCommit maxDocs/maxTime numbers (i.e 
the defaults) the better on memory management, but cost on index time as you 
pay for the overhead of committing. That is a number I've been experimenting 
with as well and have seen some variations in heap trends but unfortunately, 
have not completed the job quite yet with any config... I did get very close.. 
I'd hate to throw additional memory at the problem if there is something else I 
can tweak.. 
 
Thanks!
Mike
 

From: Shawn Heisey 
To: solr-user@lucene.apache.org 
Sent: Wednesday, June 26, 2013 12:13 PM
Subject: Re: Parallel Import Process on same core. Solr 3.5


On 6/26/2013 10:58 AM, Mike L. wrote:
>  
> Hello,
>  
>        I'm trying to execute a parallel DIH process and running into heap 
>related issues, hoping somebody has experienced this and can recommend some 
>options..
>  
>        Using Solr 3.5 on CentOS.
>        Currently have JVM heap 4GB min , 8GB max
>  
>      When executing the entities in a sequential process (entities executing 
>in sequence by default), my heap never exceeds 3GB. When executing the 
>parallel process, everything runs fine for roughly an hour, then I reach the 
>8GB max heap size and the process stalls/fails.
>  
>      More specifically, here's how I'm executing the parallel import process: 
>I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
>VALUE') within my entity queries. And within Solrconfig.xml, I've created 
>corresponding data import handlers, one for each of these entities.
>  
> My total rows fetch/count is 9M records.
>  
> And when I initiate the import, I call each one, similar to the below 
> (obviously I've stripped out my server & naming conventions.
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
>  
>  
> I assume that when doing this, only the first import request needs to contain 
> the clean=true param. 
>  
> I've divided each import query to target roughly the same amount of data, and 
> in solrconfig, I've tried various things in hopes to reduce heap size.

Thanks for including some solrconfig snippets, but I think what we
really need is your DIH configuration(s).  Use a pastebin site and
choose the proper document type.  http://apaste.info/ is available and
the proper type there would be (X)HTML.  If you need to sanitize these
to remove host/user/pass, please replace the values with something else
rather than deleting them entirely.

With full-import, clean defaults to true, so including it doesn't change
anything.  What I would actually do is have clean=true on the first
import you run, then after waiting a few seconds to be sure it is
running, start the others with clean=false so that they don't do ANOTHER
clean.

I suspect that you might be running into JDBC driver behavior where the
entire result set is being buffered into RAM.

Thanks,
Shawn

Re: Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Alexandre Rafalovitch
Yes, it is possible by running an embedded Solr inside SolrJ process.
The nice thing is that the index is portable, so you can then access
it from the standalone Solr server later.

I have an example here:
https://github.com/arafalov/solr-indexing-book/tree/master/published/solrj
, which shows SolrJ running both as a client and with an embedded
container. Notice that you will probably need more jars than you
expect for the standalone Solr to work, including a number of servlet
jars.
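(The core of the embedded approach, as an untested sketch against the 4.x
EmbeddedSolrServer API; the exact CoreContainer bootstrap varies a bit across
4.x releases, and the solr home path and core name are placeholders.)

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical solr home containing solr.xml and the core's solrconfig.xml/schema.xml
        CoreContainer container = new CoreContainer("/path/to/solr-home");
        container.load();
        SolrServer server = new EmbeddedSolrServer(container, "collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        server.add(doc);
        server.commit();

        // The index written under the solr home can later be served by a standalone Solr
        container.shutdown();
    }
}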

Regards,
  Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Jun 26, 2013 at 2:59 PM, Learner  wrote:
> I currently have a SOLRJ program which I am using for indexing the data in
> SOLR. I am trying to figure out a way to build an index without depending on
> a running instance of SOLR. I should be able to supply the solrconfig and
> schema.xml to the indexing program, which in turn creates index files that I
> can use with any SOLR instance. Is it possible to implement this?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr document auto-upload?

2013-06-26 Thread Jack Krupansky

Take a look at LucidWorks Search for automated crawler scheduling:
http://docs.lucidworks.com/display/help/Create+or+Edit+a+Schedule
http://docs.lucidworks.com/display/lweug/Data+Source+Schedules

ManifoldCF also has crawler job scheduling:
http://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html

I think the general idea on Unix is that cron is the obvious way to schedule 
periodic operations.


You could certainly do a custom request handler that initializes with a 
thread on a timer and initiates custom directory crawling of your own.


But there is no such feature directly implemented in Solr.
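(A client-side alternative, as a rough untested sketch: a small SolrJ program,
scheduled outside Solr, that pushes any files found in a drop directory to the
extracting request handler. The directory, base URL, and interval are
hypothetical.)

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class DirectoryUploader {
    public static void main(String[] args) {
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        final File dropDir = new File("/data/solr-drop");  // hypothetical watched directory

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable() {
            public void run() {
                File[] files = dropDir.listFiles();
                if (files == null) return;
                for (File f : files) {
                    try {
                        // Send each file through the /update/extract (Solr Cell) handler
                        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                        req.addFile(f, "application/octet-stream");
                        req.setParam("literal.id", f.getName());
                        server.request(req);
                        f.delete();  // avoid re-posting the same file on the next run
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
                try {
                    server.commit();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}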

-- Jack Krupansky

-Original Message- 
From: aspielman

Sent: Wednesday, June 26, 2013 2:16 PM
To: solr-user@lucene.apache.org
Subject: Solr document auto-upload?

Is it possible to to configure Solr to automatically grab documents in a
specidfied directory, with having to use the post command?

I've not found any way to do this, though admittedly, I'm not terribly
experienced with config files of this type.

Thanks!



-
<| A.Spielman |>
"In theory there is no difference between theory and practice. In practice 
there is." - Chuck Reid

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Is there a way to build indexes using SOLRJ without SOLR instance?

2013-06-26 Thread Learner
I currently have a SOLRJ program which I am using for indexing the data in
SOLR. I am trying to figure out a way to build an index without depending on
a running instance of SOLR. I should be able to supply the solrconfig and
schema.xml to the indexing program, which in turn creates index files that I
can use with any SOLR instance. Is it possible to implement this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-build-indexes-using-SOLRJ-without-SOLR-instance-tp4073383.html
Sent from the Solr - User mailing list archive at Nabble.com.


OOM killer script woes

2013-06-26 Thread Timothy Potter
Recently upgraded to 4.3.1 but this problem has persisted for a while now ...

I'm using the following configuration when starting Jetty:

-XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"

If an OOM is triggered during Solr web app initialization (such as by
me lowering -Xmx to a value that is too low to initialize Solr with),
then the script gets called and does what I expect!

However, once the Solr webapp initializes and Solr is happily
responding to updates and queries, the script doesn't actually get
invoked when an OOM occurs! All I see is
the following in the stdout/stderr log of my process:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/home/solr/oom_killer.sh 83 %p"
#   Executing /bin/sh -c "/home/solr/oom_killer.sh 83 21358"...

The oom_killer.sh script doesn't actually get called!

So to recap, it works if an OOM occurs during initialization but once
Solr is running, the OOM killer doesn't fire correctly. This leads me
to believe my script is fine and there's something else going wrong.
Here's the oom_killer.sh script (pretty basic):

#!/bin/bash
SOLR_PORT=$1
SOLR_PID=$2
NOW=$(date +"%Y%m%d_%H%M")
(
echo "Running OOM killer script for process $SOLR_PID for Solr on port
89$SOLR_PORT"
kill -9 $SOLR_PID
echo "Killed process $SOLR_PID"
exec /home/solr/solr-dg/dg-solr.sh recover $SOLR_PORT &
echo "Restarted Solr on 89$SOLR_PORT after OOM"
) | tee oom_killer-89$SOLR_PORT-$NOW.log

Anyone see anything like this before? Suggestions on where to begin
tracking down this issue?

Cheers,
Tim


Re: Need help with indexing names in a pdf

2013-06-26 Thread Walter Underwood
This kind of text processing is called entity extraction. I'm not up to date on 
what is available in Solr, but search on that.

wunder

On Jun 26, 2013, at 10:26 AM, Warren H. Prince wrote:

>   We receive about 100 documents a day of various sizes.  The documents 
> could pertain to any of 40,000 contacts stored in our database, and could 
> include more than one.   For each file we have, we maintain a list of 
> contacts that are related to or involved in that file.  I know it will never 
> be exact, but I'd like to index possible names in the text, and then attempt 
> to identify which files the document might pertain to, by looking at files 
> that are tied to contacts contained in the document.
> 
> I've found some regex code to parse names from the text, but does anyone have 
> any ideas on how to set up the index?  There are currently approximately 
> 900,000 documents in our library.
> 
> --Warren






Solr document auto-upload?

2013-06-26 Thread aspielman
Is it possible to to configure Solr to automatically grab documents in a
specidfied directory, with having to use the post command? 

I've not found any way to do this, though admittedly, I'm not terribly
experienced with config files of this type. 

Thanks!



-
<| A.Spielman |>
"In theory there is no difference between theory and practice. In practice 
there is." - Chuck Reid
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-document-auto-upload-tp4073373.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Querying multiple collections in SolrCloud

2013-06-26 Thread Chris Toomey
Thanks Erick, that's a very helpful answer.

Regarding the grouping option, does that require all the docs to be put
into a single collection, or could it be done with across N collections
(assuming each collection had a common "type" field for grouping on)?

Chris


On Wed, Jun 26, 2013 at 7:01 AM, Erick Erickson wrote:

> bq: Would the above setup qualify as "multiple compatible collections"
>
> No. While there may be enough fields in common to form a single query,
> the TF/IDF calculations will not be "compatible" and the scores from the
> various collections will NOT be comparable. So simply getting the list of
> top N docs will probably be dominated by the docs from a single type.
>
> bq: How does SolrCloud combine the query results from multiple collections?
>
> It doesn't. SolrCloud sorts the results from multiple nodes in the
> _same_ collection
> according to whatever sort criteria are specified, defaulting to score.
> Say you
> ask for the top 20 docs. A node from each shard returns the top 20 docs
> for that
> shard. The node processing them just merges all the returned lists and
> only keeps
> the top 20.
>
> I don't think your last two questions are really relevant, SolrCloud
> isn't built to
> query multiple collections and return the results coherently.
>
> The root problem here is that you're trying to compare docs from
> different collections for "goodness" to return the top N. This isn't
> actually hard
> _except_ when "goodness" is the score, then it just doesn't work. You can't
> even compare scores from different queries on the _same_ collection, much
> less different ones. Consider two collections, books and songs. One
> consists
> of lots and lots of text and the term frequency and inverse doc freq
> (TF/IDF)
> will be hugely different than songs. Not to mention field length
> normalization.
>
> Now, all that aside there's an option. Index all the docs in a single
> collection and
> use grouping (aka field collapsing) to get a single response that has the
> top N
> docs from each type (they'll be in different sections of the original
> response) and present
> them to the user however makes sense. You'll get "hands on" experience in
> why this isn't something that's easy to do automatically if you try to
> sort these
> into a single list by relevance ...
>
> Best
> Erick
>
> On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey  wrote:
> > Thanks Jack for the alternatives.  The first is interesting but has the
> > downside of requiring multiple queries to get the full matching docs.
>  The
> > second is interesting and very simple, but has the downside of not being
> > modular and being difficult to configure field boosting when the
> > collections have overlapping field names with different boosts being
> needed
> > for the same field in different document types.
> >
> > I'd still like to know about the viability of my original approach though
> > too.
> >
> > Chris
> >
> >
> > On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky  >wrote:
> >
> >> One simple scenario to consider: N+1 collections - one collection per
> >> document type with detailed fields for that document type, and one
> common
> >> collection that indexes a subset of the fields. The main user query
> would
> >> be an edismax over the common fields in that "main" collection. You can
> >> then display summary results from the common collection. You can also
> then
> >> support "drill down" into the type-specific collection based on a "type"
> >> field for each document in the main collection.
> >>
> >> Or, sure, you actually CAN index multiple document types in the same
> >> collection - add all the fields to one schema - there is no time or
> space
> >> penalty if most of the field are empty for most documents.
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Chris Toomey
> >> Sent: Tuesday, June 25, 2013 6:08 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Querying multiple collections in SolrCloud
> >>
> >>
> >> Hi, I'm investigating using SolrCloud for querying documents of
> different
> >> but similar/related types, and have read through docs. on the wiki and
> done
> >> many searches in these archives, but still have some questions.  Thanks
> in
> >> advance for your help.
> >>
> >> Setup:
> >> * Say that I have N distinct types of documents and I want to do queries
> >> that return the best matches regardless document type.  I.e., something
> >> akin to a Google search where I'd like to get the best matches from the
> >> web, news, images, and maps.
> >>
> >> * Our main use case is supporting simple user-entered searches, which
> would
> >> just contain terms / phrases and wouldn't specify fields.
> >>
> >> * The document types will not all have the same fields, though there
> may be
> >> some overlap in the fields.
> >>
> >> * We plan to use a separate collection for each document type, and to
> use
> >> the eDisMax query parser.  Each collection would have a
> document-specific
> >> schema conf

Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads

2013-06-26 Thread Vinay Pothnis
Thank you Erick!

Will look at all these suggestions.

-Vinay


On Wed, Jun 26, 2013 at 6:37 AM, Erick Erickson wrote:

> Right, unfortunately this is a gremlin lurking in the weeds, see:
> http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock
>
> There are a couple of ways to deal with this:
> 1> go ahead and up the limit and re-compile, if you look at
> SolrCmdDistributor the semaphore is defined there.
>
> 2> https://issues.apache.org/jira/browse/SOLR-4816 should
> address this as well as improve indexing throughput. I'm totally sure
> Joel (the guy working on this) would be thrilled if you were able to
> verify that these two points, I'd ask him (on the JIRA) whether he thinks
> it's ready to test.
>
> 3> Reduce the number of threads you're indexing with
>
> 4> index docs in small packets, perhaps even one and just rack
> together a zillion threads to get throughput.
>
> FWIW,
> Erick
>
> On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis  wrote:
> > Jason and Scott,
> >
> > Thanks for the replies and pointers!
> > Yes, I will consider the 'maxDocs' value as well. How do i monitor the
> > transaction logs during the interval between commits?
> >
> > Thanks
> > Vinay
> >
> >
> > On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman <
> > jhell...@innoventsolutions.com> wrote:
> >
> >> Scott,
> >>
> >> My comment was meant to be a bit tongue-in-cheek, but my intent in the
> >> statement was to represent hard failure along the lines Vinay is seeing.
> >>  We're talking about OutOfMemoryException conditions, total cluster
> >> paralysis requiring restart, or other similar and disastrous conditions.
> >>
> >> Where that line is is impossible to generically define, but trivial to
> >> accomplish.  What any of us running Solr has to achieve is a realistic
> >> simulation of our desired production load (probably well above peak)
> and to
> >> see what limits are reached.  Armed with that information we tweak.  In
> >> this case, we look at finding the point where data ingestion reaches a
> >> natural limit.  For some that may be JVM GC, for others memory buffer
> size
> >> on the client load, and yet others it may be I/O limits on multithreaded
> >> reads from a database or file system.
> >>
> >> In old Solr days we had a little less to worry about.  We might play
> with
> >> a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial
> >> commits and rollback recoveries.  But with 4.x we now have more durable
> >> write options and NRT to consider, and SolrCloud begs to use this.  So
> we
> >> have to consider transaction logs, the file handles they leave open
> until
> >> commit operations occur, and how we want to manage writing to all cores
> >> simultaneously instead of a more narrow master/slave relationship.
> >>
> >> It's all manageable, all predictable (with some load testing) and all
> >> filled with many possibilities to meet our specific needs.  Considering that
> >> each person's data model, ingestion pipeline, request processors, and
> field
> >> analysis steps will be different, 5 threads of input at face value
> doesn't
> >> really contemplate the whole problem.  We have to measure our actual
> data
> >> against our expectations and find where the weak chain links are to
> >> strengthen them.  The symptoms aren't necessarily predictable in
> advance of
> >> this testing, but they're likely addressable and not difficult to
> decipher.
> >>
> >> For what it's worth, SolrCloud is new enough that we're still
> experiencing
> >> some "uncharted territory with unknown ramifications" but with continued
> >> dialog through channels like these there are fewer territories without
> good
> >> cartography :)
> >>
> >> Hope that's of use!
> >>
> >> Jason
> >>
> >>
> >>
> >> On Jun 24, 2013, at 7:12 PM, Scott Lundgren <
> >> scott.lundg...@carbonblack.com> wrote:
> >>
> >> > Jason,
> >> >
> >> > Regarding your statement "push you over the edge"- what does that
> mean?
> >> > Does it mean "uncharted territory with unknown ramifications" or
> >> something
> >> > more like specific, known symptoms?
> >> >
> >> > I ask because our use is similar to Vinay's in some respects, and we
> want
> >> > to be able to push the capabilities of write perf - but not over the
> >> edge!
> >> > In particular, I am interested in knowing the symptoms of failure, to
> >> help
> >> > us troubleshoot the underlying problems if and when they arise.
> >> >
> >> > Thanks,
> >> >
> >> > Scott
> >> >
> >> > On Monday, June 24, 2013, Jason Hellman wrote:
> >> >
> >> >> Vinay,
> >> >>
> >> >> You may wish to pay attention to how many transaction logs are being
> >> >> created along the way to your hard autoCommit, which should truncate
> the
> >> >> open handles for those files.  I might suggest setting a maxDocs
> value
> >> in
> >> >> parallel with your maxTime value (you can use both) to ensure the
> commit
> >> >> occurs at either breakpoint.  30 seconds is plenty of time for 5
> >> parallel
> >> >> processes of 20 doc

Re: Solr 4.2.1 - master taking long time to respond after tomcat restart

2013-06-26 Thread Jack Krupansky
You need to do occasional hard commits, otherwise the update log just grows 
and grows and gets replayed on each server start.
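
For example, something along these lines in solrconfig.xml keeps the log small
(the numbers are only illustrative, tune them to your update rate):

<autoCommit>
  <maxTime>300000</maxTime>            <!-- hard commit at least every 5 minutes -->
  <maxDocs>25000</maxDocs>             <!-- ...or every 25k docs, whichever comes first -->
  <openSearcher>false</openSearcher>   <!-- durability only; don't open a new searcher -->
</autoCommit>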


-- Jack Krupansky

-Original Message- 
From: Arun Rangarajan

Sent: Wednesday, June 26, 2013 1:18 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.2.1 - master taking long time to respond after tomcat 
restart


Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates,
we enabled updateLog and made the few unstored int and boolean fields as
"stored". We have a single master and a single slave and all the queries go
only to the slave. We make only max. 50 atomic update requests/hour to the
master.

Noticing that on restarting tomcat, the master Solr server takes several
minutes to respond. This was not happening in 3.6.1. The slave is
responding as quickly as before after restarting tomcat. Any ideas why only
master would take this long? 



Need help with indexing names in a pdf

2013-06-26 Thread Warren H. Prince
We receive about 100 documents a day of various sizes.  The documents 
could pertain to any of 40,000 contacts stored in our database, and could 
include more than one.   For each file we have, we maintain a list of contacts 
that are related to or involved in that file.  I know it will never be exact, 
but I'd like to index possible names in the text, and then attempt to identify 
which files the document might pertain to, looking at files that are tied to 
contacts contained in the document.

I've found some regex code to parse names from the text, but does anyone have 
any ideas on how to set up the index.  There are currently approximately 
900,000 documents in our library.

--Warren

Re: Solr 4.2.1 - master taking long time to respond after tomcat restart

2013-06-26 Thread Shawn Heisey
On 6/26/2013 11:18 AM, Arun Rangarajan wrote:
> Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates,
> we enabled updateLog and made the few unstored int and boolean fields as
> "stored". We have a single master and a single slave and all the queries go
> only to the slave. We make only max. 50 atomic update requests/hour to the
> master.
> 
> Noticing that on restarting tomcat, the master Solr server takes several
> minutes to respond. This was not happening in 3.6.1. The slave is
> responding as quickly as before after restarting tomcat. Any ideas why only
> master would take this long?

Classic problem after enabling the updateLog:

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup

Thanks,
Shawn



Solr 4.2.1 - master taking long time to respond after tomcat restart

2013-06-26 Thread Arun Rangarajan
Upgraded from Solr 3.6.1 to 4.2.1. Since we wanted to use atomic updates,
we enabled updateLog and made the few unstored int and boolean fields as
"stored". We have a single master and a single slave and all the queries go
only to the slave. We make only max. 50 atomic update requests/hour to the
master.

Noticing that on restarting tomcat, the master Solr server takes several
minutes to respond. This was not happening in 3.6.1. The slave is
responding as quickly as before after restarting tomcat. Any ideas why only
master would take this long?


Re: Parallal Import Process on same core. Solr 3.5

2013-06-26 Thread Shawn Heisey
On 6/26/2013 10:58 AM, Mike L. wrote:
>  
> Hello,
>  
>I'm trying to execute a parallel DIH process and running into heap 
> related issues, hoping somebody has experienced this and can recommend some 
> options..
>  
>Using Solr 3.5 on CentOS.
>Currently have JVM heap 4GB min , 8GB max
>  
>  When executing the entities in a sequential process (entities executing 
> in sequence by default), my heap never exceeds 3GB. When executing the 
> parallel process, everything runs fine for roughly an hour, then I reach the 
> 8GB max heap size and the process stalls/fails.
>  
>  More specifically, here's how I'm executing the parallel import process: 
> I target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
> VALUE') within my entity queries. And within Solrconfig.xml, I've created 
> corresponding data import handlers, one for each of these entities.
>  
> My total rows fetch/count is 9M records.
>  
> And when I initiate the import, I call each one, similar to the below 
> (obviously I've stripped out my server & naming conventions.
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
>  
> http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
>  
>  
> I assume that when doing this, only the first import request needs to contain 
> the clean=true param. 
>  
> I've divided each import query to target roughly the same amount of data, and 
> in solrconfig, I've tried various things in hopes to reduce heap size.

Thanks for including some solrconfig snippets, but I think what we
really need is your DIH configuration(s).  Use a pastebin site and
choose the proper document type.  http://apaste.info is available and
the proper type there would be (X)HTML.  If you need to sanitize these
to remove host/user/pass, please replace the values with something else
rather than deleting them entirely.

With full-import, clean defaults to true, so including it doesn't change
anything.  What I would actually do is have clean=true on the first
import you run, then after waiting a few seconds to be sure it is
running, start the others with clean=false so that they don't do ANOTHER
clean.
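
In other words, re-using your own URLs from above, something like this (kick off
the second and any further handlers only after the first one is running):

http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]&clean=false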

I suspect that you might be running into JDBC driver behavior where the
entire result set is being buffered into RAM.
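
If that is what's happening and the database happens to be MySQL (a pure guess on
my part, since I haven't seen your dataSource), the usual workaround is to ask the
driver to stream rows by setting batchSize="-1" on the JdbcDataSource, roughly:

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://dbhost/dbname"
            user="..." password="..."
            batchSize="-1"/>  <!-- -1 maps to Integer.MIN_VALUE fetch size, i.e. streaming -->

Other drivers have their own fetch-size knobs, so adjust accordingly.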

Thanks,
Shawn



Re: Parallal Import Process on same core. Solr 3.5

2013-06-26 Thread Michael Della Bitta
Hi Mike,

Have you considered trying something like jhat or visualvm to see what's
taking up room on the heap?

http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html
http://visualvm.java.net/


Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Wed, Jun 26, 2013 at 12:58 PM, Mike L.  wrote:

>
> Hello,
>
>I'm trying to execute a parallel DIH process and running into heap
> related issues, hoping somebody has experienced this and can recommend some
> options..
>
>Using Solr 3.5 on CentOS.
>Currently have JVM heap 4GB min , 8GB max
>
>  When executing the entities in a sequential process (entities
> executing in sequence by default), my heap never exceeds 3GB. When
> executing the parallel process, everything runs fine for roughly an hour,
> then I reach the 8GB max heap size and the process stalls/fails.
>
>  More specifically, here's how I'm executing the parallel import
> process: I target a logical range (i.e WHERE some field BETWEEN 'SOME
> VALUE' AND 'SOME VALUE') within my entity queries. And within
> Solrconfig.xml, I've created corresponding data import handlers, one for
> each of these entities.
>
> My total rows fetch/count is 9M records.
>
> And when I initiate the import, I call each one, similar to the below
> (obviously I've stripped out my server & naming conventions.
>
> http://
> [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
> http://
> [server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
>
> I assume that when doing this, only the first import request needs to
> contain the clean=true param.
>
> I've divided each import query to target roughly the same amount of data,
> and in solrconfig, I've tried various things in hopes to reduce heap size.
>
> Here's my current config:
>
>  false
> 15
> 100
> 2147483647
> 1
> 1000
> 1
> single
>   
>   
> false
> 100  
> 15
> 2147483647
> 1
> false
>   
>
>
> 
>
>   6 
>   25000 
> 
> 10
>  
>
>
> What gets tricky is finding the sweet spot with these parameters, but
> wondering if anybody has any recommendations for an optimal config. Also,
> regarding autoCommit, I've even turned that feature off, but my heap size
> reaches its max sooner. I am wondering though, what would be the difference
> with autoCommit and passing in the commit=true param on each import query.
>
> Thanks in advance!
> Mike


Parallal Import Process on same core. Solr 3.5

2013-06-26 Thread Mike L.
 
Hello,
 
   I'm trying to execute a parallel DIH process and running into heap 
related issues, hoping somebody has experienced this and can recommend some 
options..
 
   Using Solr 3.5 on CentOS.
   Currently have JVM heap 4GB min , 8GB max
 
 When executing the entities in a sequential process (entities executing in 
sequence by default), my heap never exceeds 3GB. When executing the parallel 
process, everything runs fine for roughly an hour, then I reach the 8GB max 
heap size and the process stalls/fails.
 
 More specifically, here's how I'm executing the parallel import process: I 
target a logical range (i.e WHERE some field BETWEEN 'SOME VALUE' AND 'SOME 
VALUE') within my entity queries. And within Solrconfig.xml, I've created 
corresponding data import handlers, one for each of these entities.
 
My total rows fetch/count is 9M records.
 
And when I initiate the import, I call each one, similar to the below 
(obviously I've stripped out my server & naming conventions.
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting1]&clean=true
 
http://[server]/[solrappname]/[corename]/[ImportHandlerName]?command=full-import&entity=[NameOfEntityTargetting2]
 
 
I assume that when doing this, only the first import request needs to contain 
the clean=true param. 
 
I've divided each import query to target roughly the same amount of data, and 
in solrconfig, I've tried various things in hopes to reduce heap size.
 
Here's my current config: 
 
 false
    15    
    100 
    2147483647
    1
    1000
    1
    single
  
  
    false
    100   
    15
    2147483647
    1
    false
  

 

   
  6  
  25000  
    
    10
 

 
What gets tricky is finding the sweet spot with these parameters, but wondering 
if anybody has any recommendations for an optimal config. Also, regarding 
autoCommit, I've even turned that feature off, but my heap size reaches its max 
sooner. I am wondering though, what would be the difference with autoCommit and 
passing in the commit=true param on each import query.
 
Thanks in advance!
Mike

MoreLikeThis handler and pivot facets

2013-06-26 Thread Achim Domma
Hi,

I have the following workflow, which works fine:

- User enters search text
- Text is sent to Solr as a query. Quite a bit of faceting is also included in the 
request.
- Result comes back and extensive facet information is displayed.

Now I want to allow my user to enter a whole reference text as search text. So 
I do the same as above, but send the text via POST to a MoreLikeThis handler. 
Therefore I add those additional parameters:

mlt.fl = 'text_field'
mlt.minwl = 1
mlt.maxqt = 20
mlt.minf = 0

and remove of course the q parameter. The rest of the request - i.e. the 
faceting parameters - are identical. But I do not get facets back. For my 
sample request, I can see that 499 documents were found, but all facets are 
just empty. And the facet_pivot key does not exist at all.

Is there any known issue with MLT + facets? I know that MLT + facets worked for 
me, but not yet when using pivot facets.

kind regards,
Achim

Re: Dynamic Type For Solr Schema

2013-06-26 Thread Shawn Heisey
On 6/26/2013 8:51 AM, Furkan KAMACI wrote:
> If I get a document that has a "lang" field holds "*tr*" I want that:
> 
> ...
> 
> 

Changing the TYPE of a field based on the contents of another field
isn't possible.  The language detection that has been mentioned in your
other replies makes it possible to direct different languages to
different fields, but won't change the type.

Solr is highly dependent on its schema.  The schema is necessarily
fairly static.  This is changing to some degree with the schema REST API
in newer versions, but even with that, types aren't dynamic.  If you
change them, you have to reindex.  Making them dynamic would require a
major rewrite of Solr internals, and it's very likely that nobody would
be able to agree on the criteria used to choose a type.

What you are trying to do could be done by writing a custom Lucene
application, because Lucene has no schema.  Field types are determined
by whatever code you write yourself.  The problem with this approach is
that you have to write ALL the server code, something that you get for
free with Solr.  It would not be a trivial task.

Thanks,
Shawn



Re: Is it possible to searh Solr with a longer query string?

2013-06-26 Thread Shawn Heisey
On 6/25/2013 6:15 PM, Jack Krupansky wrote:
> Are you using Tomcat?
> 
> See:
> http://wiki.apache.org/solr/SolrTomcat#Enabling_Longer_Query_Requests
> 
> Enabling Longer Query Requests
> 
> If you try to submit too long a GET query to Solr, then Tomcat will
> reject your HTTP request on the grounds that the HTTP header is too
> large; symptoms may include an HTTP 400 Bad Request error or (if you
> execute the query in a web browser) a blank browser window.
> 
> If you need to enable longer queries, you can set the maxHttpHeaderSize
> attribute on the HTTP Connector element in your server.xml file. The
> default value is 4K. (See
> http://tomcat.apache.org/tomcat-5.5-doc/config/http.html)

Even better would be to force SolrJ to use a POST request.  In newer
versions (4.1 and later) Solr sets the servlet container's POST buffer
size and defaults it to 2MB.  In older versions, you'd have to adjust
this in your servlet container config, but the default should be
considerably larger than the header buffer used for GET requests.

I thought that SolrJ used POST by default, but after looking at the
code, it seems that I was wrong.  Here's how to send a POST query:

response = server.query(query, METHOD.POST);

The import required for this is:

import org.apache.solr.client.solrj.SolrRequest.METHOD;

Gary, if you can avoid it, you should not be creating a new
HttpSolrServer object every time you make a query.  It is completely
thread-safe, so create a singleton and use it for all queries against
the medline core.
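
A minimal sketch of what I mean (the class name and URL here are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MedlineSearcher {
  // One HttpSolrServer per base URL, shared by all threads.
  private static final HttpSolrServer SERVER =
      new HttpSolrServer("http://localhost:8983/solr/medline");

  public static QueryResponse search(String queryText) throws Exception {
    SolrQuery query = new SolrQuery(queryText);
    // POST keeps a long query out of the HTTP request line and headers.
    return SERVER.query(query, METHOD.POST);
  }
}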

Thanks,
Shawn



Re: Dynamic Type For Solr Schema

2013-06-26 Thread Alexandre Rafalovitch
On Wed, Jun 26, 2013 at 11:46 AM, Jack Krupansky
 wrote:
> But there are also built-in "language identifier" update processors that can
> simultaneously identify what language is used in the input value for a field
> AND do the redirection to a language-specific field AND store the language
> code.

I have an example of using this as well (for English/Russian):
https://github.com/arafalov/solr-indexing-book/tree/master/published/languages
. This includes the collection data files, so you can see the end
result and play with it. The instructions on how to recreate this and
explanation behind routing and field aliases setup are in my book :
http://blog.outerthoughts.com/2013/06/my-book-on-solr-is-now-published/
:-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: Dynamic Type For Solr Schema

2013-06-26 Thread Jack Krupansky
You can certainly do redirection of input values in an update processor, 
even in a JavaScript script.


But there are also built-in "language identifier" update processors that can 
simultaneously identify what language is used in the input value for a field 
AND do the redirection to a language-specific field AND store the language 
code.


See:
LangDetectLanguageIdentifierUpdateProcessorFactory
TikaLanguageIdentifierUpdateProcessorFactory
http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessorFactory.html
http://lucene.apache.org/solr/4_3_0/solr-langid/org/apache/solr/update/processor/TikaLanguageIdentifierUpdateProcessorFactory.html
http://wiki.apache.org/solr/LanguageDetection

The non-Tika version may be better, depending on the nature of your input.
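
To give a rough idea of the shape of the config, a chain along these lines in
solrconfig.xml (the chain name and field names are placeholders, and the
solr-langid contrib jars must be on the classpath):

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text</str>          <!-- field(s) to run detection on -->
    <str name="langid.langField">lang</str>   <!-- where the detected code is stored -->
    <bool name="langid.map">true</bool>       <!-- redirect "text" to "text_en", "text_tr", ... -->
    <str name="langid.whitelist">en,tr</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Reference it from your update handler (or pass update.chain=langid on the update
request) and the redirection happens at index time.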

Neither processor is in the new Apache Solr Reference Guide nor in the current 
release from Lucid, but see the detailed examples in my book.


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Wednesday, June 26, 2013 10:51 AM
To: solr-user@lucene.apache.org
Subject: Dynamic Type For Solr Schema

I use Solr 4.3.1 as SolrCloud. I know that I can define analyzer at
schema.xml. Let's assume that I have specialized my analyzer for Turkish.
However I want to have another analzyer too, i.e. for English. I have that
fields at my schema:
...


...

I have a field type as text_tr that is combined for Turkish. I have another
field type as text_en that is combined for English. I have another field
at my schema as lang. lang holds the language of document as "en" or "tr".

If I get a document that has a "lang" field holds "*tr*" I want that:

...


...

If I get a document that has a "lang" field holds "*en*" I want that:

...


...

I want dynamic types just for that fields other will be same. How can I do
that properly at Solr? (UpdateRequestProcessor, ...?) 



Re: Need Help in migrating Solr version 1.4 to 4.3

2013-06-26 Thread Shawn Heisey
On 6/25/2013 11:52 PM, Sandeep Gupta wrote:
> Also in application development side,
> as I said that I am going to use HTTPSolrServer API and I found that we
> shouldn't create this object multiple times
> (as per the wiki document http://wiki.apache.org/solr/Solrj#HttpSolrServer)
> So I am planning to have my Server class as singleton.
>  Please advice little bit in this front also.

This is always the way that SolrServer objects are intended to be used,
including CommonsHttpSolrServer in version 1.4.  The only major
difference between the two objects is that the new one uses
HttpComponents 4.x and the old one uses HttpClient 3.x.  There are other
differences, but they are just the result of incremental improvements
from version to version.

Thanks,
Shawn



RE: StatsComponent doesn't work if field's type is TextField - can I change field's type to String

2013-06-26 Thread Elran Dvir
Erick, thanks for the response.

I think the stats component works with strings.

In StatsValuesFactory, I see the following code:

public static StatsValues createStatsValues(SchemaField sf) {
  ...
  else if (StrField.class.isInstance(fieldType)) {
    return new StringStatsValues(sf);
  }
}

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, June 26, 2013 5:30 PM
To: solr-user@lucene.apache.org
Subject: Re: StatsComponent doesn't work if field's type is TextField - can I 
change field's type to String

>From the stats component page:

"The stats component returns simple statistics for indexed numeric fields 
within the DocSet"

So string, text, anything non-numeric won't work. You can declare it 
multiValued but then you have to add multiple values for the field when you 
send the doc to Solr or implement a custom update component to break them up. 
At least there's no filter that I know of that takes a delimited set of numbers 
and transforms them.

FWIW,
Erick

On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir  wrote:
> Hi all,
>
> StatsComponent doesn't work if field's type is TextField.
> I get the following message:
> "Field type 
> textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.
> solr.analysis.TokenizerChain,args={positionIncrementGap=100,
> sortMissingLast=true}} is not currently supported".
>
> My field configuration is:
>
>  sortMissingLast="true">
> 
>  />
> 
> 
>
>  multiValued="true"/>
>
> So, the reason my field is of type TextField is that in the document indexed 
> there may be multiple values in the field separated by new lines.
> The tokenizer is splitting it to multiple values and the field is indexed as 
> multi-valued field.
>
> Is there a way I can define the field as regular String field? Or a way to 
> make StatsComponent work with TextField?
>
> Thank you very much.

Email secured by Check Point


Re: URL search and indexing

2013-06-26 Thread Flavio Pompermaier
Obviously I messed up the email thread... however I found a problem
indexing my document via post.sh.
This is basically my schema.xml:


 
   
   
   
 
 url
  


 


and this is the document I tried to upload via post.sh:



  http://test.example.org/first.html
  1000
  1000
  1000
  5000


  http://test.example.org/second.html
  1000
  5000



When playing with administration and debugging tools I discovered that
searching for q=itemid:5000 gave me the same score for those docs, while I
was expecting different term frequencies between the first and the second.
In fact, using Java to upload the documents led to correct results (3
occurrences of item 1000 in the first doc and 1 in the second), e.g.:
document1.addField("itemid", "1000");
document1.addField("itemid", "1000");
document1.addField("itemid", "1000");

Am I right or am I missing something else?


On Wed, Jun 26, 2013 at 5:18 PM, Jack Krupansky wrote:

> If there is a bug... we should identify it. What's a sample post command
> that you issued?
>
>
> -- Jack Krupansky
>
> -Original Message- From: Flavio Pompermaier
> Sent: Wednesday, June 26, 2013 10:53 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: URL search and indexing
>
> I was doing exactly that and, thanks to the administration page and
> explanation/debugging, I checked if results were those expected.
> Unfortunately, results were not correct when submitting updates through the post.sh
> script (which uses curl in the end).
> Probably, if it finds the same tag (same value for the same field-name),
> it will collapse them.
> Rewriting the same document in Java and submitting the updates made
> things work correctly.
>
> In my opinion this is a bug (of the entire process, though I don't know if
> this is a problem of curl or of the script itself).
>
> Best,
> Flavio
>
> On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson *
> *wrote:
>
>  Flavio:
>>
>> You mention that you're new to Solr, so I thought I'd make sure
>> you know that the admin/analysis page is your friend! I flat
>> guarantee that as you try to index/search following the suggestions
>> you'll scratch your head at your results and you'll discover that
>> the analysis process isn't doing quite what you expect. The
>> admin/analysis page shows you the transformation of the input
>> at each stage, i.e. how the input is tokenized, what transformations
>> are applied to each token etc. It's invaluable!
>>
>> Best
>> Erick
>>
>> P.S. Feel free to un-check the "verbose" box, it provides lots
>> of information but can be overwhelming, especially at first!
>>
>> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
>>  wrote:
>> > Ok thank you all for the great help!
>> > Now I'm ready to start playing with my index!
>> >
>> > Best,
>> > Flavio
>> >
>> >
>> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <
>> j...@basetechnology.com>wrote:
>> >
>> >> Yeah, URL Classify does only do so much. That's why you need to combine
>> >> multiple methods.
>> >>
>> >> As a fourth method, you could code up a short JavaScript "**
>> >> StatelessScriptUpdateProcessor" that did something like take a
>> full
>> >> domain name (such as output by URL Classify) and turn it into multiple
>> >> values, each with more of the prefix removed, so that "
>> lucene.apache.org"
>> >> would index as:
>> >>
>> >> lucene.apache.org
>> >> apache.org
>> >> apache
>> >> .org
>> >> org
>> >>
>> >> And then the user could query by any of those partial domain names.
>> >>
>> >> But, if you simply tokenize the URL (copy the URL string to a text
>> field),
>> >> you automatically get most of that. The user can query by a URL
>> fragment,
>> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
>> >> tokenization will strip out the punctuation.
>> >>
>> >> I'll add this script to my list of examples to add in the next rev of
>> >> my
>> >> book.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: Flavio Pompermaier
>> >> Sent: Tuesday, June 25, 2013 10:06 AM
>> >>
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: URL search and indexing
>> >>
>> >> I bought the book and looking at the example I still don't understand
>> if it
>> >> possible query all sub-urls of my URL.
>> >> For example, if the URLClassifyProcessorFactory takes in input
>> >> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
>> >> and makes some
>> >> outputs like
>> >> - "url_domain_s":"lucene.apache.org"
>> >> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
>> >> Ho

Re: Solr indexer and Hadoop

2013-06-26 Thread Jack Krupansky

See Mark's comments on the Jira when I asked that question.

My take: If 4.4 happens real soon (which some people have proposed), then it 
may not make it into 4.4. But if a 4.4 RC doesn't happen for another couple 
of weeks (my inclination), then the HDFS support could well make it into 
4.4. If not in 4.4, 4.5 is probably a slam-dunk.


-- Jack Krupansky

-Original Message- 
From: David Larochelle

Sent: Wednesday, June 26, 2013 11:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr indexer and Hadoop

Pardon my unfamiliarity with the Solr development process.

Now that it's in the trunk, will it appear in the next 4.X release?

--

David



On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson 
wrote:



Well, it's been merged into trunk according to the comments, so

Try it on trunk, help with any bugs, buy Mark beer.

And, most especially, document up what it takes to make it work.
Mark is juggling a zillion things and I'm sure he'd appreciate any
help there.

Erick

On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta
 wrote:
> zomghowcanihelp? :)
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062  | c: +1 917 477 7906
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions  | g+:
> plus.google.com/appinions
> w: appinions.com 
>
>
> On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson wrote:
>
>> You might be interested in following:
>> https://issues.apache.org/jira/browse/SOLR-4916
>>
>> Best
>> Erick
>>
>> On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta
>>  wrote:
>> > Jack,
>> >
>> > Sorry, but I don't agree that it's that cut and dried. I've very
>> > successfully worked with terabytes of data in Hadoop that was stored
on
>> an
>> > Isilon mounted via NFS, for example. In cases like this, you're using
>> > MapReduce purely for its execution model (which existed far before
>> Hadoop
>> > and HDFS ever did).
>> >
>> >
>> > Michael Della Bitta
>> >
>> > Applications Developer
>> >
>> > o: +1 646 532 3062  | c: +1 917 477 7906
>> >
>> > appinions inc.
>> >
>> > “The Science of Influence Marketing”
>> >
>> > 18 East 41st Street
>> >
>> > New York, NY 10017
>> >
>> > t: @appinions  | g+:
>> > plus.google.com/appinions
>> > w: appinions.com 
>> >
>> >
>> > On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky <
j...@basetechnology.com
>> >wrote:
>> >
>> >> ???
>> >>
>> >> Hadoop=HDFS
>> >>
>> >> If the data is not in Hadoop/HDFS, just use the normal Solr indexing
>> >> tools, including SolrCell and Data Import Handler, and possibly
>> ManifoldCF.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: engy.morsy
>> >> Sent: Tuesday, June 25, 2013 8:10 AM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr indexer and Hadoop
>> >>
>> >>
>> >> Thank you Jack. So, I need to convert those nodes holding data to
HDFS.
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>>





Re: URL search and indexing

2013-06-26 Thread Erick Erickson
Your other best friend is &debug=query on the URL, you might
be seeing different parsed queries than you expect, although that
doesn't really hold water given you say SolrJ fixes things.

I'd be surprised if posting the xml was the culprit, but you never
know. Did you re-index after schema changes etc?

Best
Erick

On Wed, Jun 26, 2013 at 8:18 AM, Jack Krupansky  wrote:
> If there is a bug... we should identify it. What's a sample post command
> that you issued?
>
>
> -- Jack Krupansky
>
> -Original Message- From: Flavio Pompermaier
> Sent: Wednesday, June 26, 2013 10:53 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: URL search and indexing
>
> I was doing exactly that and, thanks to the administration page and
> explanation/debugging, I checked if results were those expected.
> Unfortunately, results were not correct when submitting updates through the post.sh
> script (which uses curl in the end).
> Probably, if it finds the same tag (same value for the same field-name),
> it will collapse them.
> Rewriting the same document in Java and submitting the updates made
> things work correctly.
>
> In my opinion this is a bug (of the entire process, though I don't know if
> this is a problem of curl or of the script itself).
>
> Best,
> Flavio
>
> On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson
> wrote:
>
>> Flavio:
>>
>> You mention that you're new to Solr, so I thought I'd make sure
>> you know that the admin/analysis page is your friend! I flat
>> guarantee that as you try to index/search following the suggestions
>> you'll scratch your head at your results and you'll discover that
>> the analysis process isn't doing quite what you expect. The
>> admin/analysis page shows you the transformation of the input
>> at each stage, i.e. how the input is tokenized, what transformations
>> are applied to each token etc. It's invaluable!
>>
>> Best
>> Erick
>>
>> P.S. Feel free to un-check the "verbose" box, it provides lots
>> of information but can be overwhelming, especially at first!
>>
>> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
>>  wrote:
>> > Ok thank you all for the great help!
>> > Now I'm ready to start playing with my index!
>> >
>> > Best,
>> > Flavio
>> >
>> >
>> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <
>> j...@basetechnology.com>wrote:
>> >
>> >> Yeah, URL Classify does only do so much. That's why you need to combine
>> >> multiple methods.
>> >>
>> >> As a fourth method, you could code up a short JavaScript "**
>> >> StatelessScriptUpdateProcessor**" that did something like take a full
>> >> domain name (such as output by URL Classify) and turn it into multiple
>> >> values, each with more of the prefix removed, so that "
>> lucene.apache.org"
>> >> would index as:
>> >>
>> >> lucene.apache.org
>> >> apache.org
>> >> apache
>> >> .org
>> >> org
>> >>
>> >> And then the user could query by any of those partial domain names.
>> >>
>> >> But, if you simply tokenize the URL (copy the URL string to a text
>> field),
>> >> you automatically get most of that. The user can query by a URL
>> fragment,
>> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
>> >> tokenization will strip out the punctuation.
>> >>
>> >> I'll add this script to my list of examples to add in the next rev of
>> >> >> my
>> >> book.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: Flavio Pompermaier
>> >> Sent: Tuesday, June 25, 2013 10:06 AM
>> >>
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: URL search and indexing
>> >>
>> >> I bought the book and looking at the example I still don't understand
>> if it
>> >> possible query all sub-urls of my URL.
>> >> For example, if the URLClassifyProcessorFactory takes in input >>
>> >> "url_s":"
>> >> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html<
>> http://lucene.apache.org/solr/4_0_0/changes/Changes.html>"
>> >> and makes some
>> >> outputs like
>> >> - "url_domain_s":"lucene.apache.**org "
>> >> - "url_canonical_s":"
>> >> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html<
>> http://lucene.apache.org/solr/4_0_0/changes/Changes.html>
>> >> "
>> >> How should I configure url_domain_s in order to be able to makes query
>> like
>> >> '*.apache.org'?
>> >> How should I configure url_canonical_s in order to be able to makes
>> query
>> >> like 'http://lucene.apache.org/**solr/* <
>> http://lucene.apache.org/solr/*>
>> >> '?
>> >> Is it better to have two different fields for the two queries or could
>> >> >> I
>> >> create just one field for the two kind of queries (obviously for the
>> former
>> >> case then I should query something like *://.apache.org/*)?
>> >>
>> >>
>> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <
>> j...@basetechnology.com>*
>> >> *wrote:
>> >>
>> >>  There are examples in my book:
>> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-<
>> http://www.lulu.com/shop/jack-**krupansky/solr-4x-deep-dive-**>
>> >>> ear

Re: Solr indexer and Hadoop

2013-06-26 Thread David Larochelle
Pardon my unfamiliarity with the Solr development process.

Now that it's in the trunk, will it appear in the next 4.X release?

--

David



On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson wrote:

> Well, it's been merged into trunk according to the comments, so
>
> Try it on trunk, help with any bugs, buy Mark beer.
>
> And, most especially, document up what it takes to make it work.
> Mark is juggling a zillion things and I'm sure he'd appreciate any
> help there.
>
> Erick
>
> On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta
>  wrote:
> > zomghowcanihelp? :)
> >
> > Michael Della Bitta
> >
> > Applications Developer
> >
> > o: +1 646 532 3062  | c: +1 917 477 7906
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions  | g+:
> > plus.google.com/appinions
> > w: appinions.com 
> >
> >
> > On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson  >wrote:
> >
> >> You might be interested in following:
> >> https://issues.apache.org/jira/browse/SOLR-4916
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta
> >>  wrote:
> >> > Jack,
> >> >
> >> > Sorry, but I don't agree that it's that cut and dried. I've very
> >> > successfully worked with terabytes of data in Hadoop that was stored
> on
> >> an
> >> > Isilon mounted via NFS, for example. In cases like this, you're using
> >> > MapReduce purely for its execution model (which existed far before
> >> Hadoop
> >> > and HDFS ever did).
> >> >
> >> >
> >> > Michael Della Bitta
> >> >
> >> > Applications Developer
> >> >
> >> > o: +1 646 532 3062  | c: +1 917 477 7906
> >> >
> >> > appinions inc.
> >> >
> >> > “The Science of Influence Marketing”
> >> >
> >> > 18 East 41st Street
> >> >
> >> > New York, NY 10017
> >> >
> >> > t: @appinions  | g+:
> >> > plus.google.com/appinions
> >> > w: appinions.com 
> >> >
> >> >
> >> > On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky <
> j...@basetechnology.com
> >> >wrote:
> >> >
> >> >> ???
> >> >>
> >> >> Hadoop=HDFS
> >> >>
> >> >> If the data is not in Hadoop/HDFS, just use the normal Solr indexing
> >> >> tools, including SolrCell and Data Import Handler, and possibly
> >> ManifoldCF.
> >> >>
> >> >>
> >> >> -- Jack Krupansky
> >> >>
> >> >> -Original Message- From: engy.morsy
> >> >> Sent: Tuesday, June 25, 2013 8:10 AM
> >> >> To: solr-user@lucene.apache.org
> >> >> Subject: Re: Solr indexer and Hadoop
> >> >>
> >> >>
> >> >> Thank you Jack. So, I need to convert those nodes holding data to
> HDFS.
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
> >> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >> >>
> >>
>


Re: index analyzer vs query analyzer

2013-06-26 Thread Erick Erickson
Yes! A rather extreme difference and you probably want it in both.

The admin/analysis page is your friend.

Basically, putting stuff in the type="index" section dictates what
goes into the index, and that is _all_ that is searchable. The result
of the full analysis chain is what's in the index and searchable.

Putting stuff in the type="query" section dictates what terms the index
is searched for.

So if the two don't match, you will get "surprising" results.

I'd advise that you keep them both identical until you're more familiar
with how all this works or use one of the pre-defined examples and
add or remove filters _in the same order_.
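
That said, ngram filtering is the classic case where the two sections differ on
purpose: you generate the grams at index time only and search with plain tokens.
A rough sketch (type name and gram sizes are just illustrative):

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>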

Best
Erick

On Wed, Jun 26, 2013 at 6:23 AM, Mugoma Joseph O.  wrote:
> Hello,
>
> What are the criteria for putting an analyzer at query or index time? E.g. I
> want to use NGramFilterFactory; is there a difference whether I put it
> under <analyzer type="index"> or <analyzer type="query">?
>
> Thanks.
>
>
> Mugoma
>


Re: URL search and indexing

2013-06-26 Thread Jack Krupansky
If there is a bug... we should identify it. What's a sample post command 
that you issued?


-- Jack Krupansky

-Original Message- 
From: Flavio Pompermaier

Sent: Wednesday, June 26, 2013 10:53 AM
To: solr-user@lucene.apache.org
Subject: Re: URL search and indexing

I was doing exactly that and, thanks to the administration page and
explanation/debugging, I checked if results were those expected.
Unfortunately, results were not correct when submitting updates through the post.sh
script (which uses curl in the end).
Probably, if it finds the same tag (same value for the same field-name),
it will collapse them.
Rewriting the same document in Java and submitting the updates made
things work correctly.

In my opinion this is a bug (of the entire process, though I don't know if
this is a problem of curl or of the script itself).

Best,
Flavio

On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson 
wrote:



Flavio:

You mention that you're new to Solr, so I thought I'd make sure
you know that the admin/analysis page is your friend! I flat
guarantee that as you try to index/search following the suggestions
you'll scratch your head at your results and you'll discover that
the analysis process isn't doing quite what you expect. The
admin/analysis page shows you the transformation of the input
at each stage, i.e. how the input is tokenized, what transformations
are applied to each token etc. It's invaluable!

Best
Erick

P.S. Feel free to un-check the "verbose" box, it provides lots
of information but can be overwhelming, especially at first!

On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
 wrote:
> Ok thank you all for the great help!
> Now I'm ready to start playing with my index!
>
> Best,
> Flavio
>
>
> On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <
j...@basetechnology.com>wrote:
>
>> Yeah, URL Classify does only do so much. That's why you need to combine
>> multiple methods.
>>
>> As a fourth method, you could code up a short JavaScript "**
>> StatelessScriptUpdateProcessor**" that did something like take a full
>> domain name (such as output by URL Classify) and turn it into multiple
>> values, each with more of the prefix removed, so that "
lucene.apache.org"
>> would index as:
>>
>> lucene.apache.org
>> apache.org
>> apache
>> .org
>> org
>>
>> And then the user could query by any of those partial domain names.
>>
>> But, if you simply tokenize the URL (copy the URL string to a text
field),
>> you automatically get most of that. The user can query by a URL
fragment,
>> such as "apache.org", ".org", "lucene.apache.org", etc. and the
>> tokenization will strip out the punctuation.
>>
>> I'll add this script to my list of examples to add in the next rev of 
>> my

>> book.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Flavio Pompermaier
>> Sent: Tuesday, June 25, 2013 10:06 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: URL search and indexing
>>
>> I bought the book and looking at the example I still don't understand
if it
>> possible query all sub-urls of my URL.
>> For example, if the URLClassifyProcessorFactory takes in input
>> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
>> and makes some
>> outputs like
>> - "url_domain_s":"lucene.apache.org"
>> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
>> How should I configure url_domain_s in order to be able to makes query
like
>> '*.apache.org'?
>> How should I configure url_canonical_s in order to be able to makes
query
>> like 'http://lucene.apache.org/**solr/* <
http://lucene.apache.org/solr/*>
>> '?
>> Is it better to have two different fields for the two queries or could 
>> I

>> create just one field for the two kind of queries (obviously for the
former
>> case then I should query something like *://.apache.org/*)?
>>
>>
>> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <
j...@basetechnology.com>*
>> *wrote:
>>
>>  There are examples in my book:
>>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
>>>
>>>
>>> But... I still think you should use a tokenized text field as well -
use
>>> all three: raw string, tokenized text, and URL classification fields.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Flavio Pompermaier
>>> Sent: Tuesday, June 25, 2013 9:02 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: URL search and indexing
>>>
>>>
>>> That's sound exactly what I'm looking for! However I canno

Re: URL search and indexing

2013-06-26 Thread Flavio Pompermaier
I was doing exactly that and, thanks to the administration page and
explanation/debugging, I checked if results were those expected.
Unfortunately, results were not correct when submitting updates through the post.sh
script (which uses curl in the end).
Probably, if it finds the same tag (same value for the same field-name),
it will collapse them.
Rewriting the same document in Java and submitting the updates made
things work correctly.

In my opinion this is a bug (of the entire process, though I don't know if
this is a problem of curl or of the script itself).

Best,
Flavio

On Wed, Jun 26, 2013 at 4:18 PM, Erick Erickson wrote:

> Flavio:
>
> You mention that you're new to Solr, so I thought I'd make sure
> you know that the admin/analysis page is your friend! I flat
> guarantee that as you try to index/search following the suggestions
> you'll scratch your head at your results and you'll discover that
> the analysis process isn't doing quite what you expect. The
> admin/analysis page shows you the transformation of the input
> at each stage, i.e. how the input is tokenized, what transformations
> are applied to each token etc. It's invaluable!
>
> Best
> Erick
>
> P.S. Feel free to un-check the "verbose" box, it provides lots
> of information but can be overwhelming, especially at first!
>
> On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
>  wrote:
> > Ok thank you all for the great help!
> > Now I'm ready to start playing with my index!
> >
> > Best,
> > Flavio
> >
> >
> > On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky <
> j...@basetechnology.com>wrote:
> >
> >> Yeah, URL Classify does only do so much. That's why you need to combine
> >> multiple methods.
> >>
> >> As a fourth method, you could code up a short JavaScript "**
> >> StatelessScriptUpdateProcessor**" that did something like take a full
> >> domain name (such as output by URL Classify) and turn it into multiple
> >> values, each with more of the prefix removed, so that "
> lucene.apache.org"
> >> would index as:
> >>
> >> lucene.apache.org
> >> apache.org
> >> apache
> >> .org
> >> org
> >>
> >> And then the user could query by any of those partial domain names.
> >>
> >> But, if you simply tokenize the URL (copy the URL string to a text
> field),
> >> you automatically get most of that. The user can query by a URL
> fragment,
> >> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> >> tokenization will strip out the punctuation.
> >>
> >> I'll add this script to my list of examples to add in the next rev of my
> >> book.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Flavio Pompermaier
> >> Sent: Tuesday, June 25, 2013 10:06 AM
> >>
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: URL search and indexing
> >>
> >> I bought the book and looking at the example I still don't understand
> if it
> >> possible query all sub-urls of my URL.
> >> For example, if the URLClassifyProcessorFactory takes in input "url_s":"
> >> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html<
> http://lucene.apache.org/solr/4_0_0/changes/Changes.html>"
> >> and makes some
> >> outputs like
> >> - "url_domain_s":"lucene.apache.**org "
> >> - "url_canonical_s":"
> >> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html<
> http://lucene.apache.org/solr/4_0_0/changes/Changes.html>
> >> "
> >> How should I configure url_domain_s in order to be able to makes query
> like
> >> '*.apache.org'?
> >> How should I configure url_canonical_s in order to be able to makes
> query
> >> like 'http://lucene.apache.org/**solr/* <
> http://lucene.apache.org/solr/*>
> >> '?
> >> Is it better to have two different fields for the two queries or could I
> >> create just one field for the two kind of queries (obviously for the
> former
> >> case then I should query something like *://.apache.org/*)?
> >>
> >>
> >> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <
> j...@basetechnology.com>*
> >> *wrote:
> >>
> >>  There are examples in my book:
> >>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
> >>>
> >>>
> >>> But... I still think you should use a tokenized text field as well -
> use
> >>> all three: raw string, tokenized text, and URL classification fields.
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> -Original Message- From: Flavio Pompermaier
> >>> Sent: Tuesday, June 25, 2013 9:02 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: URL search and indexing
> >>>
> >>>
> >>> That's sound exactly what I'm looking for! However I cannot find an
> >>> example
> >>> of how to use it..could yo

Dynamic Type For Solr Schema

2013-06-26 Thread Furkan KAMACI
I use Solr 4.3.1 as SolrCloud. I know that I can define analyzer at
schema.xml. Let's assume that I have specialized my analyzer for Turkish.
However I want to have another analzyer too, i.e. for English. I have that
fields at my schema:
...


...

I have a field type as text_tr that is combined for Turkish. I have another
field type as text_en that is combined for English. I have another field
at my schema as lang. lang holds the language of document as "en" or "tr".

If I get a document that has a "lang" field holds "*tr*" I want that:

...


...

If I get a document that has a "lang" field holds "*en*" I want that:

...


...

I want dynamic types just for that fields other will be same. How can I do
that properly at Solr? (UpdateRequestProcessor, ...?)


Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String

2013-06-26 Thread Erick Erickson
>From the stats component page:

"The stats component returns simple statistics for indexed numeric
fields within the DocSet"

So string, text, anything non-numeric won't work. You can declare it
multiValued but then
you have to add multiple values for the field when you send the doc to
Solr or implement
a custom update component to break them up. At least there's no filter
that I know of that
takes a delimited set of numbers and transforms them.
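
For instance, the client-side split could look roughly like this in SolrJ (field
names made up; "blob" is the newline-separated string):

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "42");
for (String value : blob.split("\n")) {
  doc.addField("myNumericField", value.trim()); // one call per value => multiValued
}
server.add(doc);

where server is your HttpSolrServer and the relevant classes are
org.apache.solr.common.SolrInputDocument and
org.apache.solr.client.solrj.impl.HttpSolrServer.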

FWIW,
Erick

On Wed, Jun 26, 2013 at 4:14 AM, Elran Dvir  wrote:
> Hi all,
>
> StatsComponent doesn't work if field's type is TextField.
> I get the following message:
> "Field type 
> textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100,
> sortMissingLast=true}} is not currently supported".
>
> My field configuration is:
>
>  sortMissingLast="true">
> 
>  />
> 
> 
>
>  multiValued="true"/>
>
> So, the reason my field is of type TextField is that in the document indexed 
> there may be multiple values in the field separated by new lines.
> The tokenizer is splitting it to multiple values and the field is indexed as 
> multi-valued field.
>
> Is there a way I can define the field as regular String field? Or a way to 
> make StatsComponent work with TextField?
>
> Thank you very much.


Re: URL search and indexing

2013-06-26 Thread Erick Erickson
Flavio:

You mention that you're new to Solr, so I thought I'd make sure
you know that the admin/analysis page is your friend! I flat
guarantee that as you try to index/search following the suggestions
you'll scratch your head at your results and you'll discover that
the analysis process isn't doing quite what you expect. The
admin/analysis page shows you the transformation of the input
at each stage, i.e. how the input is tokenized, what transformations
are applied to each token etc. It's invaluable!

Best
Erick

P.S. Feel free to un-check the "verbose" box, it provides lots
of information but can be overwhelming, especially at first!

On Wed, Jun 26, 2013 at 12:20 AM, Flavio Pompermaier
 wrote:
> Ok thank you all for the great help!
> Now I'm ready to start playing with my index!
>
> Best,
> Flavio
>
>
> On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky 
> wrote:
>
>> Yeah, URL Classify does only do so much. That's why you need to combine
>> multiple methods.
>>
>> As a fourth method, you could code up a short JavaScript "**
>> StatelessScriptUpdateProcessor**" that did something like take a full
>> domain name (such as output by URL Classify) and turn it into multiple
>> values, each with more of the prefix removed, so that "lucene.apache.org"
>> would index as:
>>
>> lucene.apache.org
>> apache.org
>> apache
>> .org
>> org
>>
>> And then the user could query by any of those partial domain names.
>>
>> But, if you simply tokenize the URL (copy the URL string to a text field),
>> you automatically get most of that. The user can query by a URL fragment,
>> such as "apache.org", ".org", "lucene.apache.org", etc. and the
>> tokenization will strip out the punctuation.
>>
>> I'll add this script to my list of examples to add in the next rev of my
>> book.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Flavio Pompermaier
>> Sent: Tuesday, June 25, 2013 10:06 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: URL search and indexing
>>
>> I bought the book and looking at the example I still don't understand if it
>> possible query all sub-urls of my URL.
>> For example, if the URLClassifyProcessorFactory takes in input "url_s":"
>> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html"
>> and makes some
>> outputs like
>> - "url_domain_s":"lucene.apache.**org "
>> - "url_canonical_s":"
>> http://lucene.apache.org/solr/**4_0_0/changes/Changes.html
>> "
>> How should I configure url_domain_s in order to be able to makes query like
>> '*.apache.org'?
>> How should I configure url_canonical_s in order to be able to makes query
>> like 'http://lucene.apache.org/**solr/* 
>> '?
>> Is it better to have two different fields for the two queries or could I
>> create just one field for the two kind of queries (obviously for the former
>> case then I should query something like *://.apache.org/*)?
>>
>>
>> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky *
>> *wrote:
>>
>>  There are examples in my book:
>>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
>>>
>>>
>>> But... I still think you should use a tokenized text field as well - use
>>> all three: raw string, tokenized text, and URL classification fields.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Flavio Pompermaier
>>> Sent: Tuesday, June 25, 2013 9:02 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: URL search and indexing
>>>
>>>
>>> That's sound exactly what I'm looking for! However I cannot find an
>>> example
>>> of how to use it..could you help me please?
>>> Moreover, about id field, isn't true that id field shouldn't be analyzed
>>> as
>>> suggested in
>>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
>>> 
>>> >
>>>
>>> ?
>>>
>>>
>>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl 
>>> wrote:
>>>
>>>  Sure you can query the url directly. Or if you choose you can split it up
>>>
 in multiple components, e.g. using
 http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/
 solr/update/processor/URLClassifyProcessor.html>>> ://lucene.apache.org/solr/4_3_**0/solr-core/org/apache/solr/**
 update/processor/**URLClassifyProcessor.html

Re: Querying multiple collections in SolrCloud

2013-06-26 Thread Erick Erickson
bq: Would the above setup qualify as "multiple compatible collections"

No. While there may be enough fields in common to form a single query,
the TF/IDF calculations will not be "compatible" and the scores from the
various collections will NOT be comparable. So simply getting the list of
top N docs will probably be dominated by the docs from a single type.

bq: How does SolrCloud combine the query results from multiple collections?

It doesn't. SolrCloud sorts the results from multiple nodes in the
_same_ collection
according to whatever sort criteria are specified, defaulting to score. Say you
ask for the top 20 docs. A node from each shard returns the top 20 docs for that
shard. The node processing them just merges all the returned lists and
only keeps
the top 20.

I don't think your last two questions are really relevant, SolrCloud
isn't built to
query multiple collections and return the results coherently.

The root problem here is that you're trying to compare docs from
different collections for "goodness" to return the top N. This isn't
actually hard
_except_ when "goodness" is the score, then it just doesn't work. You can't
even compare scores from different queries on the _same_ collection, much
less different ones. Consider two collections, books and songs. One consists
of lots and lots of text and the term frequency and inverse doc freq (TF/IDF)
will be hugely different than songs. Not to mention field length normalization.

Now, all that aside there's an option. Index all the docs in a single
collection and
use grouping (aka field collapsing) to get a single response that has the top N
docs from each type (they'll be in different sections of the original
response) and present
them to the user however makes sense. You'll get "hands on" experience in
why this isn't something that's easy to do automatically if you try to
sort these
into a single list by relevance ...
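
For what it's worth, a grouped request against such a combined collection
might look roughly like this -- just a sketch, where "combined" and the
"doctype" field are made-up names for the single collection and the type
field:

http://localhost:8983/solr/combined/select?q=apple%20pie&group=true&group.field=doctype&group.limit=5&group.ngroups=true

Each group in the response then holds the top 5 docs for one document type,
and your application decides how to present those lists side by side.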

Best
Erick

On Tue, Jun 25, 2013 at 3:35 PM, Chris Toomey  wrote:
> Thanks Jack for the alternatives.  The first is interesting but has the
> downside of requiring multiple queries to get the full matching docs.  The
> second is interesting and very simple, but has the downside of not being
> modular and being difficult to configure field boosting when the
> collections have overlapping field names with different boosts being needed
> for the same field in different document types.
>
> I'd still like to know about the viability of my original approach though
> too.
>
> Chris
>
>
> On Tue, Jun 25, 2013 at 3:19 PM, Jack Krupansky 
> wrote:
>
>> One simple scenario to consider: N+1 collections - one collection per
>> document type with detailed fields for that document type, and one common
>> collection that indexes a subset of the fields. The main user query would
>> be an edismax over the common fields in that "main" collection. You can
>> then display summary results from the common collection. You can also then
>> support "drill down" into the type-specific collection based on a "type"
>> field for each document in the main collection.
>>
>> Or, sure, you actually CAN index multiple document types in the same
>> collection - add all the fields to one schema - there is no time or space
>> penalty if most of the fields are empty for most documents.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Chris Toomey
>> Sent: Tuesday, June 25, 2013 6:08 PM
>> To: solr-user@lucene.apache.org
>> Subject: Querying multiple collections in SolrCloud
>>
>>
>> Hi, I'm investigating using SolrCloud for querying documents of different
>> but similar/related types, and have read through docs. on the wiki and done
>> many searches in these archives, but still have some questions.  Thanks in
>> advance for your help.
>>
>> Setup:
>> * Say that I have N distinct types of documents and I want to do queries
>> that return the best matches regardless document type.  I.e., something
>> akin to a Google search where I'd like to get the best matches from the
>> web, news, images, and maps.
>>
>> * Our main use case is supporting simple user-entered searches, which would
>> just contain terms / phrases and wouldn't specify fields.
>>
>> * The document types will not all have the same fields, though there may be
>> some overlap in the fields.
>>
>> * We plan to use a separate collection for each document type, and to use
>> the eDisMax query parser.  Each collection would have a document-specific
>> schema configuration with appropriate defaults for query fields and boosts,
>> etc.
>>
>> Questions:
>> * Would the above setup qualify as "multiple compatible collections", such
>> that we could search all N collections with a single SolrCloud query, as in
>> the example query "
>> http://localhost:8983/solr/collection1/select?q=apple%20pie&collection=c1,c2,...,cN"?
>> Again, we're not querying against specific fields.
>>
>> * How does SolrCloud combine the query results from multiple collections?

Re: How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?

2013-06-26 Thread Jack Krupansky

Yes, the LimitTokenCountFilterFactory will do the trick.

I have some examples in the book, showing for a given input string, what the 
output tokens will be.


Otherwise, the Solr Javadoc does give one generic example, but without
showing how it actually works:

http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilterFactory.html

The new Apache Solr Reference? No mention of the filter.
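
For reference, a rough sketch of the schema side (the type, field and source
names are only placeholders, and maxTokenCount is whatever "first N" means
for you):

<fieldType name="text_first_n" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="50"/>
  </analyzer>
</fieldType>

<field name="body_first_n" type="text_first_n" indexed="true" stored="false"/>
<copyField source="body" dest="body_first_n"/>

You can then give body_first_n its own boost in the query fields.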

-- Jack Krupansky

-Original Message- 
From: Daniel Collins

Sent: Wednesday, June 26, 2013 3:38 AM
To: solr-user@lucene.apache.org
Subject: How to truncate a particular field, LimitTokenCountAnalyzer or 
LimitTokenCountFilter?


We have a requirement to grab the first N words in a particular field and
weight them differently for scoring purposes.  So I thought to use a
<copyField> and have some extra filter on the destination to truncate it
down (post tokenization).

Did a quick search and found both a LimitTokenCountAnalyzer
and LimitTokenCountFilter mentioned, if I read the wiki right, the Filter
is the correct approach for Solr as we have the schema-able analyzer chain,
so we don't need to code anything, right?

The Analyzer version would be more useful if we were explicitly coding up a
set of operations in Java, so that's what Lucene users directly would tend
to use.

Just in search of confirmation really. 



Re: how to replicate Solr Cloud

2013-06-26 Thread Erick Erickson
On the lengthy TODO list is making SolrCloud nodes "rack aware"
that should help with this, but it's not real high in the priority queue
as I recall. The current architecture sends updates and requests
all over the cluster, so there are lots of messages that go
across the presumably expensive pipe between data centers. Not
to mention the Zookeeper quorum problem.

Hmmm, "Zookeeper Quorum problem". Say 1 ZK node is in DC1
and 2 are in DC2. If DC2 goes down, DC1 will not accept updates
because there is no available ZK quorum. I've seen one proposal
where you use 3 DCs, each with a ZK node to ameliorate this.

But all this is an issue only if the communications link between the
datacenters is "expensive" where that term can mean that it literally
costs more, that it is slow, whatever.

Best
Erick

On Tue, Jun 25, 2013 at 12:14 PM, Otis Gospodnetic
 wrote:
> Uh, I remember that email, but can't recall where we did it. I will
> try to recall it some more and reply if I can manage to dig it out of
> my brain...
>
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
>
>
> On Tue, Jun 25, 2013 at 2:24 PM, Kevin Osborn  wrote:
>> Otis,
>>
>> I did actually stumble upon this link.
>>
>> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/74870
>>
>> This was from you. You were attempting to replicate data from SolrCloud to
>> some other slaves for heavy-duty queries. You said that you accomplished
>> this. Can you provide a few pointers on how you did this? Thanks.
>>
>>
>> On Tue, Jun 25, 2013 at 10:25 AM, Otis Gospodnetic <
>> otis.gospodne...@gmail.com> wrote:
>>
>>> I think what is needed is a Leader that, while being a Leader for its
>>> own Slice in its local Cluster and Collection (I think I'm using all
>>> the latest terminology correctly here), is at the same time a Replica
>>> of its own Leader counterpart in the "Primary Cluster".
>>>
>>> Not currently possible, AFAIK.
>>> Or maybe there is a better way?
>>>
>>> Otis
>>> --
>>> Solr & ElasticSearch Support -- http://sematext.com/
>>> Performance Monitoring -- http://sematext.com/spm
>>>
>>>
>>>
>>> On Tue, Jun 25, 2013 at 1:07 PM, Kevin Osborn 
>>> wrote:
>>> > We are going to have two datacenters, each with their own SolrCloud and
>>> > ZooKeeper quorums. The end result will be that they should be replicas of
>>> > each other.
>>> >
>>> > One method that has been mentioned is that we should add documents to
>>> each
>>> > cluster separately. For various reasons, this may not be ideal for us.
>>> > Instead, we are playing around with the idea of always indexing to one
>>> > datacenter. And then having that replicate to the other datacenter. And
>>> > this is where I am having some trouble on how to proceed.
>>> >
>>> > The nice thing about SolrCloud is that there is no masters and slaves.
>>> Each
>>> > node is equals, has the same configs, etc. But in this case, I want to
>>> have
>>> > a node in one datacenter poll for changes in another data center. Before
>>> > SolrCloud, I would have used slave/master replication. But in the
>>> SolrCloud
>>> > world, I am not sure how to configure this setup?
>>> >
>>> > Or is there any better ideas on how to use replication to push or pull
>>> data
>>> > from one datacenter to another?
>>> >
>>> > In my case, NRT is not a requirement. And I will also be dealing with
>>> about
>>> > 3 collections and 5 or 6 shards.
>>> >
>>> > Thanks.
>>> >
>>> > --
>>> > *KEVIN OSBORN*
>>> > LEAD SOFTWARE ENGINEER
>>> > CNET Content Solutions
>>> > OFFICE 949.399.8714
>>> > CELL 949.310.4677  SKYPE osbornk
>>> > 5 Park Plaza, Suite 600, Irvine, CA 92614
>>> > [image: CNET Content Solutions]
>>>
>>
>>
>>
>> --
>> *KEVIN OSBORN*
>> LEAD SOFTWARE ENGINEER
>> CNET Content Solutions
>> OFFICE 949.399.8714
>> CELL 949.310.4677  SKYPE osbornk
>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>> [image: CNET Content Solutions]


Re: Result Grouping

2013-06-26 Thread Bryan Bende
The field I am grouping on is a single-valued string.

It looks like in non-distributed mode if I use group=true, sort,
group.sort, and
group.limit=1, it will..

- group the results
- sort with in each group
- limit down to 1 result per group
- apply the sort between groups using the single result of each group

When I run with numShards >= 1...

- group the results
- apply the sort between groups using the document from each group based
on the sort, for example if sort= popularity desc then it uses the highest
popularity from each group
- sort with in the group
- limit down to 1 result per group

I was trying to confirm if this was the expected behavior, or if there is
something I could do to get the first behavior in a distributed configuration.
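
For reference, the kind of request I'm running looks roughly like this (the
group field name is just an example; sort and group.sort are as described
above):

http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=groupId&group.limit=1&group.sort=date+asc&sort=popularity+desc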

I posted this a few days ago describing the scenario in more detail if
you are interested...
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCALo_M18WVoLKvepJMu0wXk_x2H8cv3UaX9RQYtEh4-mksQHLBA%40mail.gmail.com%3E


> What type of field are you grouping on? What happens when you distribute
> ?it? I.e. what specifically goes wrong?

> Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:
> I was reading this documentation on Result Grouping...
> http://docs.lucidworks.com/display/solr/Result+Grouping
>
> which says...
>
> sort - sortspec - Specifies how Solr sorts the groups relative to each
> other. For example, sort=popularity desc will cause the groups to be
> sorted
> according to the highest popularity document in each group. The default
> value is score desc.
>
> group.sort - sort.spec - Specifies how Solr sorts documents within a
> single
> group. The default value is score desc.
>
> Is it possible to use these parameters such that group.sort would first
> sort with in each group, and then the overall sort would be applied
> according to the first element of each sorted group ?
>
> For example, using the scenario above where it has "sort=popularity
> desc",
> could you also have "group.sort=date asc" resulting in the most
> recent
> document of each group being sorted by decreasing popularity ?
>
> It seems to work the way I described when running a single node Solr 4.3
> instance, but in a 2 shard configuration it appears to work differently.
>
> -Bryan


Re: Solr indexer and Hadoop

2013-06-26 Thread Erick Erickson
Well, it's been merged into trunk according to the comments, so

Try it on trunk, help with any bugs, buy Mark beer.

And, most especially, document up what it takes to make it work.
Mark is juggling a zillion things and I'm sure he'd appreciate any
help there.

Erick

On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta
 wrote:
> zomghowcanihelp? :)
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062  | c: +1 917 477 7906
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions  | g+:
> plus.google.com/appinions
> w: appinions.com 
>
>
> On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson 
> wrote:
>
>> You might be interested in following:
>> https://issues.apache.org/jira/browse/SOLR-4916
>>
>> Best
>> Erick
>>
>> On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta
>>  wrote:
>> > Jack,
>> >
>> > Sorry, but I don't agree that it's that cut and dried. I've very
>> > successfully worked with terabytes of data in Hadoop that was stored on
>> an
>> > Isilon mounted via NFS, for example. In cases like this, you're using
>> > MapReduce purely for it's execution model (which existed far before
>> Hadoop
>> > and HDFS ever did).
>> >
>> >
>> > Michael Della Bitta
>> >
>> > Applications Developer
>> >
>> > o: +1 646 532 3062  | c: +1 917 477 7906
>> >
>> > appinions inc.
>> >
>> > “The Science of Influence Marketing”
>> >
>> > 18 East 41st Street
>> >
>> > New York, NY 10017
>> >
>> > t: @appinions  | g+:
>> > plus.google.com/appinions
>> > w: appinions.com 
>> >
>> >
>> > On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky > >wrote:
>> >
>> >> ???
>> >>
>> >> Hadoop=HDFS
>> >>
>> >> If the data is not in Hadoop/HDFS, just use the normal Solr indexing
>> >> tools, including SolrCell and Data Import Handler, and possibly
>> ManifoldCF.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -Original Message- From: engy.morsy
>> >> Sent: Tuesday, June 25, 2013 8:10 AM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr indexer and Hadoop
>> >>
>> >>
>> >> Thank you Jack. So, I need to convert those nodes holding data to HDFS.
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context: http://lucene.472066.n3.**
>> >> nabble.com/Solr-indexer-and-**Hadoop-tp4072951p4073013.html<
>> http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
>> >
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>>


Get the query result from one collection and send it to other collection to for merging the result sets

2013-06-26 Thread Jilani Shaik
Hi,

We will have two categories of data, where one category will be the list of
primary data (for example products) and the other collection (it could be
spread across shards) holds the transaction data (for example product sales
data).



We have a search scenario where we need to show the products along with the
number of sales for each product. For this we need to do a facet-based
search on the second collection, and then this has to be shown together with
the primary data.
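
For example, a facet query against the sales collection might look roughly
like this (the collection and field names here are only placeholders):

http://localhost:8983/solr/sales/select?q=*:*&rows=0&facet=true&facet.field=productId&facet.limit=-1

which returns a sales count per productId that then has to be matched up
with the corresponding product records.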


Is there any way to handle this kind of scenario. Please suggest any other
approaches to get the desired result.


Thank you,
Jilani


Re: [solr cloud] solr hangs when indexing large number of documents from multiple threads

2013-06-26 Thread Erick Erickson
Right, unfortunately this is a gremlin lurking in the weeds, see:
http://wiki.apache.org/solr/DistributedSearch#Distributed_Deadlock

There are a couple of ways to deal with this:
1> go ahead and up the limit and re-compile, if you look at
SolrCmdDistributor the semaphore is defined there.

2> https://issues.apache.org/jira/browse/SOLR-4816 should
address this as well as improve indexing throughput. I'm totally sure
Joel (the guy working on this) would be thrilled if you were able to
verify that these two points, I'd ask him (on the JIRA) whether he thinks
it's ready to test.

3> Reduce the number of threads you're indexing with

4> index docs in small packets, perhaps even one and just rack
together a zillion threads to get throughput.

FWIW,
Erick

On Tue, Jun 25, 2013 at 8:55 AM, Vinay Pothnis  wrote:
> Jason and Scott,
>
> Thanks for the replies and pointers!
> Yes, I will consider the 'maxDocs' value as well. How do I monitor the
> transaction logs during the interval between commits?
>
> Thanks
> Vinay
>
>
> On Mon, Jun 24, 2013 at 8:48 PM, Jason Hellman <
> jhell...@innoventsolutions.com> wrote:
>
>> Scott,
>>
>> My comment was meant to be a bit tongue-in-cheek, but my intent in the
>> statement was to represent hard failure along the lines Vinay is seeing.
>>  We're talking about OutOfMemoryException conditions, total cluster
>> paralysis requiring restart, or other similar and disastrous conditions.
>>
>> Where that line is is impossible to generically define, but trivial to
>> accomplish.  What any of us running Solr has to achieve is a realistic
>> simulation of our desired production load (probably well above peak) and to
>> see what limits are reached.  Armed with that information we tweak.  In
>> this case, we look at finding the point where data ingestion reaches a
>> natural limit.  For some that may be JVM GC, for others memory buffer size
>> on the client load, and yet others it may be I/O limits on multithreaded
>> reads from a database or file system.
>>
>> In old Solr days we had a little less to worry about.  We might play with
>> a commitWithin parameter, ramBufferSizeMB tweaks, or contemplate partial
>> commits and rollback recoveries.  But with 4.x we now have more durable
>> write options and NRT to consider, and SolrCloud begs to use this.  So we
>> have to consider transaction logs, the file handles they leave open until
>> commit operations occur, and how we want to manage writing to all cores
>> simultaneously instead of a more narrow master/slave relationship.
>>
>> It's all manageable, all predictable (with some load testing) and all
>> filled with many possibilities to meet our specific needs.  Considering that
>> each person's data model, ingestion pipeline, request processors, and field
>> analysis steps will be different, 5 threads of input at face value doesn't
>> really contemplate the whole problem.  We have to measure our actual data
>> against our expectations and find where the weak chain links are to
>> strengthen them.  The symptoms aren't necessarily predictable in advance of
>> this testing, but they're likely addressable and not difficult to decipher.
>>
>> For what it's worth, SolrCloud is new enough that we're still experiencing
>> some "uncharted territory with unknown ramifications" but with continued
>> dialog through channels like these there are fewer territories without good
>> cartography :)
>>
>> Hope that's of use!
>>
>> Jason
>>
>>
>>
>> On Jun 24, 2013, at 7:12 PM, Scott Lundgren <
>> scott.lundg...@carbonblack.com> wrote:
>>
>> > Jason,
>> >
>> > Regarding your statement "push you over the edge"- what does that mean?
>> > Does it mean "uncharted territory with unknown ramifications" or
>> something
>> > more like specific, known symptoms?
>> >
>> > I ask because our use is similar to Vinay's in some respects, and we want
>> > to be able to push the capabilities of write perf - but not over the
>> edge!
>> > In particular, I am interested in knowing the symptoms of failure, to
>> help
>> > us troubleshoot the underlying problems if and when they arise.
>> >
>> > Thanks,
>> >
>> > Scott
>> >
>> > On Monday, June 24, 2013, Jason Hellman wrote:
>> >
>> >> Vinay,
>> >>
>> >> You may wish to pay attention to how many transaction logs are being
>> >> created along the way to your hard autoCommit, which should truncate the
>> >> open handles for those files.  I might suggest setting a maxDocs value
>> in
>> >> parallel with your maxTime value (you can use both) to ensure the commit
>> >> occurs at either breakpoint.  30 seconds is plenty of time for 5
>> parallel
>> >> processes of 20 document submissions to push you over the edge.
>> >>
>> >> Jason
>> >>
>> >> On Jun 24, 2013, at 2:21 PM, Vinay Pothnis  wrote:
>> >>
>> >>> I have 'softAutoCommit' at 1 second and 'hardAutoCommit' at 30 seconds.
>> >>>
>> >>> On Mon, Jun 24, 2013 at 1:54 PM, Jason Hellman <
>> >>> jhell...@innoventsolutions.com> wrote:
>> >>>
>>  Vinay,
>> 
>>  What autoCommit 

index analyzer vs query analyzer

2013-06-26 Thread Mugoma Joseph O.
Hello,

What are the criteria for putting an analyzer at query or index? E.g. I
want to use NGramFilterFactory; is there a difference whether I put it
under <analyzer type="index"> or <analyzer type="query">?
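
For reference, I mean the difference between these two placements in
schema.xml -- just a sketch, with an arbitrary type name and gram sizes:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>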

Thanks.


Mugoma



Re: StatsComponent doesn't work if field's type is TextField - can I change field's type to String

2013-06-26 Thread Jack Krupansky
You could use an update processor to turn the text string into multiple 
string values. A short snippet  of JavaScript in a 
StatelessScriptUpdateProcessor could do the trick. The field could then be a 
multivalued string field.
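
Roughly along these lines -- only a sketch, with "mytextfield" standing in
for the real field name. In solrconfig.xml:

<updateRequestProcessorChain name="split-lines">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">split-lines.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And split-lines.js in the conf directory:

function processAdd(cmd) {
  // cmd.solrDoc is the SolrInputDocument being indexed
  var doc = cmd.solrDoc;
  var raw = doc.getFieldValue("mytextfield");
  if (raw != null) {
    doc.removeField("mytextfield");
    // split the newline-separated text into individual string values
    var parts = String(raw).split("\n");
    for (var i = 0; i < parts.length; i++) {
      var v = parts[i].replace(/^\s+|\s+$/g, "");
      if (v.length > 0) {
        doc.addField("mytextfield", v);
      }
    }
  }
}
// stubs for the other update events (may be needed depending on version)
function processDelete(cmd) { }
function processCommit(cmd) { }
function processMergeIndexes(cmd) { }
function processRollback(cmd) { }
function finish() { }

Select the chain with update.chain=split-lines on the update request, and the
field itself can then be declared as a plain multivalued string field.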


-- Jack Krupansky

-Original Message- 
From: Elran Dvir

Sent: Wednesday, June 26, 2013 7:14 AM
To: solr-user@lucene.apache.org
Subject: StatsComponent doesn't work if field's type is TextField - can I 
change field's type to String


Hi all,

StatsComponent doesn't work if field's type is TextField.
I get the following message:
"Field type 
textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100,

sortMissingLast=true}} is not currently supported".

My field configuration is:

<fieldType name="textstring" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
   <analyzer>
      <tokenizer ... />
   </analyzer>
</fieldType>

<field name="..." type="textstring" multiValued="true"/>


So, the reason my field is of type TextField is that in the document indexed 
there may be multiple values in the field separated by new lines.
The tokenizer is splitting it to multiple values and the field is indexed as 
multi-valued field.


Is there a way I can define the field as regular String field? Or a way to 
make StatsComponent work with TextField?


Thank you very much. 



Re: Is there a way to capture div tag by id?

2013-06-26 Thread Michael Sokolov

On 06/25/2013 01:17 PM, eShard wrote:

let's say I have a div with id="myDiv"
Is there a way to set up the solr update/extract handler to capture just that
particular div?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
Sent from the Solr - User mailing list archive at Nabble.com.
   
You might be interested in Lux (see at http://luxdb.org), which provides 
XML-aware indexing for Solr.  It indexes text in the context of every 
element, and also allows you to explicitly define indexes using any 
XPath 2.0 expression, including //div[@id='myDiv'], for example.


--
Michael Sokolov
Senior Architect
Safari Books Online



Re: multiValued field score and count

2013-06-26 Thread Flavio Pompermaier
I tried to play a little with the tools you suggested. However, I probably
miss something because the term frequency is not that expected.
My itemid field is defined (in schema.xml) as:

 

I was supposing that indexing via post.sh the xml mentioned in the first
mail, the term frequency of itemid 1000 was 3 in the first doc and 1 in the
second!
Instead, I got that result only if I change my settings to:

 
 
  

  


and I modify my populating xml as:


   1
   11
   9
   1000 1000 1000
   5000


   2
   3
   1000


Is there a way to achieve termFrequency=3 for doc1 also using my initial
settings (itemid as string and just one value per itemid-tag)?

Best,
Flavio

On Wed, Jun 26, 2013 at 12:38 PM, Upayavira  wrote:

> I mentioned two features, [explain] and termfreq(field, 'value').
> Neither of these require anything special, as they are using stuff
> central to Lucene's scoring mechanisms. I think you can turn off the
> storage of term frequencies, obviously that would spoil things, but
> that's certainly not the default.
>
> I typed the syntax below from memory, so I might not have got it exactly
> right.
>
> Upayavira
>
> On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:
> > So, in order to achieve that feature I have to declare my fileds
> > (authorid
> > and itemid) with termVectors="true" termPositions="true"
> > termOffsets="false"?
> > Should it be enough?
> >
> >
> > On Wed, Jun 26, 2013 at 10:42 AM, Upayavira  wrote:
> >
> > > Add fl=[explain],* to your query, and review the output in the new
> > > field. It will tell you how the score was calculated. Look at the TF or
> > > termfreq values, as this is the number of times the term appears.
> > >
> > > Also, you could add this to your fl= param: count:termfreq(authorid,
> > > '1000’) which would give you a new field telling you how many times the
> > > term 1000 appears in the authorid field for each document.
> > >
> > > Upayavira
> > >
> > > On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:
> > > > Hi to everybody,
> > > > I have some multiValued (single-token) field, for example authorid
> and
> > > > itemid, and what I'd like to know if there's the possibility to know
> how
> > > > many times a match was found in that document for some field and if
> the
> > > > score is higher when multiple match are found. For example, my docs
> are:
> > > >
> > > > 
> > > >1
> > > >11
> > > >9
> > > >1000
> > > >1000
> > > >1000
> > > >5000
> > > > 
> > > > 
> > > >2
> > > >3
> > > >1000
> > > > 
> > > >
> > > > Whould the first document have an higher score than the second if I
> > > > search
> > > > for itemid=1000? Is it possible to know how many times the match was
> > > > found
> > > > (3 for the doc1 and 1 for doc2)?
> > > >
> > > > Otherwise, how could I achieve that result?
> > > >
> > > > Best,
> > > > Flavio
> > > > --
> > > >
> > > > Flavio Pompermaier
> > > > *Development Department
> > > > *___
> > > > *OKKAM**Srl **- www.okkam.it*
> > > >
> > > > *Phone:* +(39) 0461 283 702
> > > > *Fax:* + (39) 0461 186 6433
> > > > *Email:* f.pomperma...@okkam.it
> > > > *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > > > *Registered office:* Trento (Italy), via Segantini 23
> > > >
> > > > Confidentially notice. This e-mail transmission may contain legally
> > > > privileged and/or confidential information. Please do not read it if
> you
> > > > are not the intended recipient(S). Any use, distribution,
> reproduction or
> > > > disclosure by any other person is strictly prohibited. If you have
> > > > received
> > > > this e-mail in error, please notify the sender and destroy the
> original
> > > > transmission and its attachments without reading or saving it in any
> > > > manner.
> > >
> >
> >
> >
> > --
> >
> > Flavio Pompermaier
> > *Development Department
> > *___
> > *OKKAM**Srl **- www.okkam.it*
> >
> > *Phone:* +(39) 0461 283 702
> > *Fax:* + (39) 0461 186 6433
> > *Email:* f.pomperma...@okkam.it
> > *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > *Registered office:* Trento (Italy), via Segantini 23
> >
> > Confidentially notice. This e-mail transmission may contain legally
> > privileged and/or confidential information. Please do not read it if you
> > are not the intended recipient(S). Any use, distribution, reproduction or
> > disclosure by any other person is strictly prohibited. If you have
> > received
> > this e-mail in error, please notify the sender and destroy the original
> > transmission and its attachments without reading or saving it in any
> > manner.
>


StatsComponent doesn't work if field's type is TextField - can I change field's type to String

2013-06-26 Thread Elran Dvir
Hi all,

StatsComponent doesn't work if field's type is TextField.
I get the following message:
"Field type 
textstring{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100,
sortMissingLast=true}} is not currently supported".

My field configuration is:









So, the reason my field is of type TextField is that in the document indexed 
there may be multiple values in the field separated by new lines.
The tokenizer splits it into multiple values and the field is indexed as a
multi-valued field.

Is there a way I can define the field as regular String field? Or a way to make 
StatsComponent work with TextField?

Thank you very much.


Re: Is there a way to capture div tag by id?

2013-06-26 Thread Arcadius Ahouansou
Hi.

I ran into this issue a while ago.
In my case, the div I was trying to extract was the main content of the
page.
If that is your case, boilerpipe may help.
There is a patch at https://issues.apache.org/jira/browse/SOLR-3808  that
worked for me.

Arcadius.


On 25 June 2013 18:17, eShard  wrote:

> let's say I have a div with id="myDiv"
> Is there a way to set up the solr update/extract handler to capture just
> that
> particular div?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-there-a-way-to-capture-div-tag-by-id-tp4073120.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: multiValued field score and count

2013-06-26 Thread Upayavira
I mentioned two features, [explain] and termfreq(field, 'value').
Neither of these require anything special, as they are using stuff
central to Lucene's scoring mechanisms. I think you can turn off the
storage of term frequencies, obviously that would spoil things, but
that's certainly not the default.

I typed the syntax below from memory, so I might not have got it exactly
right.

Upayavira

On Wed, Jun 26, 2013, at 10:22 AM, Flavio Pompermaier wrote:
> So, in order to achieve that feature I have to declare my fileds
> (authorid
> and itemid) with termVectors="true" termPositions="true"
> termOffsets="false"?
> Should it be enough?
> 
> 
> On Wed, Jun 26, 2013 at 10:42 AM, Upayavira  wrote:
> 
> > Add fl=[explain],* to your query, and review the output in the new
> > field. It will tell you how the score was calculated. Look at the TF or
> > termfreq values, as this is the number of times the term appears.
> >
> > Also, you could add this to your fl= param: count:termfreq(authorid,
> > '1000’) which would give you a new field telling you how many times the
> > term 1000 appears in the authorid field for each document.
> >
> > Upayavira
> >
> > On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:
> > > Hi to everybody,
> > > I have some multiValued (single-token) field, for example authorid and
> > > itemid, and what I'd like to know if there's the possibility to know how
> > > many times a match was found in that document for some field and if the
> > > score is higher when multiple match are found. For example, my docs are:
> > >
> > > 
> > >1
> > >11
> > >9
> > >1000
> > >1000
> > >1000
> > >5000
> > > 
> > > 
> > >2
> > >3
> > >1000
> > > 
> > >
> > > Whould the first document have an higher score than the second if I
> > > search
> > > for itemid=1000? Is it possible to know how many times the match was
> > > found
> > > (3 for the doc1 and 1 for doc2)?
> > >
> > > Otherwise, how could I achieve that result?
> > >
> > > Best,
> > > Flavio
> > > --
> > >
> > > Flavio Pompermaier
> > > *Development Department
> > > *___
> > > *OKKAM**Srl **- www.okkam.it*
> > >
> > > *Phone:* +(39) 0461 283 702
> > > *Fax:* + (39) 0461 186 6433
> > > *Email:* f.pomperma...@okkam.it
> > > *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > > *Registered office:* Trento (Italy), via Segantini 23
> > >
> > > Confidentially notice. This e-mail transmission may contain legally
> > > privileged and/or confidential information. Please do not read it if you
> > > are not the intended recipient(S). Any use, distribution, reproduction or
> > > disclosure by any other person is strictly prohibited. If you have
> > > received
> > > this e-mail in error, please notify the sender and destroy the original
> > > transmission and its attachments without reading or saving it in any
> > > manner.
> >
> 
> 
> 
> -- 
> 
> Flavio Pompermaier
> *Development Department
> *___
> *OKKAM**Srl **- www.okkam.it*
> 
> *Phone:* +(39) 0461 283 702
> *Fax:* + (39) 0461 186 6433
> *Email:* f.pomperma...@okkam.it
> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> *Registered office:* Trento (Italy), via Segantini 23
> 
> Confidentially notice. This e-mail transmission may contain legally
> privileged and/or confidential information. Please do not read it if you
> are not the intended recipient(S). Any use, distribution, reproduction or
> disclosure by any other person is strictly prohibited. If you have
> received
> this e-mail in error, please notify the sender and destroy the original
> transmission and its attachments without reading or saving it in any
> manner.


Re: multiValued field score and count

2013-06-26 Thread Flavio Pompermaier
So, in order to achieve that feature I have to declare my fileds (authorid
and itemid) with termVectors="true" termPositions="true"
termOffsets="false"?
Should it be enough?


On Wed, Jun 26, 2013 at 10:42 AM, Upayavira  wrote:

> Add fl=[explain],* to your query, and review the output in the new
> field. It will tell you how the score was calculated. Look at the TF or
> termfreq values, as this is the number of times the term appears.
>
> Also, you could add this to your fl= param: count:termfreq(authorid,
> '1000’) which would give you a new field telling you how many times the
> term 1000 appears in the authorid field for each document.
>
> Upayavira
>
> On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:
> > Hi to everybody,
> > I have some multiValued (single-token) field, for example authorid and
> > itemid, and what I'd like to know if there's the possibility to know how
> > many times a match was found in that document for some field and if the
> > score is higher when multiple match are found. For example, my docs are:
> >
> > 
> >1
> >11
> >9
> >1000
> >1000
> >1000
> >5000
> > 
> > 
> >2
> >3
> >1000
> > 
> >
> > Whould the first document have an higher score than the second if I
> > search
> > for itemid=1000? Is it possible to know how many times the match was
> > found
> > (3 for the doc1 and 1 for doc2)?
> >
> > Otherwise, how could I achieve that result?
> >
> > Best,
> > Flavio
> > --
> >
> > Flavio Pompermaier
> > *Development Department
> > *___
> > *OKKAM**Srl **- www.okkam.it*
> >
> > *Phone:* +(39) 0461 283 702
> > *Fax:* + (39) 0461 186 6433
> > *Email:* f.pomperma...@okkam.it
> > *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> > *Registered office:* Trento (Italy), via Segantini 23
> >
> > Confidentially notice. This e-mail transmission may contain legally
> > privileged and/or confidential information. Please do not read it if you
> > are not the intended recipient(S). Any use, distribution, reproduction or
> > disclosure by any other person is strictly prohibited. If you have
> > received
> > this e-mail in error, please notify the sender and destroy the original
> > transmission and its attachments without reading or saving it in any
> > manner.
>



-- 

Flavio Pompermaier
*Development Department
*___
*OKKAM**Srl **- www.okkam.it*

*Phone:* +(39) 0461 283 702
*Fax:* + (39) 0461 186 6433
*Email:* f.pomperma...@okkam.it
*Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
*Registered office:* Trento (Italy), via Segantini 23

Confidentially notice. This e-mail transmission may contain legally
privileged and/or confidential information. Please do not read it if you
are not the intended recipient(S). Any use, distribution, reproduction or
disclosure by any other person is strictly prohibited. If you have received
this e-mail in error, please notify the sender and destroy the original
transmission and its attachments without reading or saving it in any manner.


Re: multiValued field score and count

2013-06-26 Thread Upayavira
Add fl=[explain],* to your query, and review the output in the new
field. It will tell you how the score was calculated. Look at the TF or
termfreq values, as this is the number of times the term appears.

Also, you could add this to your fl= param: count:termfreq(authorid,
'1000') which would give you a new field telling you how many times the
term 1000 appears in the authorid field for each document.
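
Putting the two together, a request against the example docs might look
something like this (typed from memory, so double-check the syntax):

http://localhost:8983/solr/collection1/select?q=itemid:1000&fl=*,[explain],count:termfreq(itemid,'1000')

which adds, for each matching document, a "count" pseudo-field with the raw
term frequency next to the [explain] output.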

Upayavira

On Wed, Jun 26, 2013, at 09:34 AM, Flavio Pompermaier wrote:
> Hi to everybody,
> I have some multiValued (single-token) field, for example authorid and
> itemid, and what I'd like to know if there's the possibility to know how
> many times a match was found in that document for some field and if the
> score is higher when multiple match are found. For example, my docs are:
> 
> 
>1
>11
>9
>1000
>1000
>1000
>5000
> 
> 
>2
>3
>1000
> 
> 
> Whould the first document have an higher score than the second if I
> search
> for itemid=1000? Is it possible to know how many times the match was
> found
> (3 for the doc1 and 1 for doc2)?
> 
> Otherwise, how could I achieve that result?
> 
> Best,
> Flavio
> -- 
> 
> Flavio Pompermaier
> *Development Department
> *___
> *OKKAM**Srl **- www.okkam.it*
> 
> *Phone:* +(39) 0461 283 702
> *Fax:* + (39) 0461 186 6433
> *Email:* f.pomperma...@okkam.it
> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> *Registered office:* Trento (Italy), via Segantini 23
> 
> Confidentially notice. This e-mail transmission may contain legally
> privileged and/or confidential information. Please do not read it if you
> are not the intended recipient(S). Any use, distribution, reproduction or
> disclosure by any other person is strictly prohibited. If you have
> received
> this e-mail in error, please notify the sender and destroy the original
> transmission and its attachments without reading or saving it in any
> manner.


multiValued field score and count

2013-06-26 Thread Flavio Pompermaier
Hi to everybody,
I have some multiValued (single-token) field, for example authorid and
itemid, and what I'd like to know if there's the possibility to know how
many times a match was found in that document for some field and if the
score is higher when multiple match are found. For example, my docs are:


   1
   11
   9
   1000
   1000
   1000
   5000


   2
   3
   1000


Would the first document have a higher score than the second if I search
for itemid=1000? Is it possible to know how many times the match was found
(3 for the doc1 and 1 for doc2)?

Otherwise, how could I achieve that result?

Best,
Flavio
-- 

Flavio Pompermaier
*Development Department
*___
*OKKAM**Srl **- www.okkam.it*

*Phone:* +(39) 0461 283 702
*Fax:* + (39) 0461 186 6433
*Email:* f.pomperma...@okkam.it
*Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
*Registered office:* Trento (Italy), via Segantini 23

Confidentially notice. This e-mail transmission may contain legally
privileged and/or confidential information. Please do not read it if you
are not the intended recipient(S). Any use, distribution, reproduction or
disclosure by any other person is strictly prohibited. If you have received
this e-mail in error, please notify the sender and destroy the original
transmission and its attachments without reading or saving it in any manner.


Re: Result Grouping

2013-06-26 Thread Upayavira
What type of field are you grouping on? What happens when you distribute
it? I.e. what specifically goes wrong?

Upayavira

On Tue, Jun 25, 2013, at 09:12 PM, Bryan Bende wrote:
> I was reading this documentation on Result Grouping...
> http://docs.lucidworks.com/display/solr/Result+Grouping
> 
> which says...
> 
> sort - sortspec - Specifies how Solr sorts the groups relative to each
> other. For example, sort=popularity desc will cause the groups to be
> sorted
> according to the highest popularity document in each group. The default
> value is score desc.
> 
> group.sort - sort.spec - Specifies how Solr sorts documents within a
> single
> group. The default value is score desc.
> 
> Is it possible to use these parameters such that group.sort would first
> sort with in each group, and then the overall sort would be applied
> according to the first element of each sorted group ?
> 
> For example, using the scenario above where it has "sort=popularity
> desc",
> could you also have "group.sort=date asc" resulting in the most
> recent
> document of each group being sorted by decreasing popularity ?
> 
> It seems to work the way I described when running a single node Solr 4.3
> instance, but in a 2 shard configuration it appears to work differently.
> 
> -Bryan


How to truncate a particular field, LimitTokenCountAnalyzer or LimitTokenCountFilter?

2013-06-26 Thread Daniel Collins
We have a requirement to grab the first N words in a particular field and
weight them differently for scoring purposes.  So I thought to use a
<copyField> and have some extra filter on the destination to truncate it
down (post tokenization).

Did a quick search and found both a LimitTokenCountAnalyzer
and LimitTokenCountFilter mentioned, if I read the wiki right, the Filter
is the correct approach for Solr as we have the schema-able analyzer chain,
so we don't need to code anything, right?

The Analyzer version would be more useful if we were explicitly coding up a
set of operations in Java, so that's what Lucene users directly would tend
to use.

Just in search of confirmation really.


Re: URL search and indexing

2013-06-26 Thread Flavio Pompermaier
Ok thank you all for the great help!
Now I'm ready to start playing with my index!

Best,
Flavio


On Tue, Jun 25, 2013 at 11:40 PM, Jack Krupansky wrote:

> Yeah, URL Classify does only do so much. That's why you need to combine
> multiple methods.
>
> As a fourth method, you could code up a short JavaScript
> "StatelessScriptUpdateProcessor" that did something like take a full
> domain name (such as output by URL Classify) and turn it into multiple
> values, each with more of the prefix removed, so that "lucene.apache.org"
> would index as:
>
> lucene.apache.org
> apache.org
> apache
> .org
> org
>
> And then the user could query by any of those partial domain names.
>
> But, if you simply tokenize the URL (copy the URL string to a text field),
> you automatically get most of that. The user can query by a URL fragment,
> such as "apache.org", ".org", "lucene.apache.org", etc. and the
> tokenization will strip out the punctuation.
>
> I'll add this script to my list of examples to add in the next rev of my
> book.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Flavio Pompermaier
> Sent: Tuesday, June 25, 2013 10:06 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: URL search and indexing
>
> I bought the book and, looking at the example, I still don't understand if it
> is possible to query all sub-URLs of my URL.
> For example, if the URLClassifyProcessorFactory takes in input
> "url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> and makes some outputs like
> - "url_domain_s":"lucene.apache.org"
> - "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
> How should I configure url_domain_s in order to be able to make queries like
> '*.apache.org'?
> How should I configure url_canonical_s in order to be able to make queries
> like 'http://lucene.apache.org/solr/*'?
> Is it better to have two different fields for the two queries, or could I
> create just one field for both kinds of queries (obviously for the former
> case I should then query something like *://.apache.org/*)?
>
>
> On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky *
> *wrote:
>
>  There are examples in my book:
>> http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html
>>
>>
>> But... I still think you should use a tokenized text field as well - use
>> all three: raw string, tokenized text, and URL classification fields.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Flavio Pompermaier
>> Sent: Tuesday, June 25, 2013 9:02 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: URL search and indexing
>>
>>
>> That's sound exactly what I'm looking for! However I cannot find an
>> example
>> of how to use it..could you help me please?
>> Moreover, about id field, isn't true that id field shouldn't be analyzed
>> as
>> suggested in
>> http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document
>>
>> ?
>>
>>
>> On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl 
>> wrote:
>>
>>  Sure you can query the url directly. Or if you choose you can split it up
>>
>>> in multiple components, e.g. using
>>> http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
>>>
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>>
>>> 25. juni 2013 kl. 14:10 skrev Flavio Pompermaier :
>>>
>>> > Sorry but maybe I miss something here..could I declare url as key field
>>> and
>>> > query it too..?
>>> > At the moment, my schema.xml looks like:
>>> >
>>> > 
>>> > >> > required="true" multiValued="false" />
>>> >
>>> >   
>>> >   
>>> >  ...
>>> >   
>>> >
>>> > 
>>> > url
>>> >
>>> > Is it ok? or should I add a "baseurl" field of some kind to be able to
>>> > query all url coming from a certain domain (1st or 2nd level as well)?
>>> >
>>> > Best,
>>> > Flavio
>>> >
>>> >
>>> > On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl 
>>> wrote:
>>>

OOM fieldCache problem

2013-06-26 Thread Markus Klose
Hi all,

I have some memory problems (OOM) with Solr 3.5.0 and I suppose that it has
something to do with the fieldCache. The entries count of the fieldCache
grows and grows, why is it not rebuilt after a commit? I commit every 60
seconds, but the memory consumption of Solr increased within one day from
2GB to 10GB (index size: ~200MB). 

I tried to solve the problem by reducing the other cache sizes (filterCache,
documentCache, queryResultCache). It delayed the OOM exception but it did
not solve the problem that the memory consumption increases continuously. Is
it possible to reset the fieldCache explicitly?

Markus



Re: Shard identification

2013-06-26 Thread Daniel Collins
When you say you move to different machines, did you copy the zoo_data from
your old setup, or did you just start up zookeeper and your shards one by
one?  Also did you use collection API to create the collection or just
start up your cores and let them attach to ZK.  I believe the ZK rules for
assigning shards has changed somewhere around 4.2.  We had a setup with 4.0
and it simply assigned them in order, shard 1, shard 2, shard 3, etc then
when all shards were filled, it started with replicas.

In 4.3 (we skipped the intermediates) the ordering wasn't obvious, I had to
do a bit of trial and error to determine the right order to start things in
order to get shard assignments correct, but that isn't really the
recommended way of doing it.

If you want specific assignments (cores to shards) then I think the core
API/collection API are the recommended way to go.  Create a collection
using the Collection API (http://wiki.apache.org/solr/SolrCloud) and then
copy the data to the right servers once it has assigned the shards (it
should make sure that replicas don't exist on the same machine, and things
like that).
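
For example, something like this (the collection name and counts are
illustrative; the wiki page above has the full parameter list):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=2

The request can be sent to any node in the cluster.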

I believe the general direction (of the next major Solr release) is to
start a system with a blank solr.xml and create cores/collections that way
rather than have a file and then have to connect to ZK and merge the data
with what's there.

We have a slightly odd requirement in that we need to determine the DataDir
for each core, and I haven't yet worked out the right sequence of commands
(Collection API doesn't support DataDir but CoreAPI does). It should be
possible though, just haven't found the time to get to it!



On 25 June 2013 18:40, Erick Erickson  wrote:

> Try sending requests to your shards with &distrib=false. See if the
> results agree with the SolrCloud graph or whether the docs you get
> back are inconsistent with the shard labels in the admin page. The
> &distrib=false bit keeps the query from going to other shards and
> will tell you if the current state is consistent or not.
>
> Best
> Erick
>
> On Tue, Jun 25, 2013 at 1:02 AM, Shalin Shekhar Mangar
>  wrote:
> > Firstly, using 1 zookeeper machine is not at all ideal. See
> > http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7
> >
> > I've never personally seen such an issue. Can you give screen shots of
> > the cloud graph on each node? Use an image hosting service because the
> > mailing list won't allow attachments.
> >
> > On Tue, Jun 18, 2013 at 2:07 PM, Ophir Michaeli 
> wrote:
> >> Hi,
> >>
> >> I built a 2 shards and 2 replicas system that works ok on a local
> machine, 1
> >> zookeeper on shard 1.
> >> It appears ok on the Solr monitoring page, cloud tab
> >> (http://localhost:8983/solr/#/~cloud).
> >> When I move to using different machines, each shard/replica on a
> different
> >> machine I get a wrong cloud-graph on the Solr monitoring page.
> >> The machine that has Shard 2 appears on the graph on shard 1, and the
> >> replicas are also mixed, shard 2 appears as 1 and shard 1 appears as 2.
> >>
> >> Any ideas why this happens?
> >>
> >> Thanks,
> >> Ophir
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
>


Fwd: facet.pivot and facet.sort does not work with fq

2013-06-26 Thread jotpe
Hello again!

The missing pivot facet when sorting by index can also be reproduced in Solr
4.3.1.
Does anyone have an idea how to debug this?

Best regards Johannes

-- Forwarded message --
From: jotpe 
Date: 2013/6/25
Subject: facet.pivot and facet.sort does not work with fq
To: solr-user@lucene.apache.org


Hello

I'm trying to display a hierarchical structure with a facet.pivot, which
should be sorted by index.
I followed the idea from
http://wiki.apache.org/solr/HierarchicalFaceting#Pivot_Facets and created
"path_levelX" fields from 0 to 7.

My tokens are not unique per level and I need to sort them like in the
original structure. So I added a prefix with a sort-order number of static
length and a unique id (always 8 digits). Later this prefix will be hidden by
using substring.
Format: SORTORDER/UNIQUE_ID/NAME_TO_DISPLAY

example:
path_level0:"000/123/Chief"
path_level0:"000/123/Chief"  path_level1:"000/124/Staff"
path_level0:"000/123/Chief"  path_level1:"000/124/Staff"
path_level2:"00/125/Chief"
path_level0:"001/126/Legal Adviser"


Displaying the pivot works fine.

Sorted by count OK
 
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count
Sorted by index OK
 
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=index


Now I must reduce my global structure to one office by using the fq parameter.

Reduced to one office, sorted by count OK
 
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count&fq=office:xyz
Reduced to one office, sorted by index : failure
 
http://localhost:8080/solr/collection1/select?wt=xml&q=*:*&rows=2&facet=on&facet.pivot=path_level1,path_level2,path_level3&facet.pivot.mincount=1&facet.sort=count&fq=office:xyz

The facet.pivot elements stay empty. So what is wrong?



Maybe this is a bug... On the other hand, maybe this is a bad way to obtain
a hierarchical structure with a custom sort. Better ideas?

Best regards Johannes