Indexing a Multivalued field using ContentStreamUpdateRequest in Solr

2015-02-19 Thread Ashish Vishwas Kaduskar
Hello,



I use the code snippet below to index data from a text file into Solr. My data 
is a TSV file with 3 fields: id, title and types. The field "types" is 
multivalued, and its values appear comma-separated within the text file itself.
Here is an example: 123 building house,skyscraper,hut

Here, id is 123, title is "building", and types is house,skyscraper,hut.

How do I modify my code to store types as a multivalued field in Solr?

  // SolrJ 4.x snippet; assumed imports:
  //   org.apache.solr.client.solrj.impl.HttpSolrServer
  //   org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
  //   org.apache.solr.client.solrj.response.SolrResponseBase
  //   org.apache.solr.common.params.ModifiableSolrParams
  //   org.apache.solr.common.util.ContentStream / ContentStreamBase
  //   java.io.File

  // the constructor takes the Solr base URL, not a filesystem path
  HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

  ModifiableSolrParams solrparams = new ModifiableSolrParams();
  solrparams.set("fieldnames", "id,title,types");

  ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update");
  request.setParams(solrparams); // was setParams(params), which does not compile

  ContentStream readFile = new ContentStreamBase.FileStream(new File("myFile.txt"));
  request.addContentStream(readFile);

  SolrResponseBase response = null;
  try {
    response = (SolrResponseBase) request.process(server);
  } catch (Exception e) {
    e.printStackTrace();
  }
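
One possible direction (a sketch, untested): the CSV/TSV update handler can
split a single column into multiple values via its per-field split/separator
parameters, assuming /update/csv is enabled:

  ModifiableSolrParams solrparams = new ModifiableSolrParams();
  solrparams.set("fieldnames", "id,title,types");
  solrparams.set("header", "false");          // the file has no header row
  solrparams.set("separator", "\t");          // the file itself is tab-separated
  solrparams.set("f.types.split", "true");    // split the "types" column into multiple values
  solrparams.set("f.types.separator", ",");   // ...on commas
  ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/csv");
  request.setParams(solrparams);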


Regards,
Ashish



Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: 1. name:DocumentOne^7 => doc1(score=7)
: 2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
: 3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
: 4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10),
: doc2(score=3)
...
: > it's not clear why you need any sort of unique document identification for
: > your scoring algorithm ... from what you described, matches on fieldA should
: > get score "A", matches on fieldB should get score "B" ... why does it matter
: > which doc is which?
: 
: For case #3, for example, the method SimScorer.score is called 3 times for each
: of these documents, 6 times in total for both. I have added a
: ThreadLocal<HashSet<String>> to my custom similarity, which is cleared every
: time before a new scoring session (after each query execution). This HashSet
: stores strings consisting of fieldName + docID. Every time score() is called,

Ah HA! ... this is why it's an XY problem... you've decided that you need 
a unique identifier for each doc so you can maintain a HashSet of all the 
times a doc matches a term in the query so you can count them ... you 
don't need to do any of that.

from all the examples of what you've described, i'm fairly certain all you 
really need is a TFIDF-based Similarity where coord(), idf(), tf() and 
queryNorm() always return 1, and you omitNorms on all fields.

that's it ... that should literally be everything you need to do.
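
A minimal sketch of that idea against the Lucene 4.x API (the class name is
hypothetical; norms are omitted via omitNorms="true" in the schema, not in code):

  import org.apache.lucene.search.similarities.DefaultSimilarity;

  public class FlatSimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return 1f; }
    @Override public float idf(long docFreq, long numDocs) { return 1f; }
    @Override public float coord(int overlap, int maxOverlap) { return 1f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1f; }
  }

wire it into Solr with a <similarity class="..."/> declaration in schema.xml.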

(You didn't give any examples of what you expect to happen with exclusion 
clauses in your BooleanQueries, but the approach you were describing 
wouldn't give you any added advantages for handling MUST_NOT clauses 
either ... it would in fact only increase the scores for those docs in a 
way that is almost certainly not what you want)


-Hoss
http://www.lucidworks.com/


Re: [ANNOUNCE] Apache Gora 0.6 Released

2015-02-19 Thread Talat Uyarer
Congrats!
On Feb 20, 2015 2:59 AM, "Lewis John Mcgibbney" 
wrote:

> Hi Folks,
>
> The Apache Gora team are pleased to announce the immediate availability of
> Apache Gora 0.6.
>
> This release addresses a modest 47 issues 
> with some being major improvements, new functionality and dependency
> upgrades. Most notably the release involves key upgrades to Hadoop, HBase
> and Solr dependencies as well as some extremely important bug fixes for the
> MongoDB module.
>
> Suggested Gora database support is as follows
>
>- Apache Avro 1.7.6
>- Apache Hadoop 1.2.1 and 2.5.2
>- Apache HBase 0.98.8-hadoop2
>- Apache Cassandra 2.0.2
>- Apache Solr 4.10.3
>- MongoDB 2.6.X
>- Apache Accumulo 1.5.1
>
> Gora is released as both source code, downloads for which can be found at
> our downloads page  as well as
> Maven artifacts which can be found on Maven central
> .
> Thank you
> Lewis
> (on behalf of Gora PMC)
>
>
> --
> *Lewis*
>


[ANNOUNCE] Apache Gora 0.6 Released

2015-02-19 Thread Lewis John Mcgibbney
Hi Folks,

The Apache Gora team are pleased to announce the immediate availability of
Apache Gora 0.6.

This release addresses a modest 47 issues 
with some being major improvements, new functionality and dependency
upgrades. Most notably the release involves key upgrades to Hadoop, HBase
and Solr dependencies as well as some extremely important bug fixes for the
MongoDB module.

Suggested Gora database support is as follows

   - Apache Avro 1.7.6
   - Apache Hadoop 1.2.1 and 2.5.2
   - Apache HBase 0.98.8-hadoop2
   - Apache Cassandra 2.0.2
   - Apache Solr 4.10.3
   - MongoDB 2.6.X
   - Apache Accumulo 1.5.1

Gora is released as both source code, downloads for which can be found at
our downloads page  as well as Maven
artifacts which can be found on Maven central
.
Thank you
Lewis
(on behalf of Gora PMC)


-- 
*Lewis*


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

how are you defining/specifying these field weights?


I define weights inside of a query (name:SomeName^7).



it would help if you could give a concrete example of some sample docs, a
sample query, and what results you would expect ... the sample input and
sample output of the system you are interested in.


Sure. Imagine we have 2 docs:

doc1
-
name:DocumentOne
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

doc2
-
name:DocumentTwo
place:34 High Street (StandardTokenizerFactory, i.e. 3 tokens created)

I want the following queries to return docs with these scores:

1. name:DocumentOne^7 => doc1(score=7)
2. name:DocumentOne^7 AND place:notExist^3 => doc1(score=7)
3. place:(34\ High\ Street)^3 => doc1(score=3), doc2(score=3)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 => doc1(score=10), 
doc2(score=3)



If you're curious why I need it, i.e. about my very initial 
"problem X": I need this scoring to be able to calculate a matching 
percentage. That's a separate topic; I have read a lot about it (including 
http://wiki.apache.org/lucene-java/ScoresAsPercentages) and people say 
it's either not doable or very complicated with Solr. So I just 
want to give it a try. For case #3 above, the matching percentage is 
100% for both docs. For case #4 it's doc1: 100% and doc2: 30%.




it's not clear why you need any sort of unique document identification for
your scoring algorithm ... from what you described, matches on fieldA should
get score "A", matches on fieldB should get score "B" ... why does it matter
which doc is which?


For case #3, for example, the method SimScorer.score is called 3 times for 
each of these documents, 6 times in total for both. I have added a 
ThreadLocal<HashSet<String>> to my custom similarity, which is cleared 
before every new scoring session (after each query execution). This 
HashSet stores strings consisting of fieldName + docID. Every time 
score() is called, I check this HashSet - if fieldName + docID exists, I 
return 0 as the score, otherwise the field weight.
If there were no docID in this string (only the field name), then case #3 
would return the following: doc1(score=3), doc2(score=0). If there were 
no HashSet at all, case #3 would return: doc1(score=9), doc2(score=9), 
since the query matched all 3 tokens for every doc.


I know that what I'm doing is a "hack", but that's the only way I've 
found so far to implement percentage matching. I just want to play 
around with it, see how it performs and decide whether to use it or not. 
But for that I need to uniquely identify a document while scoring :)
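
For concreteness, a rough sketch of the state described above (simplified;
names are hypothetical, java.util.Set/HashSet assumed):

  private final ThreadLocal<Set<String>> scoredOnce = new ThreadLocal<Set<String>>() {
    @Override protected Set<String> initialValue() { return new HashSet<String>(); }
  };

  float scoreOnce(String fieldName, int docId, float fieldWeight) {
    // Set.add returns true only the first time this field+doc pair is seen
    return scoredOnce.get().add(fieldName + ":" + docId) ? fieldWeight : 0f;
  }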


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: Sure, sorry I did not do it before; I just wanted to take a minimum of your
: valuable time. So in my custom Similarity class I am trying to implement
: logic where the score calculation is based only on field weight and a field
: match - that's it. In other words, if a field matches the query, I want the
: "score" method to return this field's weight only, regardless of factors like
: norms, coord, doc frequencies, the fact that the field was multivalued and more
: than one value matched, or the fact that the field was tokenized into multiple
: tokens and more than one token matched. As far as I know, there is no such
: similarity among the existing ones.

how are you defining/specifying these field weights?

it would help if you could give a concrete example of some sample docs, a 
sample query, and what results you would expect ... the sample input and 
sample output of the system you are interested in.

: In order to implement this, I am trying to score only once for a combination
: of a specific field + doc unique identifier. And I don't care what this
: unique doc identifier is - it can be the unique key or the internal doc ID.

it's not clear why you need any sort of unique document identification for 
your scoring algorithm ... from what you described, matches on fieldA should 
get score "A", matches on fieldB should get score "B" ... why does it matter 
which doc is which?



-Hoss
http://www.lucidworks.com/


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro
Thank you for your answer, Chris. I will reply with inline comments as 
well. Please see below.



: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get the value of the unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description
is a bit vague, and sounds like it may be an "XY Problem"...


Sure, sorry I did not do it before; I just wanted to take a minimum of 
your valuable time. So in my custom Similarity class I am trying to 
implement logic where the score calculation is based only on field 
weight and a field match - that's it. In other words, if a field matches 
the query, I want the "score" method to return this field's weight only, 
regardless of factors like norms, coord, doc frequencies, the fact that the 
field was multivalued and more than one value matched, or the fact that the 
field was tokenized into multiple tokens and more than one token matched. 
As far as I know, there is no such similarity among the existing ones.
In order to implement this, I am trying to score only once for a 
combination of a specific field + doc unique identifier. And I don't 
care what this unique doc identifier is - it can be the unique key or 
the internal doc ID.
I had my implementation working, but as I understood from your answer, I 
had it working only for one segment. So now I need to add segment ID or 
something like this to my combination.




Assuming the method you are referring to (you didn't give a specific
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

 doc - document id within the inverted index segment
 freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.


Yes, it's ExactSimScorer.score(int doc, int freq). Ah! Per segment! Here 
we go; now I understand why it's 0 after every new commit! The Solr docs say 
new docs are written to a new segment. Then question #1 is clear for me. 
Thanks, Chris!




for #2: the method for providing a SimScorer to lucene is by implementing
Similarity.simScorer(...) -- that method gets as an argument an
AtomicReaderContext context, which not only has an AtomicReader for the
individual segment, but also details about that segment's role in the
larger index.


Interesting details, that may be exactly what I need. If I can somehow 
uniquely identify a document using its internal doc id + data from the 
context (like a segment id or something), that would be awesome. I have 
checked AtomicReaderContext; it has 'ord' (the reader's ord in the 
top-level's leaves array) and 'docBase' (the reader's absolute doc base) 
- probably what I need. Do you have any more information (maybe links to 
wikis) about this AtomicReaderContext, DocValues, "low" and "top" levels 
(other than the javadoc in the source code)? I have a high-level 
understanding, but it's obviously not enough for the problem I am solving. 
I would be more than happy to understand it.


Thank you very much for your time, Chris, and everyone else who has spent 
time reading/answering this thread!


Re: Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread Chris Hostetter

: I need to uniquely identify a document inside of a Similarity class during
: scoring. Is it possible to get value of unique key of a document at this
: point?

Can you tell us a bit more about your usecase ... your problem description 
is a bit vague, and sounds like it may be an "XY Problem"...

https://people.apache.org/~hossman/#xyproblem
Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

: 1. Is the docIds behavior described above a bug or a feature? Obviously, if it's a
: bug and I can use the docID to uniquely identify a document, then my question is
: answered once this bug is fixed.
: 2. If the docIds behavior described above is normal, then what is an alternative
: way of uniquely identifying a document inside of a Similarity class during
: scoring? Can I get the unique key of a scoring document in a Similarity?

Assuming the method you are referring to (you didn't give a specific 
class/interface name) is SimScorer.score(doc,freq) then the javadocs say...

doc - document id within the inverted index segment
freq - sloppy term frequency

...so for #1, yes this is definitely the per-segment docId.

for #2: the method for providing a SimScorer to lucene is by implementing 
Similarity.simScorer(...) -- that method gets as an argument an 
AtomicReaderContext context, which not only has an AtomicReader for the 
individual segment, but also details about that segment's role in the 
larger index.

As far as getting the Solr uniqueKey ... it's non-trivial, and there are 
different things you could do depending on what your ultimate goal is (ie: 
see my earlier question about the XY problem) ... my guess is that from 
this low level down in the code you want to use DocValues (aka: FieldCache 
in older versions of lucene) on your uniqueKey field, then ask it for the 
field value of each internal docId that gets passed to your method -- 
either by using the per-segment DocValues, or by using the 
AtomicReaderContext's base information to determine the "top level" 
internal docId and use the "top level" DocValues/FieldCache

(the per-segment vs "top level" DocValues and internalId stuff can be kind 
of confusing -- start with whichever seems simpler based on your 
understanding of the internal lucene/solr APIs and worry about maybe 
switching to the other approach later once you have something working and 
see if it helps or hinders performance for your usecases)
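
as a hedged sketch of just the docBase arithmetic (Lucene 4.x; class and
method names are hypothetical):

  import org.apache.lucene.index.AtomicReaderContext;

  public final class DocIds {
    // a per-segment docId plus that segment's docBase is unique across the
    // whole top-level reader (for as long as that reader stays open)
    public static int toTopLevel(AtomicReaderContext context, int segmentDocId) {
      return context.docBase + segmentDocId;
    }
  }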

-Hoss
http://www.lucidworks.com/


Re: Solr date retrieve back UTC

2015-02-19 Thread vsriram30
Thanks Chris for additional info.

Thanks,
Sriram





Getting unique key of a document inside of a Similarity class.

2015-02-19 Thread J-Pro

Good afternoon.

I need to uniquely identify a document inside of a Similarity class 
during scoring. Is it possible to get the value of the unique key of a document 
at this point?


For some time I thought I could use the internal docID to achieve that. 
The method score(int doc, float freq) is called after every query execution 
for each matched doc. For each indexed doc it equals 0, 1, 2, etc. But 
this is only when documents are indexed in bulk, i.e. in a single HTTP 
request. When docs are indexed in separate requests, these docIds 
equal 0 for all documents.


To summarize, here are 2 final questions:

1. Is the docIds behavior described above a bug or a feature? Obviously, if 
it's a bug and I can use the docID to uniquely identify a document, then my 
question is answered once this bug is fixed.
2. If the docIds behavior described above is normal, then what is an 
alternative way of uniquely identifying a document inside of a Similarity 
class during scoring? Can I get the unique key of a scoring document in a 
Similarity?


FYI: I have asked the 1st question in the #solr IRC channel. The person named 
hoss answered the following: "you're seeing the *internal* docIds ... 
you can't assign any special meaning to them ... i believe that at the 
level of the Similarity class, these may even be per segment, which 
means that in the context of a SegmentReader they can be used to get 
things like docValues, but they don't have any meaning compared to your 
uniqueKey (for example)". This kinda makes me think that the answer to the 
1st question is "it's a feature". But I am still not sure, and I don't know 
the answer to the 2nd question. Please help.


Thank you very much in advance.


Re: Solr date retrieve back UTC

2015-02-19 Thread Chris Hostetter

: to get the UTC back, which I thought might not be required since the
: cDate field in that Date class already holds the UTC date.

general suggestion: your life will be a lot easier if you stop looking at 
the implementation details of JVM classes -- just because your current JVM 
implements the Date class with some internal "cDate" field doesn't mean 
that the next JVM release (or some JVM sold by another company) will be 
implemented the same way.

: The toString() doesn't actually give me the timestamp in UTC format. It gives,

the representation returned by toString() is entirely dependent on your 
locale.  please read my comment fully.  there are *LOTS* of tutorials on 
the internet about dealing with Date & DateFormat objects in java...


>> Date objects in Java do not have any intrinsic TimeZone -- they 
>> represent absolute fixed moments in time.  to "see" a Date in UTC (or 
>> any other time zone) you must convert it to a String -- either by using 
>> the default "toString()" representation, or by using a DateFormat.



-Hoss
http://www.lucidworks.com/


Re: Solr date retrieve back UTC

2015-02-19 Thread vsriram30
Thanks Chris for your quick reply. As you said, I need to do some conversion
to get the UTC back, which I thought might not be required since the
cDate field in that Date class already holds the UTC date.

The toString() doesn't actually give me the timestamp in UTC format. It gives:

Mon Sep 15 12:52:08 PDT 2014

Thanks,
V.Sriram





Re: Solr date retrieve back UTC

2015-02-19 Thread Chris Hostetter

: But when I use SolrJ and get it as an object, I am seeing that the UTC date is
: of type Date, and I am not able to retrieve the UTC date back from it; I
: get only a long timestamp from that object.
: 
: I also see a private variable in that Date class called cDate which has
: what I want (the date in UTC format). But I am not able to get the UTC value
: out of that variable. Is there any better way to get the UTC timestamp out of
: that field?

Date objects in Java do not have any intrinsic TimeZone -- they represent 
absolute fixed moments in time.  to "see" a Date in UTC (or any other 
time zone) you must convert it to a String -- either by using the default 
"toString()" representation, or by using a DateFormat.




-Hoss
http://www.lucidworks.com/


Solr date retrieve back UTC

2015-02-19 Thread vsriram30
Hi,

I have a date field in my Solr schema and I am indexing a proper UTC
date into that field. If I query Solr directly, I am able to see the
field with the UTC time in the JSON response.

But when I use SolrJ and get it as an object, I am seeing that the UTC date is
of type Date, and I am not able to retrieve the UTC date back from it; I
get only a long timestamp from that object.

I also see a private variable in that Date class called cDate which has
what I want (the date in UTC format). But I am not able to get the UTC value
out of that variable. Is there any better way to get the UTC timestamp out of
that field?

Thanks,
Sriram





Re: Solr Lazy startup - load-on-startup missing from web.xml?

2015-02-19 Thread Chris Hostetter
: Hi! Solr is starting up "dormant" for me, until a client wakes it up with a
: REST request, or I open the admin UI; only then does the remaining
: initialization happen.
: Is it something known?

based on my recollection of the servlet spec, that sounds like a 
bug/glitch/config option in your Servlet container...

Googling "WebSphere init Filters on startup" turns up this IBM bug report 
with noted fix versions...
http://www-01.ibm.com/support/docview.wss?uid=swg1PK86553


: I can't see any load-on-startup in the web.xml, in Solr.war.

The bulk of Solr exists as a "Filter".  Filters are not permitted 
by the servlet spec to specify a "load-on-startup" value (only 
Servlets can specify that, and the only Servlets in Solr are for 
supporting legacy paths -- the load order doesn't matter for them)


: Running Solr 4.7.2 over WebSphere 8.5
: 
: App loading message as the server starts up:
: [2/*16*/15 12:17:19:956 GMT] 0056 ApplicationMg A   WSVR0221I:
: Application started: solr-4.7.2
: [2/*16*/15 12:17:20:319 GMT] 0001 WsServerImpl  A   WSVR0001I:
: Server serverSolr open for e-business
: The next startup message in the log is on the next day, once I enter the
: Solr admin UI:
: [2/*17*/15 10:20:13:827 GMT] 0098 SolrDispatchF I
: org.apache.solr.servlet.SolrDispatchFilter init SolrDispatchFilter.init()
: ...
: 

-Hoss
http://www.lucidworks.com/


Re: Collections API - HTTP verbs

2015-02-19 Thread Hrishikesh Gadre
Thanks Mark and Scott. Adding quotes around the URL fixed the problem.

Regards
Hrishikesh

On Thu, Feb 19, 2015 at 7:30 AM, Scott Dawson  wrote:

> Hrishikesh,
> If you're running on Linux or Unix, the first ampersand in the URL is
> interpreted as the shell's "run this in the background" operator and
> anything beyond the ampersand will not be passed to curl. So Mark is right
> -- put single quotes around the URL so that it's not interpreted by the
> shell.
>
> Regards,
> Scott
>
> On Wed, Feb 18, 2015 at 9:31 PM, Mark Miller 
> wrote:
>
> > Perhaps try quotes around the url you are providing to curl. It's not
> > complaining about the http method - Solr has historically always taken
> > simple GET's for http - for good or bad, you pretty much only post
> > documents / updates.
> >
> > It's saying the name param is required and not being found and since you
> > are trying to specify the name, I'm guessing something about the command
> is
> > not working. You might try just shoving it in a browser url bar as well.
> >
> > - Mark
> >
> > On Wed Feb 18 2015 at 8:56:26 PM Hrishikesh Gadre 
> > wrote:
> >
> > > Hi,
> > >
> > > Can we please document which HTTP method is supposed to be used with
> each
> > > of these APIs?
> > >
> > > https://cwiki.apache.org/confluence/display/solr/Collections+API
> > >
> > > I am trying to invoke the following API
> > >
> > > curl http://
> > >
> :8983/solr/admin/collections?action=CLUSTERPROP&name=urlScheme&
> > > val=https
> > >
> > > This request is failing due to the following error,
> > >
> > > 2015-02-18 17:29:39,965 INFO
> org.apache.solr.servlet.SolrDispatchFilter:
> > > [admin] webapp=null path=/admin/collections params={action=CLUSTERPROP}
> > > status=400 QTime=20
> > >
> > > org.apache.solr.core.SolrCore: org.apache.solr.common.SolrException:
> > > Missing required parameter: name
> > >
> > > at
> > > org.apache.solr.common.params.RequiredSolrParams.get(
> > > RequiredSolrParams.java:49)
> > >
> > > at
> > > org.apache.solr.common.params.RequiredSolrParams.check(
> > > RequiredSolrParams.java:153)
> > >
> > > at
> > > org.apache.solr.handler.admin.CollectionsHandler.handleProp(
> > > CollectionsHandler.java:238)
> > >
> > > at
> > > org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(
> > > CollectionsHandler.java:200)
> > >
> > > at
> > > org.apache.solr.handler.RequestHandlerBase.handleRequest(
> > > RequestHandlerBase.java:135)
> > >
> > > at
> > > org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(
> > > SolrDispatchFilter.java:770)
> > >
> > > at
> > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> > > SolrDispatchFilter.java:271)
> > >
> > > I am using Solr 4.10.3 version.
> > >
> > > Thanks
> > >
> > > Hrishikesh
> > >
> >
>


Re: what order does solr return the results in if the search is *:*

2015-02-19 Thread Erik Hatcher
It’ll return them in the order they were indexed, generally.  If documents are 
being updated (delete/re-add, effectively) then the order would change, but by 
default they are still ordered as they are in the underlying Lucene index.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 




> On Feb 19, 2015, at 11:57 AM, Tang, Rebecca  wrote:
> 
> If a user searches for *:*, what order does solr return the results in?  I 
> expected the results to be returned in index order.  (I indexed the documents 
> in the order of the numeric document id from 0 -> ~15,000,000.)
> 
> So when I searched with *:*, I expected the first 10 documents returned to 
> have ids from 0 -> 9.
> 
> But the first 10 ids were:
> 146263 146266 146254 146264 146265 146274 146277 146271 146279 146268
> 
> Documents seem to be consistently returned in this order.
> 
> How does solr order the results when the search is a blanket *:*?
> 
> Rebecca Tang
> Applications Developer, UCSF CKM
> Industry Documents Digital Libraries
> E: rebecca.t...@ucsf.edu
> 



Re: what order does solr return the results in if the search is *:*

2015-02-19 Thread Erick Erickson
I'm pretty sure by internal document id, which changes upon segment
merge. If you depend on this, you need to include a field at index
time that'll be unchanging and then sort on that.
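
A hedged SolrJ sketch of that (the "id" field is just an example of such a
stable, sortable field):

  import org.apache.solr.client.solrj.SolrQuery;

  // don't rely on internal docids; sort on an unchanging field of your own
  SolrQuery q = new SolrQuery("*:*");
  q.setSort("id", SolrQuery.ORDER.asc);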

bq: Documents seem to be consistently returned in this order

This won't be invariant as the index changes and segments get merged.

Best,
Erick

On Thu, Feb 19, 2015 at 8:57 AM, Tang, Rebecca  wrote:
> If a user searches for *:*, what order does solr return the results in?  I 
> expected the results to be returned in index order.  (I indexed the documents 
> in the order of the numeric document id from 0 -> ~15,000,000.)
>
> So when I searched with *:*, I expected the first 10 documents returned to 
> have ids from 0 -> 9.
>
> But the first 10 ids were:
> 146263 146266 146254 146264 146265 146274 146277 146271 146279 146268
>
> Documents seem to be consistently returned in this order.
>
> How does solr order the results when the search is a blanket *:*?
>
> Rebecca Tang
> Applications Developer, UCSF CKM
> Industry Documents Digital Libraries
> E: rebecca.t...@ucsf.edu
>


Re: is there a constant for _vesion_-fieldname?

2015-02-19 Thread Erick Erickson
Grepping shows
VersionInfo.VERSION_FIELD
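
A hedged usage sketch of that constant (package path as I recall it):

  import org.apache.solr.update.VersionInfo;

  // resolves to the literal field name "_version_"
  String versionField = VersionInfo.VERSION_FIELD;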

Best,
Erick

On Thu, Feb 19, 2015 at 1:45 AM, Clemens Wyss DEV  wrote:
> Does Solr provide a (Java) constant for "the name of the version field" (i.e. 
> _version_)?


what order does solr return the results in if the search is *:*

2015-02-19 Thread Tang, Rebecca
If a user searches for *:*, what order does solr return the results in?  I 
expected the results to be returned in index order.  (I indexed the documents 
in the order of the numeric document id from 0 -> ~15,000,000.)

So when I searched with *:*, I expected the first 10 documents returned to have 
ids from 0 -> 9.

But the first 10 ids were:
146263 146266 146254 146264 146265 146274 146277 146271 146279 146268

Documents seem to be consistently returned in this order.

How does solr order the results when the search is a blanket *:*?

Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu



Re: Collections API - HTTP verbs

2015-02-19 Thread Scott Dawson
Hrishikesh,
If you're running on Linux or Unix, the first ampersand in the URL is
interpreted as the shell's "run this in the background" operator and
anything beyond the ampersand will not be passed to curl. So Mark is right
-- put single quotes around the URL so that it's not interpreted by the
shell.

Regards,
Scott

On Wed, Feb 18, 2015 at 9:31 PM, Mark Miller  wrote:

> Perhaps try quotes around the url you are providing to curl. It's not
> complaining about the http method - Solr has historically always taken
> simple GET's for http - for good or bad, you pretty much only post
> documents / updates.
>
> It's saying the name param is required and not being found and since you
> are trying to specify the name, I'm guessing something about the command is
> not working. You might try just shoving it in a browser url bar as well.
>
> - Mark
>
> On Wed Feb 18 2015 at 8:56:26 PM Hrishikesh Gadre 
> wrote:
>
> > Hi,
> >
> > Can we please document which HTTP method is supposed to be used with each
> > of these APIs?
> >
> > https://cwiki.apache.org/confluence/display/solr/Collections+API
> >
> > I am trying to invoke the following API
> >
> > curl http://
> > :8983/solr/admin/collections?action=CLUSTERPROP&name=urlScheme&
> > val=https
> >
> > This request is failing due to the following error,
> >
> > 2015-02-18 17:29:39,965 INFO org.apache.solr.servlet.SolrDispatchFilter:
> > [admin] webapp=null path=/admin/collections params={action=CLUSTERPROP}
> > status=400 QTime=20
> >
> > org.apache.solr.core.SolrCore: org.apache.solr.common.SolrException:
> > Missing required parameter: name
> >
> > at
> > org.apache.solr.common.params.RequiredSolrParams.get(
> > RequiredSolrParams.java:49)
> >
> > at
> > org.apache.solr.common.params.RequiredSolrParams.check(
> > RequiredSolrParams.java:153)
> >
> > at
> > org.apache.solr.handler.admin.CollectionsHandler.handleProp(
> > CollectionsHandler.java:238)
> >
> > at
> > org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(
> > CollectionsHandler.java:200)
> >
> > at
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(
> > RequestHandlerBase.java:135)
> >
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(
> > SolrDispatchFilter.java:770)
> >
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> > SolrDispatchFilter.java:271)
> >
> > I am using Solr 4.10.3 version.
> >
> > Thanks
> >
> > Hrishikesh
> >
>


Re: Committed before 500

2015-02-19 Thread Shawn Heisey
On 2/19/2015 6:30 AM, NareshJakher wrote:
> I am using Solr cloud with 3 nodes; at times the following error is observed
> in the logs during delete operations. Is it a performance issue? What can be
> done to resolve this issue?
> 
> "Committed before 500 {msg=Software caused connection abort: socket write
> error,trace=org.eclipse.jetty.io.EofException"
> 
> I did search old topics but couldn't find anything concrete related to
> Solr cloud. I would appreciate any help on this issue as I am relatively new
> to Solr.

A jetty EofException indicates that one specific thing is happening:

The TCP connection from the client was severed before Solr responded to
the request.  Usually this happens because the client has been
configured with an absolute timeout or an inactivity timeout, and the
timeout was reached.

Configuring timeouts so that you can be sure clients don't get stuck is
a reasonable idea, but any configured timeouts should be VERY long.
You'd want to use a value like five minutes, rather than 10, 30, or 60
seconds.

The timeouts MIGHT be in the HttpShardHandler config that Solr and
SolrCloud use for distributed searches, and they also might be in
operating-system-level config.

https://wiki.apache.org/solr/SolrConfigXml?highlight=%28HttpShardHandler%29#Configuration_of_Shard_Handlers_for_Distributed_searches
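
As a hedged illustration of the client side (SolrJ 4.x; the values are only
examples of "very long"):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  HttpSolrServer server = new HttpSolrServer("http://host:8983/solr/collection1");
  server.setConnectionTimeout(15000);  // ms allowed for establishing the TCP connection
  server.setSoTimeout(5 * 60 * 1000);  // socket inactivity timeout: five minutes, not 30s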

Thanks,
Shawn



Re: Divide 4 Nodes into 100 nodes in Solr Cloud

2015-02-19 Thread Nitin Solanki
Okay, thanks Shawn.

On Thu, Feb 19, 2015 at 7:59 PM, Shawn Heisey  wrote:

> On 2/19/2015 4:18 AM, Nitin Solanki wrote:
> > Sorry, I think you are both talking about shard splitting, but I want node
> > splitting. I have 4 nodes, each with 2 shards. So now I want 100 nodes from
> > those 4 nodes, each having 2 shards. Any ideas?
>
> Node splitting does not exist as a discrete command, but shard splitting
> is the first step in node splitting.  The full procedure would be:
>
> *) Split one or more shards.  Wait for that to complete.
> *) Do the ADDREPLICA action for some of the new shards to other hosts.
> *) Wait for the replication to the new core(s) to complete
> *) Do the DELETEREPLICA action for those shards on the original hosts.
> *) Delete the originally-split shard(s) at your leisure.
>
> The overall procedure will be labor intensive and might be prone to
> error, plus as already mentioned, the core names might become very
> convoluted.  It is MUCH cleaner to reindex into a new collection.
>
> Thanks,
> Shawn
>
>


Re: Divide 4 Nodes into 100 nodes in Solr Cloud

2015-02-19 Thread Shawn Heisey
On 2/19/2015 4:18 AM, Nitin Solanki wrote:
> Sorry, I think you are both talking about shard splitting, but I want node
> splitting. I have 4 nodes, each with 2 shards. So now I want 100 nodes from
> those 4 nodes, each having 2 shards. Any ideas?

Node splitting does not exist as a discrete command, but shard splitting
is the first step in node splitting.  The full procedure would be:

*) Split one or more shards.  Wait for that to complete.
*) Do the ADDREPLICA action for some of the new shards to other hosts.
*) Wait for the replication to the new core(s) to complete
*) Do the DELETEREPLICA action for those shards on the original hosts.
*) Delete the originally-split shard(s) at your leisure.

The overall procedure will be labor intensive and might be prone to
error, plus as already mentioned, the core names might become very
convoluted.  It is MUCH cleaner to reindex into a new collection.

Thanks,
Shawn



Committed before 500

2015-02-19 Thread NareshJakher
I am using Solr cloud with 3 nodes; at times the following error is observed
in the logs during delete operations. Is it a performance issue? What can be
done to resolve this issue?

"Committed before 500 {msg=Software caused connection abort: socket write
error,trace=org.eclipse.jetty.io.EofException"

I did search old topics but couldn't find anything concrete related to
Solr cloud. I would appreciate any help on this issue as I am relatively new
to Solr.








Re: Question on CloudSolrServer API

2015-02-19 Thread Shalin Shekhar Mangar
No, you should reuse the same CloudSolrServer instance for all requests. It
is a thread-safe object. You could also create a static/common HttpClient
instance and pass it to the constructor of CloudSolrServer, but even if you
don't, it will create one internally and use it for all requests so that
connections can be pooled.
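
A hedged sketch of that reuse pattern (SolrJ 4.x; the ZooKeeper address and
collection name are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SolrClientHolder {
    private static CloudSolrServer solr;  // created once, shared by all threads

    public static synchronized CloudSolrServer get() throws Exception {
      if (solr == null) {
        solr = new CloudSolrServer("zk1:2181,zk2:2181/solr");
        solr.setDefaultCollection("mycollection");
      }
      return solr;
    }

    public static QueryResponse search(String q) throws Exception {
      return get().query(new SolrQuery(q));  // no per-request connect/close needed
    }
  }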
On 19-Feb-2015 1:44 pm, "Manohar Sripada"  wrote:

> Hi All,
>
> I am using the CloudSolrServer API of the SolrJ library from my application to
> query Solr. Here, I am creating a new connection to Solr for every search
> that I do. Once I get the results, I close the connection.
>
> Is this the correct way? How does Solr create connections internally? Does
> it maintain a pool of connections (if so how to configure it)?
>
> Thanks,
> Manohar
>


Auto-correct the phrase/query

2015-02-19 Thread Nitin Solanki
Hello,
  I want to do the same kind of phrase/spell correction as Google. If
anyone types the query "the dark night", I need a suggestion like "the dark
knight" from Solr. Is there any way to do this?


Re: Divide 4 Nodes into 100 nodes in Solr Cloud

2015-02-19 Thread Nitin Solanki
Hi Yago & Shawn,
   Sorry, I think you are both talking about shard splitting,
but I want node splitting. I have 4 nodes, each with 2 shards. So now I want
100 nodes from those 4 nodes, each having 2 shards. Any ideas?


On Wed, Feb 18, 2015 at 9:25 PM, Shawn Heisey  wrote:

> On 2/18/2015 8:17 AM, Nitin Solanki wrote:
> > I have created 4 nodes having 8 shards. Now, I want to divide those
> > 4 nodes into 100 nodes without any failure or re-indexing the data. Any
> > help please?
>
> I think your only real option within a strict interpretation of your
> requirements is shard splitting.  You will probably have to do it
> several times, and the resulting core names could get very ugly.
>
>
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud#ShardsandIndexingDatainSolrCloud-ShardSplitting
>
> Reindexing is a LOT cleaner and is likely to work better.  If you build
> a new collection sharded the way you want across all the new nodes, you
> can delete the old collection and set up an alias pointing the old name
> at the new collection, no need to change any applications, as long as
> they use the collection name rather than the actual core names.  The
> delete and alias might take long enough that there would be a few
> seconds of downtime, but that's probably all you'd see.  Both indexing
> and queries would work with the alias.
>
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-DeleteaCollection
>
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection
>
> Thanks,
> Shawn
>
>


Re: How to place whole indexed data on cache

2015-02-19 Thread Nitin Solanki
Thanks Dominique. Got your point.

On Wed, Feb 18, 2015 at 11:55 PM, Dominique Bejean <
dominique.bej...@eolya.fr> wrote:

> Hi,
>
> As Shawn said, install enough memory so that all free memory
> (non-heap memory) can be used as disk cache.
> Use at most 40% of the available memory for heap memory (the Xmx JVM
> parameter), but never more than 32 GB.
>
> And avoid letting your server swap.
> For most Linux systems, this is configured using the /etc/sysctl.conf
> value:
> vm.swappiness = 1
> This prevents swapping under normal circumstances, but still allows the OS
> to swap under emergency memory situations.
> A swappiness of 1 is better than 0, since on some kernel versions a
> swappiness of 0 can invoke the OOM-killer.
>
> http://askubuntu.com/questions/103915/how-do-i-configure-swappiness
>
> http://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default
>
> Dominique
> http://www.eolya.fr/
>
>
> 2015-02-18 14:39 GMT+01:00 Shawn Heisey :
>
> > On 2/18/2015 4:20 AM, Nitin Solanki wrote:
> > > How can I place the whole indexed data in cache so that when I
> > > search any query I get the response, suggestions, and collations
> > > rapidly? And how can I view which documents are in the cache and
> > > verify that?
> >
> > Simply install enough extra memory in your machine for the entire index
> > to fit in RAM that is not being used by programs ... and then do NOT
> > allocate that extra memory to any program.
> >
> > The operating system will automatically do the caching for you as part
> > of normal operation, no config required.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
> >
> > Relevant articles referenced by that wiki page:
> >
> > http://en.wikipedia.org/wiki/Page_cache
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: spellcheck.count v/s spellcheck.alternativeTermCount

2015-02-19 Thread Nitin Solanki
I have 48GB of indexed data.
I have set spellcheck.count=1 & spellcheck.alternativeTermCount=10, but I am
getting only 1 suggestion in the suggestion block, though suggestions for
collations are coming.

*PFA* for details.

On Thu, Feb 19, 2015 at 1:50 AM, Dyer, James 
wrote:

> It will try to give you suggestions up to the number you specify, but if
> fewer are available it will not give you any more.
>
> James Dyer
> Ingram Content Group
>
> -Original Message-
> From: Nitin Solanki [mailto:nitinml...@gmail.com]
> Sent: Tuesday, February 17, 2015 11:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: spellcheck.count v/s spellcheck.alternativeTermCount
>
> Thanks James,
>   I tried the same thing
> spellcheck.count=10&spellcheck.alternativeTermCount=5. And I got 5
> suggestions for both "life" and "hope", not the described behavior: *The
> spellchecker will try to return you up to 10 suggestions for "hope", but
> only up to 5 suggestions for "life".*
>
>
> On Wed, Feb 18, 2015 at 1:10 AM, Dyer, James  >
> wrote:
>
> > Here is an example to illustrate what I mean...
> >
> > - query q=text:(life AND
> > hope)&spellcheck.count=10&spellcheck.alternativeTermCount=5
> > - suppose at least one document in your dictionary field has "life" in it
> > - also suppose zero documents in your dictionary field have "hope" in
> them
> > - The spellchecker will try to return you up to 10 suggestions for
> "hope",
> > but only up to 5 suggestions for "life"
> >
> > James Dyer
> > Ingram Content Group
> >
> >
> > -Original Message-
> > From: Nitin Solanki [mailto:nitinml...@gmail.com]
> > Sent: Tuesday, February 17, 2015 11:35 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: spellcheck.count v/s spellcheck.alternativeTermCount
> >
> > Hi James,
> > How can you say that "count" doesn't use the
> > index/dictionary? Where do the suggestions come from, then?
> >
> > On Tue, Feb 17, 2015 at 10:29 PM, Dyer, James <
> > james.d...@ingramcontent.com>
> > wrote:
> >
> > > See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.count
> and
> > > the following section, for details.
> > >
> > > Briefly, "count" is the # of suggestions it will return for terms that
> > are
> > > *not* in your index/dictionary.  "alternativeTermCount" is the # of
> > > alternatives you want returned for terms that *are* in your dictionary.
> > > You can set them to the same value, unless you want fewer suggestions
> > when
> > > the terms is in the dictionary.
> > >
> > > James Dyer
> > > Ingram Content Group
> > >
> > > -Original Message-
> > > From: Nitin Solanki [mailto:nitinml...@gmail.com]
> > > Sent: Tuesday, February 17, 2015 5:27 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: spellcheck.count v/s spellcheck.alternativeTermCount
> > >
> > > Hello Everyone,
> > >   I got confusion between spellcheck.count and
> > > spellcheck.alternativeTermCount in Solr. Any help in details?
> > >
> >
>


is there a constant for _vesion_-fieldname?

2015-02-19 Thread Clemens Wyss DEV
Does Solr provide a (Java) constant for "the name of the version field" (i.e. 
_version_)?


Re: Discrepancy between Full import and Delta import query

2015-02-19 Thread Aniket Bhoi
On Tue, Feb 17, 2015 at 8:21 PM, Aniket Bhoi  wrote:

> Hi Folks,
>
> I am running Solr 3.4 and using DIH for importing data from a SQL server
> backend.
>
> The query for the Full import and the Delta import is the same, i.e. both
> pull the same data.
>
> Full and Delta import query:
>
> SELECT KB_ENTRY.ADDITIONAL_INFO ,KB_ENTRY.KNOWLEDGE_REF
> ID,SU_ENTITY_TYPE.REF ENTRY_TYPE_REF,KB_ENTRY.PROFILE_REF,
> KB_ENTRY.ITEM_REF, KB_ENTRY.TITLE, KB_ENTRY.ABSTRACT, KB_ENTRY.SOLUTION,
> KB_ENTRY.SOLUTION_HTML, KB_ENTRY.FREE_TEXT, KB_ENTRY.DATE_UPDATED,
> KB_ENTRY.STATUS_REF, KB_ENTRY.CALL_NUMBER, SU_ENTITY_TYPE.DISPLAY
> ENTRY_TYPE, KB_PROFILE.NAME PROFILE_TYPE, AR_PRIMARY_ASSET.ASSET_REF
> SERVICE_TYPE, AR_PERSON.FULL_NAME CONTRIBUTOR, IN_SYS_SOURCE.NAME SOURCE,
> KB_ENTRY_STATUS.NAME STATUS,(SELECT COUNT (CL_KB_REFER.CALL_NUMBER) FROM
> CL_KB_REFER WHERE CL_KB_REFER.ARTICLE_REF = KB_ENTRY.KNOWLEDGE_REF)
> LINK_RATE FROM KB_ENTRY, SU_ENTITY_TYPE, KB_PROFILE, AR_PRIMARY_ASSET,
> AR_PERSON, IN_SYS_SOURCE, KB_ENTRY_STATUS WHERE KB_ENTRY.PARTITION = 1 AND
> KB_ENTRY.STATUS = 'A' AND AR_PERSON.OFFICER_IND = 1 AND
> KB_ENTRY.CREATED_BY_REF = AR_PERSON.REF AND KB_ENTRY.SOURCE =
> IN_SYS_SOURCE.REF AND KB_ENTRY.STATUS_REF = KB_ENTRY_STATUS.REF AND
> KB_ENTRY_STATUS.STATUS = 'A' AND KB_ENTRY.PROFILE_REF = KB_PROFILE.REF AND
> KB_ENTRY.ITEM_REF = AR_PRIMARY_ASSET.ITEM_REF AND KB_ENTRY.ENTITY_TYPE =
> SU_ENTITY_TYPE.REF AND KB_ENTRY.KNOWLEDGE_REF='${dataimporter.delta.ID}'"
>
>
> Delta query: select KNOWLEDGE_REF as ID from KB_ENTRY where (DATE_UPDATED
> > '${dataimporter.last_index_time}' OR DATE_CREATED >
> '${dataimporter.last_index_time}')
>
>
> The problem here is that when I run the full import, everything works fine
> and all the fields/data are displayed fine in the search.
>
> However, when I run the delta import, for some records the ENTRY_TYPE field
> is not returned from the database.
>
> Let me illustrate it with an example:
>
> Search result After running Full Import:
>
> Record Name:John Doe
> Entry ID:500
> Entry Type:Worker
>
> Search result after running Delta import:
>
> Record Name:John Doe
> Entry ID:500
> Entry Type:
>
>
> FYI: I have run the Full and Delta import queries (though both are the same)
> in the SQL Server IDE, and both return the Entry Type field correctly.
>
> Not sure why the Entry Type field vanishes from Solr when the Delta import
> is run.
>
> Any idea why this would happen.
>
> Thanks,
>
> Aniket
>
>

Hi folks,

Anyone with any luck or knowledge on this?

Regards

Aniket


Solr Lazy startup - load-on-startup missing from web.xml?

2015-02-19 Thread Gili Nachum
Hi! Solr is starting up "dormant" for me, until a client wakes it up with a
REST request, or I open the admin UI; only then does the remaining
initialization happen.
Is it something known?

I can't see any load-on-startup in the web.xml, in Solr.war.
Running Solr 4.7.2 over WebSphere 8.5

App loading message as the server starts up:
[2/*16*/15 12:17:19:956 GMT] 0056 ApplicationMg A   WSVR0221I:
Application started: solr-4.7.2
[2/*16*/15 12:17:20:319 GMT] 0001 WsServerImpl  A   WSVR0001I:
Server serverSolr open for e-business
The next startup message in the log is on the next day, once I enter the
Solr admin UI:
[2/*17*/15 10:20:13:827 GMT] 0098 SolrDispatchF I
org.apache.solr.servlet.SolrDispatchFilter init SolrDispatchFilter.init()
...


Question on CloudSolrServer API

2015-02-19 Thread Manohar Sripada
Hi All,

I am using the CloudSolrServer API of the SolrJ library from my application to
query Solr. Here, I am creating a new connection to Solr for every search
that I do. Once I get the results, I close the connection.

Is this the correct way? How does Solr create connections internally? Does
it maintain a pool of connections (if so how to configure it)?

Thanks,
Manohar