Re: Doc's FunctionQuery result field in my custom SearchComponent class ?

2013-07-17 Thread Tony Mullins
Erick,
In freq:termfreq(product,'spider'), 'freq' is an alias for the termfreq()
function query, so I could have that field with the name 'freq' in the
document response.
This is the code I am using to get the document object, and there is no
termfreq field in its fields collection.

DocList docs = rb.getResults().docList;
DocIterator iterator = docs.iterator();
int sumFreq = 0;
String id = null;

for (int i = 0; i < docs.size(); i++) {
    try {
        int docId = iterator.nextDoc();

        // Document doc = searcher.doc(docId, fieldSet);
        Document doc = searcher.doc(docId);
        // doc.getFields() here contains only stored fields;
        // no 'freq' pseudo-field is present at this point
    } catch (IOException e) {
        // log and skip this document
    }
}

Thanks,
Tony


On Wed, Jul 17, 2013 at 5:30 PM, Erick Erickson wrote:

> Where are you getting the syntax
> freq:termfreq(product,'spider')
> ? Try just
>
> termfreq(product,'spider')
> you'll get an element in the doc labeled 'termfreq', at least
> I do.
>
> Best
> Erick
>
> On Tue, Jul 16, 2013 at 1:03 PM, Tony Mullins wrote:
> > OK, so that's why I cannot see the FunctionQuery fields in my
> > SearchComponent class.
> > So then the question would be: how can I apply my custom processing/logic
> > to these FunctionQuery fields? What's the extension point in Solr for such
> > scenarios?
> >
> > Basically I want to call termfreq() for each document, sum all the docs'
> > termfreq() results, and show the sum in one aggregated TermFreq field in
> > my query response.
> >
> > Thanks.
> > Tony
> >
> >
> >
> > On Tue, Jul 16, 2013 at 6:01 PM, Jack Krupansky wrote:
> >
> >> Basically, the evaluation of function queries in the "fl" parameter
> occurs
> >> when the response writer is composing the document results. That's AFTER
> >> all of the search components are done.
> >>
> >> SolrReturnFields.getTransformer() gets the DocTransformer, which is
> >> really a DocTransformers, and then a call to DocTransformers.transform()
> >> in each response writer will evaluate the embedded function queries and
> >> insert their values in the results as they are being written.
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Tony Mullins
> >> Sent: Tuesday, July 16, 2013 1:37 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Doc's FunctionQuery result field in my custom
> SearchComponent
> >> class ?
> >>
> >>
> >> No sorry, I am still not getting the termfreq() field in my 'doc' object.
> >> I do get the _version_ field in my 'doc' object, which I think is
> >> realValue=StoredField.
> >>
> >> At what point does termfreq() or any other FunctionQuery field become
> >> part of the doc object in Solr? And at that point can I perform some
> >> custom logic and append it to the response?
> >>
> >> Thanks.
> >> Tony
> >>
> >>
> >>
> >>
> >>
> >> On Tue, Jul 16, 2013 at 1:34 AM, Patanachai Tangchaisin
> >> <patanachai.tangchai...@wizecommerce.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I think the process of retrieving a stored field (through fl)
> >>> happens after the SearchComponents.
> >>>
> >>> One solution: if you wrap the q param with a function, your score will
> >>> be the result of the function.
> >>> For example,
> >>>
> >>> http://localhost:8080/solr/collection2/demoendpoint?q=termfreq%28product,%27spider%27%29&wt=xml&indent=true&fl=*,score
> >>>
> >>>
> >>>
> >>> Now your score is going to be a result of termfreq(product,'spider')
> >>>
> >>>
> >>> --
> >>> Patanachai Tangchaisin
> >>>
> >>>
> >>>
> >>> On 07/15/2013 12:01 PM, Tony Mullins wrote:
> >>>
> >>>  any help plz !!!
> 
> 
>  On Mon, Jul 15, 2013 at 4:13 PM, Tony Mullins
>  <tonymullins...@gmail.com> wrote:
> 
> 
> > Please, any help on how to get the value of the 'freq' field in my custom
> > SearchComponent?
> >
> >
> > http://localhost:8080/solr/collection2/demoendpoint?q=spider&wt=xml&indent=true&fl=*,freq:termfreq%28product,%27spider%27%29
> >
> >
> > <doc>
> >   <str name="...">11</str>
> >   <str name="...">Video Games</str>
> >   <str name="format">xbox 360</str>
> >   <str name="...">The Amazing Spider-Man</str>
> >   <str name="...">11</str>
> >   <long name="_version_">1439994081345273856</long>
> >   <int name="freq">1</int>
> > </doc>
> >
> >
> >
> > Here is my code
> >
> > DocList docs = rb.getResults().docList;
> >  DocIterator iterator = 

Exception when using File based and Index based SpellChecker

2013-07-17 Thread smanad
I am trying to use the file-based and index-based spellcheckers together, and I
am getting this exception: "All checkers need to use the same StringDistance."

They work fine as expected individually but not together. 
Any pointers?

-Manasi





Solr with Hadoop

2013-07-17 Thread Rajesh Jain
I have a newbie question on integrating Solr with Hadoop.

There are some vendors like Cloudera/MapR who have announced Solr Search
for Hadoop.

If I use the Apache distro, how can I use Solr Search on docs in HDFS/Hadoop?

Is there a tutorial on how to use it, or a getting-started guide?

I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
Solr to provide Search.

Does Solr Search trigger MapReduce jobs (like Splunk-Hunk does)?

Thanks,
Rajesh



Re: external file field and fl parameter

2013-07-17 Thread Chris Hostetter

: reference to a few "special" values, such as "id" and "score". Neither 
: of them is technically a "stored" field either, but afaik you don't need 
: to use "field(id), field(score)" for those.  Can you honestly say that 
: is consistent?

Nope. 

I wasn't defending the quirks of the API, or trying to give the impression 
that I thought it was consistent.  My goal was simply to try and explain 
what the example Alan gave you was actually doing, and why/how it worked 
-- so that you weren't left with the string "fl=field(eff_field_name)" as 
some magical, inexplicable black box.



-Hoss


Re: SolrCloud group.query error "shard X did not set sort field values" or how i can set fillFields=true on IndexSearcher.search

2013-07-17 Thread Chris Hostetter

You've found a general bug in the grouping code, and I've opened SOLR-5046 
to track it (no idea how hard it is to fix), but in general keep in mind 
the major caveat associated with grouping and distributed search ...

https://wiki.apache.org/solr/SolrCloud#Known_Limitations

"The Grouping feature only works if groups are in the same shard. You must 
use the custom sharding feature to use the Grouping feature. "



: Date: Mon, 15 Jul 2013 13:19:22 +0400
: From: Evgeny Salnikov 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: SolrCloud group.query error
: "shard X did not set sort field values" or how i can set fillFields=true
: on IndexSearcher.search
: 
: Thank you!
: I really need to eventually increase the number of shards, so I cannot
: directly use numShards=X; the only way out is SPLITSHARD, but then I
: encountered the following problem:
: 
: 1. run empty node1
: java -Dbootstrap_confdir=./solr/collection1/conf
: -Dcollection.configName=myconf -DzkRun -jar start.jar -DnumShards=1
: 2. run empty node2
: java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
: 3. cluster is - collection1 -> shard1 -> master (node1) and replica (node2)
: 4. add some data (10 docs)
: 5. http://node1:8983/solr/collection1/select?q=*:*
: 
: <response>
:   <lst name="responseHeader">
:     <int name="status">0</int>
:     <int name="QTime">5</int>
:     <lst name="params"><str name="q">*:*</str></lst>
:   </lst>
:   <result name="response" numFound="10" start="0">
:     <doc>...</doc>
:     <doc>...</doc>
:   </result>
: </response>
: 
: 6. try group.query
: 
http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=street:%D0%9A%D0%BE%D1%80%D0%BE%D0%BB%D0%B5%D0%B2%D0%B0
: 
: 
: <response>
:   <lst name="responseHeader">
:     <int name="status">0</int>
:     <int name="QTime">13</int>
:     <lst name="params">
:       <str name="q">*:*</str>
:       <str name="group.query">street:Королева</str>
:       <str name="group">true</str>
:     </lst>
:   </lst>
:   <lst name="grouped">
:     <lst name="street:Королева">
:       <int name="matches">10</int>
:       <result name="doclist" numFound="..." start="0">
:         <doc>
:           <str name="...">cdb1c990-d00c-4d2c-95ba-4f496e559be3</str>
:           <str name="street">Королева</str>
:           <str name="...">7</str>
:           <str name="...">62</str>
:           <str name="...">Сидоров</str>
:           <str name="...">Дела отлично!</str>
:           <long name="_version_">1440614179417358336</long>
:         </doc>
:       </result>
:     </lst>
:   </lst>
: </response>
: 
: 7. try split shard1
: 
http://node1:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1
: 
:  <response>
:    <lst name="responseHeader">
:      <int name="status">0</int>
:      <int name="QTime">9288</int>
:    </lst>
:    <lst name="success">
:      <lst>
:        <lst name="responseHeader"><int name="status">0</int><int name="QTime">2441</int></lst>
:        <str name="core">collection1_shard1_1_replica1</str>
:        <str name="saved">/home/evgenysalnikov/solrtest/node1/example/solr/solr.xml</str>
:      </lst>
:      <lst>
:        <lst name="responseHeader"><int name="status">0</int><int name="QTime">2479</int></lst>
:        <str name="core">collection1_shard1_0_replica1</str>
:        <str name="saved">/home/evgenysalnikov/solrtest/node1/example/solr/solr.xml</str>
:      </lst>
:      <!-- further status-0 entries with QTimes 5002, 5002 and 141 -->
:      <!-- EMPTY_BUFFER entries (QTime 0 and 1) for
:           collection1_shard1_0_replica1 and collection1_shard1_1_replica1 -->
:      <lst>
:        <lst name="responseHeader"><int name="status">0</int><int name="QTime">2515</int></lst>
:        <str name="core">collection1_shard1_1_replica2</str>
:        <str name="saved">/home/evgenysalnikov/solrtest/node2/example/solr/solr.xml</str>
:      </lst>
:      <lst>
:        <lst name="responseHeader"><int name="status">0</int><int name="QTime">2554</int></lst>
:        <str name="core">collection1_shard1_0_replica2</str>
:        <str name="saved">/home/evgenysalnikov/solrtest/node2/example/solr/solr.xml</str>
:      </lst>
:      <!-- final status-0 entries with QTimes 4001 and 4002 -->
:    </lst>
:  </response>
: 
: 8. Cluster state changed to
: shard1 - master (inactive),
: shard1 - slave (inactive),
: shard1_0 - master,
: shard1_0 - slave,
: shard1_1 - master,
: shard1_1 - slave
: 9. Commit http://node1:8983/solr/collection1/update?commit=true
: 10. Reloading http://node1:8983/solr/collection1/select?q=*:* gives me
: different results: numFound 5, 0, 10 (I added 10 docs)
: Node2 core info is
: collection1 - shard1 - 10 docs
: collection1_shard1_0_replica2 - 0 docs
: collection1_shard1_1_replica2 - 0 docs
: 11. I restart node2
:Node2 core info is
:collection1 - shard1 - 10 docs
:collection1_shard1_0_replica2 - 5 docs
:collection1_shard1_1_replica2 - 5 docs
: 12. http://node1:8983/solr/collection1/select?q=*:* always gives the
: correct result - 10 documents
: 
: But
: 
http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=street:%D0%9A%D0%BE%D1%80%D0%BE%D0%BB%D0%B5%D0%B2%D0%B0
: returns the familiar error
: shard 0 did not set sort field values (FieldDoc.fields is null); you must
: pass fillFields=true to IndexSearcher.search on each shard
: 
: Did I somehow run SPLITSHARD incorrectly?
: 
: 
: Also, I once tried specifying the number of shards as 2:
: 1. run empty node1
: java -Dbootstrap_confdir=./solr/collection1/conf
: -Dcollection.configName=myconf -DzkRun -jar start.jar -DnumShards=2
: 2. run empty node2
: java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
: 3. cluster is - collection1 -> shard1 -> master (node1) and collection1 ->
: shard2 -> master (node2)
: 4. add some data (10 docs)
: 5. http://node1:8983/solr/collection1/select?q=*:*
: 
: <response>
:   <lst name="responseHeader">
:     <int name="status">0</int>
:     <int name="QTime">5</int>
:     <lst name="params"><str name="q">*:*</str></lst>
:   </lst>
:   <result name="response" numFound="10" start="0">
:     <doc>...</doc>
:     <doc>...</doc>
:   </result>
: </response>
: 
: 6. try group.query
: 
http://node1:8983/solr/collection1/select?q=*:*&group=

Re: external file field and fl parameter

2013-07-17 Thread Chris Collins
Chris, the confusion from my perspective is the general inconsistency and 
natural growth of the API, which is somewhat expected given its history.

Obviously this isn't SQL; there is no ANSI body defining the query language.  I 
understand well the difference between stored, indexed, etc. 

Going off of the Apache wiki docs (which perhaps is not the correct place to go 
for documentation, but it's what Google gives me :-})

http://wiki.apache.org/solr/CommonQueryParameters

The fl parameter doesn't actually mention stored.  It actually gives reference 
to a few "special" values, such as "id" and "score". Neither of them is 
technically a "stored" field either, but afaik you don't need to use "field(id), 
field(score)" for those.  Can you honestly say that is consistent?

So

 
On Jul 17, 2013, at 5:30 PM, Chris Hostetter  wrote:

> 
> : Yes that worked, thanks Alan.  The consistency of this api is "challenging".
> 
> It's important to understand what's happening here.
> 
> fl, by default, only returns "stored" fields -- but you can also request 
> "pseudo-fields" such as the results of functions, or the result of a "Doc 
> Transformer" ...
> 
> http://wiki.apache.org/solr/CommonQueryParameters#fl
> 
> On the other side of things, the ExternalFileField has some very special 
> behavior that allows it to be used as the input to a function, but it does 
> not act as a true stored or indexed field -- it's completely external 
> to the index...
> 
> https://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/schema/ExternalFileField.html
> 
> The syntax Alan suggested tells the solr response writer to generate a 
> pseudo-field for each doc in the response which contains the results of a 
> function call -- that function just so happens to be a simple field()
> function that returns the numeric value of the specified field name -- 
> which works for EFF since (as mentioned before) EFF can be used as the 
> input to any function.
> 
> 
> -Hoss
> 



Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-17 Thread Shawn Heisey
On 7/17/2013 8:06 AM, Furkan KAMACI wrote:
> I have crawled some web pages and indexed them at my SolrCloud (Solr 4.2.1).
> However, before I indexed them there were already some documents in the
> index. I can calculate the difference between the current and previous
> document count, but that doesn't mean I have indexed that count of
> documents, because URLs of websites are unique IDs in my system. So some
> documents were updated and did not increase the document count.
> 
> My question is that: How can I learn the total count of how many documents
> indexed and how many documents updated?

Look at the update handler statistics.  Your application should record
the numbers there, then you can check the handler statistics again and
note the differences.  Here's a URL that can give you those statistics.

http://server:port/solr/mycollectionname/admin/mbeans?stats=true

They are also available in the UI on the UPDATEHANDLER section of
Plugins / Stats, but you can't really use that in a program.

By setting the request handler path on a query object to /admin/mbeans
and setting the stats parameter, you can get this information with SolrJ.
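
For example, a minimal SolrJ 4.x sketch (the core URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

HttpSolrServer server =
    new HttpSolrServer("http://server:port/solr/mycollectionname");
SolrQuery q = new SolrQuery();
q.setRequestHandler("/admin/mbeans"); // route the request to the mbeans handler
q.set("stats", "true");               // ask for the statistics
QueryResponse rsp = server.query(q);
// the per-handler stats come back under the "solr-mbeans" entry
NamedList<?> mbeans = (NamedList<?>) rsp.getResponse().get("solr-mbeans");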

Thanks,
Shawn



Re: external file field and fl parameter

2013-07-17 Thread Chris Hostetter

: Yes that worked, thanks Alan.  The consistency of this api is "challenging".

It's important to understand what's happening here.

fl, by default, only returns "stored" fields -- but you can also request 
"pseudo-fields" such as the results of functions, or the result of a "Doc 
Transformer" ...

http://wiki.apache.org/solr/CommonQueryParameters#fl

On the other side of things, the ExternalFileField has some very special 
behavior that allows it to be used as the input to a function, but it does 
not act as a true stored or indexed field -- it's completely external 
to the index...

https://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/schema/ExternalFileField.html

The syntax Alan suggested tells the solr response writer to generate a 
pseudo-field for each doc in the response which contains the results of a 
function call -- that function just so happens to be a simple field()
function that returns the numeric value of the specified field name -- 
which works for EFF since (as mentioned before) EFF can be used as the 
input to any function.
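
For example, something like:

  fl=id,score,field(eff_field_name)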


-Hoss


Re: MoinMoin Dump

2013-07-17 Thread Chris Hostetter

: There was a thread about viewing the Solr Wiki offline about 6 months ago. I'm
: interested, too.
: 
: It seems that a manual (cron?) dump will do the work...
: 
: Would it be too much to ask that one of the admins will manually create
: such a dump? (http://moinmo.in/HelpOnMoinCommand/ExportDump)

No one (that I know of) involved with Lucene/Solr has the shell access 
needed to do this.

Even if we did, the general policy of the ASF is that because the content 
of the MoinMoin wiki is created ad-hoc by the community, w/o any audit 
trail or clear grant of license on the content written by the community 
members at large, we can't "distribute" it.

Moving forward, we have a new Solr Reference Guide that we will be 
officially releasing in PDF form for each minor release of Solr, 
starting with 4.4...

  https://cwiki.apache.org/confluence/display/solr/
  https://issues.apache.org/jira/browse/SOLR-4618

...and we will be phasing out the use of MoinMoin for "reference" 
documentation; it will just contain the more organic, non-release-specific 
types of documentation, as well as tips & tricks from community members...

https://cwiki.apache.org/confluence/display/solr/Internal+-+Maintaining+Documentation#Internal-MaintainingDocumentation-WhatShouldandShouldNotbeIncludedinThisDocumentation



-Hoss


Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-17 Thread Roman Chyla
Hello Oleg,


On Wed, Jul 17, 2013 at 3:49 PM, Oleg Burlaca  wrote:

> Hello Roman and all,
>
> > sorry, haven't read the previous thread in its entirety, but a few weeks
> > back Yonik's proposal got implemented, it seems ;)
>
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>
> In that post I see a reference to your plugin BitSetQParserPlugin, right ?
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java
>
> I understood it as follows:
> 1. query the core and get ALL search results,
>search results == (id1, id2, id7 .. id28263)   // a long arrays of
> Unique IDs
> 2. Generate a bitset from this array of IDs
> 3. search a core using a bitsetfilter
>
> Correct?
>

yes, the BitSetQParserPlugin does the 3rd step

the unittest may explain it better:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
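
As a rough Java sketch of steps 1-2 (illustrative only; the exact encoding
the plugin expects is what the unittest shows, and the fq syntax in the last
comment is just an assumption):

import java.util.BitSet;
import org.apache.commons.codec.binary.Base64;

BitSet bits = new BitSet();
for (int luceneDocId : new int[] {1, 2, 7, 28263}) {
  bits.set(luceneDocId);  // step 2: turn the collected IDs into a bitset
}
String payload = Base64.encodeBase64String(bits.toByteArray()); // Java 7+
// step 3 would then send it back as a filter, e.g. fq={!bitset}<payload>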



>
> I was thinking that pseudo-joins could help exactly with this situation
> (actually, I haven't even tried pseudo-joins yet, still watching the mailing
> list), i.e. to make the first step efficient and at the same time perform a
> second query without sending a lot of data to the client and then receiving
> this data back.
>
> I have a feeling that such a situation - a list of unique IDs from query1
> participates in a filter in query2 - happens frequently, and it would be
> very useful if SOLR had an optimized approach to handle it.
> mmm, it would transform the pseudo-join into a real JOIN like in the SQL
> world.
>
> I think I'll just test to see the performance of pseudo-joins with large
> datasets (was waiting to find the perfect solution).
>

I'd be very curious - if you do some experiments, please let us know. Thanks,

roman


>
> Thanks for all the ideas/links, now I have a better view of the situation.
>
> Regards.
>
>
>
>
> On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson  >wrote:
>
> > Roman:
> >
> > I think that SOLR-1913 is completely different. It's
> > about having a field in a document and being able
> > to do bitwise operations on it. So say I have a
> > field in a Solr doc with the value 6 in it. I can then
> > form a query like
> > {!bitwise field=myfield op=AND source=2}
> > and it would match.
> >
> > You're talking about a much different operation as I
> > understand it.
> >
> > In which case, go ahead and open up a JIRA, there's
> > no harm in it.
> >
> > Best
> > Erick
> >
> > On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla 
> > wrote:
> > > Erick,
> > >
> > > I wasn't sure this issue is important, so I wanted to first solicit some
> > > feedback. You and Otis expressed interest, and I could create the JIRA
> -
> > > however, as Alexandre, points out, the SOLR-1913 seems similar
> (actually,
> > > closer to the Otis request to have the elasticsearch named filter) but
> > the
> > > SOLR-1913 was created in 2010 and is not integrated yet, so I am
> > wondering
> > > whether this new feature (somewhat overlapping, but still different
> from
> > > SOLR-1913) is something people would really want and the effort on the
> > JIRA
> > > is well spent. What's your view?
> > >
> > > Thanks,
> > >
> > >   roman
> > >
> > >
> > >
> > >
> > > On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
> > > wrote:
> > >
> > >> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
> > >>
> > >> Regards,
> > >>Alex.
> > >>
> > >> Personal website: http://www.outerthoughts.com/
> > >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > >> - Time is the quality of nature that keeps events from happening all
> at
> > >> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> > book)
> > >>
> > >>
> > >> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <
> > erickerick...@gmail.com
> > >> >wrote:
> > >>
> > >> > Roman:
> > >> >
> > >> > Did this ever make into a JIRA? Somehow I missed it if it did, and
> > this
> > >> > would
> > >> > be pretty cool
> > >> >
> > >> > Erick
> > >> >
> > >> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla  >
> > >> > wrote:
> > >> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca  >
> > >> > wrote:
> > >> > >
> > >> > >> Hello Erick,
> > >> > >>
> > >> > >> > Join performance is most sensitive to the number of values
> > >> > >> > in the field being joined on. So if you have lots and lots of
> > >> > >> > distinct values in the corpus, join performance will be
> affected.
> > >> > >> Yep, we have a list of unique Id's that we get by first searching
> > for
> > >> > >> records
> > >> > >> where loggedInUser IS IN (userIDs)
> > >> > >> This corpus is stored in memory I suppose? (not a problem) and
> then
> > >> the
> > >> > >> bottleneck is to match this huge set with the core where I'm
> > >> searching?
> > >> > >>
> > >> > >> Somewhere in maillist archive people were talking about "external
> > list
> > >> > of
> > >> > >> Solr unique IDs"
> > >> > >> but didn't find if there is a s

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi Dave,



On Wed, Jul 17, 2013 at 2:03 PM, dmarini  wrote:

> Roman,
>
> As a developer, I understand where you are coming from. My issue is that I
> specialize in .NET, haven't done java dev in over 10 years. As an
> organization we're new to solr (coming from endeca) and we're looking to
> use
> it more across the organization, so for us, we are looking to do the
> classic
> time/payoff justification for most features that are causing a bit of
> friction. I have seen custom query parsers that are out there that seem
> like
> they will do what we're looking to do, but I worry that they might fix a
> custom case and not necessarily work for us.
>

I was in the same position 2 years back; that's why I developed the
ANTLR query parser (before that, I went through the phase of hacking
different query parsers, but it was always obvious to me that it cannot work
for anything but simple cases)


>
> Also, Roman, are you suggesting that I can have an indexed document titled
> "hubble telescope" and as long as I separate multi-word synonyms with the
> null character \0 in the synonyms.txt file the query expansion will just
> work? if so, that would suffice for our needs.. can you elaborate or will

> the query parser still foil the system. I ask because I've seen instances
>

First, a bit of explanation of how indexing/tokenization operates:

input text: "hubble space telescope is in the space"

let's say we are tokenizing on empty space and we use stopwords; this is
what gets indexed:

hubble
space
telescope
space

these tokens can have different positions, but let's ignore that for a
moment - the first three are adjacent


> where I can use the admin analysis tool against a custom field type to
> expand a multi-word synonym where it appears it's expanding the terms
> properly but when I run a search against it using the actual handler, it
> doesn't behave the same way and the debugQuery shows that indeed it split
> my
> term and did not expand it.
>

this is because the solr analysis tool is seeing the whole input as one
string "hubble space telescope", WHILST the standard query parser first
tokenizes, then builds the query *out of every token* - so it is seeing 3
tokens instead of 1 big token, and builds the following query

field:hubble field:space field:telescope field:space

HOWEVER, when you send the phrase query, it arrives as one token - the
synonym filter will see it, it will recognize it as a multi-token synonym
and it will expand it

BUT, the standard behaviour is to insert the new token into the position of
the first token, so you will get a phrase query

"(hubble | HST) space telescope space"

So really, the problem of the multi-token synonym expansion is in essence a
problem of a query parser - it must know how to harvest tokens, expand
them, and how to build a proper query - in this case, the HST [one token]
spans over 3 original tokens, so the parser must be smart enough to build:

"hubble space telescope space" OR "HST in the space"

So, the synonym expansion part is standard FST, already in the Lucene/SOLR
core. The parser that can handle these cases (and not just them, but also
many others) is also inside Lucene - it is called 'flexible' and was
contributed by IBM a few years back. But so far it has been a sleeping beauty.
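
For illustration, a minimal Lucene 4.x sketch of building the same
multi-token mapping in code (the \u0000 is SynonymMap.WORD_SEPARATOR, the
same null byte as in the synonyms.txt trick; this only demonstrates
recognition, not the query building):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries
builder.add(new CharsRef("hubble\u0000space\u0000telescope"),
            new CharsRef("HST"), true);                    // keep the original
SynonymMap map = builder.build();
TokenStream ts = new SynonymFilter(
    new WhitespaceTokenizer(Version.LUCENE_43,
        new StringReader("hubble space telescope")),
    map, true); // ignoreCase
// consuming ts now yields HST stacked at the position of 'hubble'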

I haven't seen the LucidWorks parser, but from the description it seems it does
a much better job than the standard parser (if, when you do a quoted phrase
search for "hubble space telescope in the space" and the result is: "hubble
space telescope space" OR "HST in the space", you can be reasonably sure it
does everything - well, to be 100% sure: "HST in the space" should also
produce the same query; but that's a much longer discussion about
index-time XOR query-time analysis)

roman



>
> Jack,
>
> Is there a link where I can read more about the LucidWorks search parser
> and
> how we can perchance tie into that so I can test to see if it yields better
> results?
>
> Thanks again for the help and suggestions. As an organization, we've
> learned
> much of solr since we started in 4.1 (especially with the cloud). The devs
> are doing phenomenal work and my query is really meant more as confirmation
> that I'm taking the correct approach than to beg for a specific feature :)
>
> --Dave
>
>
>
>


Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-17 Thread Ali, Saqib
Thanks Erick!

I have added the instructions for running SolrCloud on Jboss:
http://wiki.apache.org/solr/SolrCloud%20using%20Jboss

I will refine the instructions further, and also post some screenshots.

Thanks.


On Sun, Jul 14, 2013 at 5:05 AM, Erick Erickson wrote:

> Done, sorry it took so long, hadn't looked at the list in a couple of days.
>
>
> Erick
>
> On Fri, Jul 12, 2013 at 5:46 PM, Ali, Saqib  wrote:
> > username: saqib
> >
> >
> > On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib 
> wrote:
> >
> >> Hello,
> >>
> >> Can you please add me to the ContributorsGroup? I would like to add
> >> instructions for setting up SolrCloud using Jboss.
> >>
> >> thanks.
> >>
> >>
>


Re: How to optimize a search?

2013-07-17 Thread Alexandre Rafalovitch
So does the example! Anyway, this is just an attempt to give additional
options.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Jul 17, 2013 at 4:14 PM, padcoe  wrote:

> I'm using Solr 3!
>
>
>
>


Re: How to optimize a search?

2013-07-17 Thread padcoe
I'm using Solr 3!





Re: How to optimize a search?

2013-07-17 Thread Alexandre Rafalovitch
Not fully following the problem, but is it similar to:
http://robotlibrarian.billdueber.com/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/
 ?

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Jul 17, 2013 at 4:03 PM, padcoe  wrote:

> Erick,
>
> Awesome answer, buddy. I totally agree with you.
>
> Right now, I'm facing this problem... just someone waving their hands and
> saying "because I
> like it better"..
>
>
>
>


Re: How to optimize a search?

2013-07-17 Thread padcoe
Erick,

Awesome answer, buddy. I totally agree with you.

Right now, I'm facing this problem... just someone waving their hands and
saying "because I
like it better"..





MoinMoin Dump

2013-07-17 Thread Isaac Hebsh
Hi,

There was a thread about viewing the Solr Wiki offline about 6 months ago. I'm
interested, too.

It seems that a manual (cron?) dump will do the work...

Would it be too much to ask that one of the admins will manually create
such a dump? (http://moinmo.in/HelpOnMoinCommand/ExportDump)

Otis, is there any progress made on this in Apache Infra?


Re: How to optimize a search?

2013-07-17 Thread padcoe
How do I use fuzzy? Could you give an example, please?





Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-17 Thread Oleg Burlaca
Hello Roman and all,

> sorry, haven't read the previous thread in its entirety, but a few weeks
> back Yonik's proposal got implemented, it seems ;)
http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter

In that post I see a reference to your plugin BitSetQParserPlugin, right ?
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java

I understood it as follows:
1. Query the core and get ALL search results,
   search results == (id1, id2, id7 .. id28263)   // a long array of
   unique IDs
2. Generate a bitset from this array of IDs
3. Search a core using a bitset filter

Correct?

I was thinking that pseudo-joins could help exactly with this situation
(actually, I haven't even tried pseudo-joins yet, still watching the mailing
list), i.e. to make the first step efficient and at the same time perform a
second query without sending a lot of data to the client and then receiving
this data back.

I have a feeling that such a situation - a list of unique IDs from query1
participates in a filter in query2 - happens frequently, and it would be very
useful if SOLR had an optimized approach to handle it.
mmm, it would transform the pseudo-join into a real JOIN like in the SQL world.

I think I'll just test to see the performance of pseudo-joins with large
datasets (was waiting to find the perfect solution).

Thanks for all the ideas/links, now I have a better view of the situation.

Regards.




On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson wrote:

> Roman:
>
> I think that SOLR-1913 is completely different. It's
> about having a field in a document and being able
> to do bitwise operations on it. So say I have a
> field in a Solr doc with the value 6 in it. I can then
> form a query like
> {!bitwise field=myfield op=AND source=2}
> and it would match.
>
> You're talking about a much different operation as I
> understand it.
>
> In which case, go ahead and open up a JIRA, there's
> no harm in it.
>
> Best
> Erick
>
> On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla 
> wrote:
> > Erick,
> >
> > I wasn't sure this issue is important, so I wanted to first solicit some
> > feedback. You and Otis expressed interest, and I could create the JIRA -
> > however, as Alexandre, points out, the SOLR-1913 seems similar (actually,
> > closer to the Otis request to have the elasticsearch named filter) but
> the
> > SOLR-1913 was created in 2010 and is not integrated yet, so I am
> wondering
> > whether this new feature (somewhat overlapping, but still different from
> > SOLR-1913) is something people would really want and the effort on the
> JIRA
> > is well spent. What's your view?
> >
> > Thanks,
> >
> >   roman
> >
> >
> >
> >
> > On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
> > wrote:
> >
> >> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
> >>
> >> Regards,
> >>Alex.
> >>
> >> Personal website: http://www.outerthoughts.com/
> >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >> - Time is the quality of nature that keeps events from happening all at
> >> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> >>
> >>
> >> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <
> erickerick...@gmail.com
> >> >wrote:
> >>
> >> > Roman:
> >> >
> >> > Did this ever make into a JIRA? Somehow I missed it if it did, and
> this
> >> > would
> >> > be pretty cool
> >> >
> >> > Erick
> >> >
> >> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla 
> >> > wrote:
> >> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca 
> >> > wrote:
> >> > >
> >> > >> Hello Erick,
> >> > >>
> >> > >> > Join performance is most sensitive to the number of values
> >> > >> > in the field being joined on. So if you have lots and lots of
> >> > >> > distinct values in the corpus, join performance will be affected.
> >> > >> Yep, we have a list of unique Id's that we get by first searching
> for
> >> > >> records
> >> > >> where loggedInUser IS IN (userIDs)
> >> > >> This corpus is stored in memory I suppose? (not a problem) and then
> >> the
> >> > >> bottleneck is to match this huge set with the core where I'm
> >> searching?
> >> > >>
> >> > >> Somewhere in maillist archive people were talking about "external
> list
> >> > of
> >> > >> Solr unique IDs"
> >> > >> but didn't find if there is a solution.
> >> > >> Back in 2010 Yonik posted a comment:
> >> > >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
> >> > >>
> >> > >
> >> > > sorry, haven't the previous thread in its entirety, but few weeks
> back
> >> > that
> >> > > Yonik's proposal got implemented, it seems ;)
> >> > >
> >> > >
> >> >
> >>
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
> >> > >
> >> > > You could use this to send very large bitset filter (which can be
> >> > > translated into any integers, if you can come up with a mapping
> >> > function).
> >> > >
> >> > > roman
> >> > >
> >> > >
> >> > >>
> >> > >> > bq: I su

Re: solr with java service wrapper

2013-07-17 Thread Alexandre Rafalovitch
Which Operating System? I have a write up for Windows:
http://blog.outerthoughts.com/2013/07/setting-up-apache-solr-on-windows-as-a-service/

To search the mailing list, there are several options; try
http://search-lucene.com/ and narrow down by mailing lists/keywords.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Jul 17, 2013 at 2:23 PM, Katie McCorkell
wrote:

> Hello,
>
> I was wondering if people had experience using solr with jetty and a java
> service wrapper for automatic deployment? I thought a service wrapper might
> be included in the solr download, but I didn't see one.
>
> How does one search the mailing list archive? Are there any previous topics
> about this you could lead me to ? (I don't have specific questions yet)
>
> Thanks!!
>


solr with java service wrapper

2013-07-17 Thread Katie McCorkell
Hello,

I was wondering if people had experience using solr with jetty and a java
service wrapper for automatic deployment? I thought a service wrapper might
be included in the solr download, but I didn't see one.

How does one search the mailing list archive? Are there any previous topics
about this you could lead me to ? (I don't have specific questions yet)

Thanks!!


Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Jack Krupansky

LucidWorks Search:

http://docs.lucidworks.com/display/lweug21/Synonyms%2C+Stop+Words%2C+and+Stemming

"There can be an unlimited number of terms and phrases which are defined as 
synonyms. If the Lucid query parser encounters any of those terms or phrases 
in a query term list, additional (optional) clauses will be automatically 
added to the user query so that the query will match either the specified 
term or phrase or any of the synonym terms or phrases."


It just works.

Phrase just means a sequence of terms with no intervening operators, no 
quotes (otherwise it would be a quoted phrase.)


-- Jack Krupansky

-Original Message- 
From: dmarini

Sent: Wednesday, July 17, 2013 2:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

Roman,

As a developer, I understand where you are coming from. My issue is that I
specialize in .NET, haven't done java dev in over 10 years. As an
organization we're new to solr (coming from endeca) and we're looking to use
it more across the organization, so for us, we are looking to do the classic
time/payoff justification for most features that are causing a bit of
friction. I have seen custom query parsers that are out there that seem like
they will do what we're looking to do, but I worry that they might fix a
custom case and not necessarily work for us.

Also, Roman, are you suggesting that I can have an indexed document titled
"hubble telescope" and as long as I separate multi-word synonyms with the
null character \0 in the synonyms.txt file the query expansion will just
work? if so, that would suffice for our needs.. can you elaborate or will
the query parser still foil the system. I ask because I've seen instances
where I can use the admin analysis tool against a custom field type to
expand a multi-word synonym where it appears it's expanding the terms
properly but when I run a search against it using the actual handler, it
doesn't behave the same way and the debugQuery shows that indeed it split my
term and did not expand it.

Jack,

Is there a link where I can read more about the LucidWorks search parser and
how we can perchance tie into that so I can test to see if it yields better
results?

Thanks again for the help and suggestions. As an organization, we've learned
much of solr since we started in 4.1 (especially with the cloud). The devs
are doing phenomenal work and my query is really meant more as confirmation
that I'm taking the correct approach than to beg for a specific feature :)

--Dave






Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread dmarini
Roman,

As a developer, I understand where you are coming from. My issue is that I
specialize in .NET, haven't done java dev in over 10 years. As an
organization we're new to solr (coming from endeca) and we're looking to use
it more across the organization, so for us, we are looking to do the classic
time/payoff justification for most features that are causing a bit of
friction. I have seen custom query parsers that are out there that seem like
they will do what we're looking to do, but I worry that they might fix a
custom case and not necessarily work for us. 

Also, Roman, are you suggesting that I can have an indexed document titled
"hubble telescope" and as long as I separate multi-word synonyms with the
null character \0 in the synonyms.txt file the query expansion will just
work? if so, that would suffice for our needs.. can you elaborate or will
the query parser still foil the system. I ask because I've seen instances
where I can use the admin analysis tool against a custom field type to
expand a multi-word synonym where it appears it's expanding the terms
properly but when I run a search against it using the actual handler, it
doesn't behave the same way and the debugQuery shows that indeed it split my
term and did not expand it.

Jack, 

Is there a link where I can read more about the LucidWorks search parser and
how we can perchance tie into that so I can test to see if it yields better
results?

Thanks again for the help and suggestions. As an organization, we've learned
much of solr since we started in 4.1 (especially with the cloud). The devs
are doing phenomenal work and my query is really meant more as confirmation
that I'm taking the correct approach than to beg for a specific feature :)

--Dave





Re: How to pass null OR empty values to fq?

2013-07-17 Thread Chris Hostetter

You've asked this question several times w/o really providing a clear 
concrete set of examples of what you are trying to do ... in several of 
your duplicated threads, people have suggested using the switch qparser, 
and you've dismissed that suggestion for various reasons that you also 
haven't fully explained.

Here now you've provided an example of what kind of syntax you seem to 
want to support (w/o really going into full details of how you want it to 
behave in various edge cases) but it doesn't really make sense to me for 
a variety of reasons...


1) you've said your "optional" parser shouldn't throw an error if the 
input variable is not specified, but you haven't said what you do want it 
to do instead ... match all docs? match no docs? ... this detail matters a 
lot, especially when you embed it in a nested boolean query like your 
example


2) assuming you want to match all docs when the variable isn't specified, 
then your "optional" example seems to directly match the simplest usecase 
of the switch parser...

  a) add an invariant that handle the parser options that you want 
 to use when the variable ($where) is specified) ...

 
 (
_query_:"{!dismax df='addr' qs=1 v=$where}"^6.2 OR
_query_:"{!dismax df='addr_i' qs=1 v=$where}"^6.2
 )
 

  b) use our variable as the 'default' in a switch parser instance, 
 and specify case='*:*' to match all docs when the variable
 isn't specified...

  {!switch case=*:* default=$where_clause v=$where}


3) unrelated to your question, the example you've provided makes very 
little sense to me because of how you are using the dismax parser with 
: only a single field in the "qf", but then combining multiple instances 
(with diff fields and diff boosts but the same query string) using a 
wrapper query.

: I suspect that in general what you really want instead of things like 
this...

 (
   _query_:"{!dismax qf=person_name v=$fname}"^3.9 OR
   _query_:"{!dismax qf=name_phonetic_i v=$fname}"^0.9 OR
 )


: ...is something like this...
: 
:   _query_:"{!dismax tie=1.0 qf='person_name^3.9 name_phonetic_i^0.9' v=$fname}"



: Date: Mon, 15 Jul 2013 09:24:56 -0700 (PDT)
: From: SolrLover
: Subject: Re: How to pass null OR empty values to fq?
: 
: Jack,
: 
: First, thanks a lot for your response.
: 
: We hardcode certain queries directly in the search component, as it's easy for
: us to make changes to the query from the SOLR side compared to changing the
: applications (many applications - mobile, desktop, etc. - use a single SOLR
: instance). We don't want to change the code which forms the query every time
: the query changes; rather, just changing the query in SOLR should do the
: job... The search team controls the boost and other matching criteria, hence
: the search team changes the boost more often without affecting the
: application... Now whenever a particular value is not passed in the query, we
: are trying to do a pass-through so that the entire query doesn't fail (we
: pass through only when the custom plugin is used along with the query - for
: ex: !optional is the custom plugin that shouldn't throw any error if a value
: for any particular variable is not present)...
: 
: 
: 
: (
:   _query_:"{!dismax qf=lname_i v=$lname}"^8.3 OR
:   _query_:"{!dismax qf=lname_phonetic v=$lname}"^8.6
: )
: (
:   _query_:"{!optional df='addr' qs=1 v=$where}"^6.2 OR
:   _query_:"{!optional df='addr_i' qs=1 v=$where}"^6.2 
: )
: (
:   _query_:"{!dismax qf=person_name v=$fname}"^3.9 OR
:   _query_:"{!dismax qf=name_phonetic_i v=$fname}"^0.9 OR
: )
: 
:   
: 
: 
: 
: 
: --
: 

-Hoss


Returning Hierarchical / Data Relationships In Solr 3.6 (or Solr 4 via Solr Join)

2013-07-17 Thread Mike L.
 
Solr User Group,
 
 I would like to return a hierarchical data relationship when somebody 
queries for a parent doc in Solr. This sort of relationship doesn't currently 
exist in our core, as the use-case has been to search for a specific document 
only. However, here's roughly an example of what's being asked (not the same 
kind of relationship, but a similar concept; there will always be only 1 
parent to many children):
 
A user searches for a parent value and also gets child docs as part of the 
response (not child names as multi-valued fields).
 
For example, say: select?qt=parentvalueearch&q=[parentValue]
 
 
    
<doc>
  <str name="...">1</str>
  <str name="...">[parentValue]</str>
  <str name="...">parent</str>
  <str name="...">John</str>
  <str name="...">Doe</str>
  <str name="...">M</str>
</doc>
<doc>
  <str name="...">2</str>
  <str name="...">child</str>
  <str name="...">Chris</str>
  <str name="...">Doe</str>
  <str name="...">M</str>
</doc>
<doc>
  <str name="...">3</str>
  <str name="...">child</str>
  <str name="...">Stacy</str>
  <str name="...">Doe</str>
  <str name="...">F</str>
</doc>

 
At first I was thinking I could just add a field within each child doc to 
represent the parentValue. However, this family relationship is a bit more 
complex, as children can be associated with many different parents (parent 
docs), so I don't want to tie the relationship off the child.

On the flip side, it seems I could have a multi-valued field with all the 
child names within the parent doc and then re-query the core for the child 
docs and append them to the response. The caveat there is that this parent 
may have a few hundred children, and I'm not sure a multi-valued field would 
make sense to store the children references. This approach would also 
dramatically increase the response time from on average 20ms to ~4sec, 
assuming a parent has 200 children.

Anybody solve a similar issue, or have thoughts on the best way to tackle 
this with version 3.6? Also, could the Solr Joins introduced in 4.X address 
this issue? (Not too familiar with it, but it seems to be related.)
 
Thanks in advance!
Mike

Re: Ability to specify the server where shard splits go?

2013-07-17 Thread Timothy Potter
Ok, thanks for the answer Yonik. After looking closer at the index
splitting code, definitely seems like you wouldn't want to pay the
network I/O cost when creating the sub-shard indexes. Might be cool to
be able to specify a different local disk path for the new cores so
that we can get some extra disks working in parallel during the split
(icing on the cake of course).

Cheers,
Tim

On Wed, Jul 17, 2013 at 10:40 AM, Yonik Seeley  wrote:
> On Wed, Jul 17, 2013 at 12:26 PM, Timothy Potter  wrote:
>> This is not a problem per se, just want to verify that we're not able
>> to specify which server shard splits are created as of 4.3.1? From
>> what I've seen, the new cores for the sub-shards are created on the
>> leader of the shard being split.
>>
>> Of course it's easy enough to migrate the new sub-shards to another
>> node after the fact especially since replication occurs automatically
>> for the splits.
>>
>> Seems like if the shard being split is large enough that doing the
>> split on the same node could cause some resource issues so might be
>> better to do the split on another server. Or is my assumption that the
>> split operation is pretty expensive incorrect?
>
> I think it will be mostly IO - it may or may not be expensive
> depending on how IO bound your box already is.
>
> Splitting directly to a different server would be cool, but would
> seem to require some sort of Directory implementation that streams
> things over the network rather than just storing locally on disk.  It's
> something I think we want in the future, but was a bit too much to
> bite off for the first iteration of this feature.
>
>> Lastly, also seems like we don't have control over where the replicas
>> of the split shards go?
>
> Seems like a good idea to optionally allow this...
>
> -Yonik
> http://lucidworks.com


Re: Ability to specify the server where shard splits go?

2013-07-17 Thread Yonik Seeley
On Wed, Jul 17, 2013 at 12:26 PM, Timothy Potter  wrote:
> This is not a problem per se, just want to verify that we're not able
> to specify which server shard splits are created as of 4.3.1? From
> what I've seen, the new cores for the sub-shards are created on the
> leader of the shard being split.
>
> Of course it's easy enough to migrate the new sub-shards to another
> node after the fact especially since replication occurs automatically
> for the splits.
>
> Seems like if the shard being split is large enough that doing the
> split on the same node could cause some resource issues so might be
> better to do the split on another server. Or is my assumption that the
> split operation is pretty expensive incorrect?

I think it will be mostly IO - it may or may not be expensive
depending on how IO bound your box already is.

Splitting directly to a different server would be cool, but would
seem to require some sort of Directory implementation that streams
things over the network rather than just storing locally on disk.  It's
something I think we want in the future, but was a bit too much to
bite off for the first iteration of this feature.

> Lastly, also seems like we don't have control over where the replicas
> of the split shards go?

Seems like a good idea to optionally allow this...

-Yonik
http://lucidworks.com


Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Jack Krupansky
By all means, feel free to write about how users can in fact do custom code 
for Solr, but just keep a clear distinction between what could be developed 
and what is actually available off the shelf.


Yes, this list does have a mix of pure users and those who are willing to 
customize code as well. I didn't mean to discourage or denigrate the latter, 
just to highlight that doing custom code is not the same as solutions being 
available off the shelf.


-- Jack Krupansky

-Original Message- 
From: Roman Chyla

Sent: Wednesday, July 17, 2013 12:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

As I can't see into the heads of the users, I can only make assumptions
- but OK, it seems reasonable that only a minority of users here are actually
willing to do more (btw, I've received coding advice in the past here on
this list). I am working under the assumption that Lucene/SOLR devs are
swamped (there are always more requests and many unclosed JIRA issues), so
where else would they get a helping hand than from users of this list? Users
like me, for example.

roman


On Wed, Jul 17, 2013 at 11:59 AM, Jack Krupansky wrote:



Remember, this is the "users" list, not the "dev" list. Users want to know
what they can do and use off the shelf today, not what "could" be
developed. Hopefully, the situation will be brighter in six months or a
year, but today... is today, not tomorrow.

(And, in fact, users can use LucidWorks Search for query-time phrase
synonyms, off-the-shelf, today, no patches required.)


-- Jack Krupansky

-Original Message- From: Roman Chyla
Sent: Wednesday, July 17, 2013 11:44 AM

To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

OK, let's do a simple test instead of making claims - take your solr
instance, any version greater than or equal to 4.0

In your schema.xml, pick a field and add the synonym filter
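
for example (attribute values are illustrative):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>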



in your synonyms.txt, add these entries:

hubble\0space\0telescope, HST

ATTENTION: the \0 is a null byte, and it must be written as a null byte! You can
do it with: python -c "print \"hubble\0space\0telescope,HST\"" > synonyms.txt

send a phrase query q=field:"hubble space telescope"&debugQuery=true

if you have done it right, you will see 'HST' is in the list - this means
solr is able to recognize the multi-token synonym! As far as recognition is
concerned, there is no need for more work on FST.

I have written a big unittest that proves the point (9 months ago,
LUCENE-4499) making no changes in the way the FST works. What is missing is
the query parser that can take advantage of it - another JIRA issue.

I'll repeat my claim now: the solution(s) are there, they solve the problem
completely - they are not inside one JIRA issue, but they are there. They
need to be proven wrong, NOT proclaimed incomplete.


roman


On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky wrote:

To the best of my knowledge, there is no patch or collection of patches
which constitutes a "working solution" - just partial solutions.

Yes, it is true, there is some FST work underway (active??) that shows
promise depending on query parser implementation, but again, this is all a
longer-term future, not a "here and now". Maybe in the 5.0 timeframe?

I don't want anyone to get the impression that there are off-the-shelf
patches that completely solve the synonym phrase problem. Yes, progress is
being made, but we're not there yet.

-- Jack Krupansky

-Original Message- From: Roman Chyla
Sent: Wednesday, July 17, 2013 9:58 AM
To: solr-user@lucene.apache.org

Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

Hi all,

What I find very 'sad' is that Lucene/SOLR contain all the necessary
components for handling multi-token synonyms; the Finite State Automaton
works perfectly for matching these items; the biggest problem is IMO the
old query parser, which splits things on spaces and doesn't know how to be
smarter.

THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
none was committed... sigh, we are re-inventing the wheel all the time...)

LUCENE-1622
LUCENE-4381
LUCENE-4499


The problem of synonym expansion is more difficult because of the parsing -
the default parsers are not flexible and they split on whitespace -
recently I have proposed a solution which also makes the multi-token
synonym expansion simple

this is the ticket:
https://issues.apache.org/jira/browse/LUCENE-5014


that query parser is able to split on spaces, then look back, do the second
pass to see whether to expand with synonyms - and even discover different
parse paths and construct different queries based on that. If you want to
see some complex examples, look at:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs

Ability to specify the server where shard splits go?

2013-07-17 Thread Timothy Potter
This is not a problem per se, just want to verify that we're not able
to specify which server shard splits are created as of 4.3.1? From
what I've seen, the new cores for the sub-shards are created on the
leader of the shard being split.

Of course it's easy enough to migrate the new sub-shards to another
node after the fact especially since replication occurs automatically
for the splits.

Seems like if the shard being split is large enough, doing the
split on the same node could cause some resource issues, so it might be
better to do the split on another server. Or is my assumption that the
split operation is pretty expensive incorrect?

Lastly, also seems like we don't have control over where the replicas
of the split shards go?

Cheers,
Tim


Re: Solr index lot of pdf, doc, txt

2013-07-17 Thread Alexandre Rafalovitch
You don't seem to be too creative with your doc_id values, so perhaps you
can use Solr 4's post.jar recursive option:
http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
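
For example, something like (see the wiki page above for the exact flags):

  java -Dauto -Drecursive -jar post.jar /opt/solr/documents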

Otherwise, you need to correlate the ID and the source file somehow, so you
probably need a file with ID and location fields and then use
DataImportHandler with nested entities to do so.
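
A data-config.xml for that could look roughly like the sketch below
(untested; the base directory, the doc_id mapping via TemplateTransformer
and the "content" target field are assumptions - adjust to your schema):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- outer entity: list the files on disk -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/opt/solr/documents" fileName=".*\.(pdf|docx|txt)"
            recursive="true" rootEntity="false">
      <!-- inner entity: extract text from each file with Tika -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin" format="text"
              transformer="TemplateTransformer">
        <!-- hypothetical mapping of the file name to the doc_id key -->
        <field column="doc_id" template="${files.file}"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>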

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Jul 17, 2013 at 12:15 PM, sodoo  wrote:

> Hi guys.
>
> I need to index a lot of pdf, doc, txt files.
> Now I index them manually with the commands below.
>
> # PDF INDEX
> curl
> "
> http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test.pdf&literal.doc_id=pdf_1&commit=true
> "
>
> # TXT INDEX
> curl
> "
> http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test1.txt&literal.doc_id=txt_1&commit=true
> "
>
> # WORD DOC INDEX
> curl
> "
> http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test2.docx&literal.doc_id=doc_1&commit=true
> "
>
> But this is a bad solution, because I have almost 100 pdf, 200 docx and 50
> txt files, and more documents are added day by day.
>
> I need a good solution.
>
> Please assist me on this and advise me.
>
> Thanks.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-index-lot-of-pdf-doc-txt-tp4078651.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Solr index lot of pdf, doc, txt

2013-07-17 Thread sodoo
Hi guys. 

I need to index a lot of pdf, doc, txt files.
Now I index them manually with the commands below.

# PDF INDEX
curl
"http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test.pdf&literal.doc_id=pdf_1&commit=true";

# TXT INDEX
curl
"http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test1.txt&literal.doc_id=txt_1&commit=true";

# WORD DOC INDEX
curl
"http://localhost:8983/solr/update/extract?stream.file=/opt/solr/documents/test2.docx&literal.doc_id=doc_1&commit=true";

But this is a bad solution, because I have almost 100 pdf, 200 docx and 50
txt files, and more documents are added day by day.

I need a good solution.

Please assist me on this and advise me.

Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-index-lot-of-pdf-doc-txt-tp4078651.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: and performance

2013-07-17 Thread Shawn Heisey

On 7/17/2013 9:35 AM, Ayman Plaha wrote:
> In my solrconfig.xml I've got these caching configs by default which I
> don't think I will need. Since my index is updated with new documents
> every 3 minutes, caching anything would be pointless. Am I on the right
> track?
>
> [the quoted solrconfig.xml cache elements were stripped by the archive]
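>
> (For reference, the stock Solr 4.x example solrconfig.xml declares these
> caches roughly as below; the classes and sizes are assumptions taken
> from the example config, not from the poster's actual file:)
>
> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
> <documentCache class="solr.LRUCache" size="512" initialSize="512"/>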

That depends on how many queries you expect to have in those three 
minutes.  The Solr caches help performance because they cause Solr to 
entirely skip the process of gathering the data for a query that has 
been done before.


I would say that you should never turn Solr's caches completely off, and 
small autowarmCount values are probably a good idea unless you are 
indexing extremely often.  Every three minutes isn't often.  I index 
once a minute, and I don't consider that to be often.


Further info: you can't directly warm the documentCache, so putting a 
value in that autowarmCount doesn't do anything.  The example config 
should have a comment that says this.


> If yes, say I have an index of 80GB; since I won't be using caches I
> won't have to worry about having too much RAM? Maybe just enough RAM for
> processing? Say 8GB RAM and 240GB SSD?

Solr caches and the OS disk cache are very different things.  The OS 
disk cache is something that the operating system does at all times for 
most programs, if there is free memory.  Solr performance will be 
TERRIBLE if you don't have enough RAM for the OS to cache the index.
The reason for this is simple - reading off the disk is orders of
magnitude slower than reading from RAM.  SSD is very fast, but still
not as fast as RAM.


If you have an index of 80GB, then your OS disk cache should be at LEAST 
40GB, and 80GB is better, so a total memory size of 64-128GB would be 
about right for an 80GB index on spinning disks, assuming Solr is the 
only thing on the machine.


If you have your index on SSD, then I would say you should have between 
20 and 40GB for your OS disk cache, which means that 24-48GB of total 
RAM would be the right size for an 80GB index on SSD.


No matter what kind of disks you have, more RAM is better.

Thanks,
Shawn


Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
As I can't see into the heads of the users, I can make different assumptions
- but OK, it seems reasonable that only a minority of users here are actually
willing to do more (btw, I've received coding advice in the past here on
this list). I am working under the assumption that Lucene/SOLR devs are
swamped (there are always more requests and many unclosed JIRA issues), so
where else would they get a helping hand than from the users of this list?
Users like me, for example.

roman


On Wed, Jul 17, 2013 at 11:59 AM, Jack Krupansky wrote:

> Remember, this is the "users" list, not the "dev" list. Users want to know
> what they can do and use off the shelf today, not what "could" be
> developed. Hopefully, the situation will be brighter in six months or a
> year, but today... is today, not tomorrow.
>
> (And, in fact, users can use LucidWorks Search for query-time phrase
> synonyms, off-the-shelf, today, no patches required.)
>
>
> -- Jack Krupansky
>
> -Original Message- From: Roman Chyla
> Sent: Wednesday, July 17, 2013 11:44 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Searching w/explicit Multi-Word Synonym Expansion
>
> OK, let's do a simple test instead of making claims - take your solr
> instance, anything bigger or equal to version 4.0
>
> In your schema.xml, pick a field and add the synonym filter
>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"
> tokenizerFactory="solr.KeywordTokenizerFactory" />
>
> in your synonyms.txt, add these entries:
>
> hubble\0space\0telescope, HST
>
> ATTENTION: the \0 is a null byte; it must be written as a null byte! You can
> do it with: python -c "print \"hubble\0space\0telescope,HST\"" >
> synonyms.txt
>
> send a phrase query q=field:"hubble space telescope"&debugQuery=true
>
> if you have done it right, you will see 'HST' is in the list - this means,
> solr is able to recognize the multi-token synonym! As far as recognition is
> concerned, there is no need for more work on FST.
>
> I have written a big unittest that proves the point (9 months ago,
> LUCENE-4499) making no changes in the way how FST works. What is missing is
> the query parser that can take advantage - another JIRA issue.
>
> I'll repeat my claim now: the solution(s) are there, they solve the problem
> completely - they are not inside one JIRA issue, but they are there. They
> need to be proven wrong, NOT proclaimed incomplete.
>
>
> roman
>
>
> On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky wrote:
>
>> To the best of my knowledge, there is no patch or collection of patches
>> which constitutes a "working solution" - just partial solutions.
>>
>> Yes, it is true, there is some FST work underway (active??) that shows
>> promise depending on query parser implementation, but again, this is all a
>> longer-term future, not a "here and now". Maybe in the 5.0 timeframe?
>>
>> I don't want anyone to get the impression that there are off-the-shelf
>> patches that completely solve the synonym phrase problem. Yes, progress is
>> being made, but we're not there yet.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Roman Chyla
>> Sent: Wednesday, July 17, 2013 9:58 AM
>> To: solr-user@lucene.apache.org
>>
>> Subject: Re: Searching w/explicit Multi-Word Synonym Expansion
>>
>> Hi all,
>>
>> What I find very 'sad' is that Lucene/SOLR contain all the necessary
>> components for handling multi-token synonyms; the Finite State Automaton
>> works perfectly for matching these items; the biggest problem is IMO the
>> old query parser, which splits things on spaces and doesn't know how to
>> be smarter.
>>
>> THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
>> none was committed...sigh, we are re-inventing wheel all the time...)
>>
>> LUCENE-1622
>> LUCENE-4381
>> LUCENE-4499
>>
>>
>> The problem of synonym expansion is more difficult because of the parsing -
>> the default parsers are not flexible and they split on empty space -
>> recently I have proposed a solution which also makes multi-token
>> synonym expansion simple
>>
>> this is the ticket:
>> https://issues.apache.org/jira/browse/LUCENE-5014
>>
>>
>> that query parser is able to split on spaces, then look back, do the
>> second pass to see whether to expand with synonyms - and even discover
>> different parse paths and construct different queries based on that. if
>> you want to see some complex examples, look at:
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Jack Krupansky
Remember, this is the "users" list, not the "dev" list. Users want to know 
what they can do and use off the shelf today, not what "could" be developed. 
Hopefully, the situation will be brighter in six months or a year, but 
today... is today, not tomorrow.


(And, in fact, users can use LucidWorks Search for query-time phrase 
synonyms, off-the-shelf, today, no patches required.)


-- Jack Krupansky

-Original Message- 
From: Roman Chyla

Sent: Wednesday, July 17, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

OK, let's do a simple test instead of making claims - take your solr
instance, anything bigger or equal to version 4.0

In your schema.xml, pick a field and add the synonym filter

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"
        tokenizerFactory="solr.KeywordTokenizerFactory" />

in your synonyms.txt, add these entries:

hubble\0space\0telescope, HST

ATTENTION: the \0 is a null byte; it must be written as a null byte! You can
do it with: python -c "print \"hubble\0space\0telescope,HST\"" >
synonyms.txt

send a phrase query q=field:"hubble space telescope"&debugQuery=true

if you have done it right, you will see 'HST' is in the list - this means,
solr is able to recognize the multi-token synonym! As far as recognition is
concerned, there is no need for more work on FST.

I have written a big unittest that proves the point (9 months ago,
LUCENE-4499) making no changes in the way how FST works. What is missing is
the query parser that can take advantage - another JIRA issue.

I'll repeat my claim now: the solution(s) are there, they solve the problem
completely - they are not inside one JIRA issue, but they are there. They
need to be proven wrong, NOT proclaimed incomplete.


roman


On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky 
wrote:



To the best of my knowledge, there is no patch or collection of patches
which constitutes a "working solution" - just partial solutions.

Yes, it is true, there is some FST work underway (active??) that shows
promise depending on query parser implementation, but again, this is all a
longer-term future, not a "here and now". Maybe in the 5.0 timeframe?

I don't want anyone to get the impression that there are off-the-shelf
patches that completely solve the synonym phrase problem. Yes, progress is
being made, but we're not there yet.

-- Jack Krupansky

-Original Message- From: Roman Chyla
Sent: Wednesday, July 17, 2013 9:58 AM
To: solr-user@lucene.apache.org

Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

Hi all,

What I find very 'sad' is that Lucene/SOLR contain all the necessary
components for handling multi-token synonyms; the Finite State Automaton
works perfectly for matching these items; the biggest problem is IMO the
old query parser, which splits things on spaces and doesn't know how to
be smarter.

THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
none was committed...sigh, we are re-inventing wheel all the time...)

LUCENE-1622
LUCENE-4381
LUCENE-4499


The problem of synonym expansion is more difficult because of the
parsing - the default parsers are not flexible and they split on empty
space - recently I have proposed a solution which also makes multi-token
synonym expansion simple

this is the ticket:
https://issues.apache.org/jira/browse/LUCENE-5014

that query parser is able to split on spaces, then look back, do the
second pass to see whether to expand with synonyms - and even discover
different parse paths and construct different queries based on that. if
you want to see some complex examples, look at:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
-
eg. line 373, 483


Lucene/SOLR developers are already doing great work and have much to do -
they need help from everybody who is able to apply patch, test it and
report back to JIRA.

roman



On Wed, Jul 17, 2013 at 9:37 AM, dmarini 
wrote:

 iorixxx,


Thanks for pointing me in the direction of the QueryElevation component.
If it did not require that the target documents be keyed by the unique key
field it would be ideal, but since our Sku field is not the Unique field
(we have an internal id which serves as the key while this is the client's
key) it doesn't seem like it will match unless I make a larger scope change.

Jack,

I agree that out of the box there hasn't been a generalized solution for
this yet. I guess what I'm looking for is confirmation that I've gone as
far as I can properly and from this point need to consider using something
like the HON custom query parser component (which we're leery of using
because from my reading it solves a specific scenario that may overcompensate
what we're attempting to fix). I would personally rather stay IN solr than
add custom .jar files from around the web if at all possible.

Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
OK, let's do a simple test instead of making claims - take your solr
instance, anything bigger or equal to version 4.0

In your schema.xml, pick a field and add the synonym filter

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"
        tokenizerFactory="solr.KeywordTokenizerFactory" />

in your synonyms.txt, add these entries:

hubble\0space\0telescope, HST

ATTENTION: the \0 is a null byte; it must be written as a null byte! You can
do it with: python -c "print \"hubble\0space\0telescope,HST\"" >
synonyms.txt

send a phrase query q=field:"hubble space telescope"&debugQuery=true

if you have done it right, you will see 'HST' is in the list - this means,
solr is able to recognize the multi-token synonym! As far as recognition is
concerned, there is no need for more work on FST.

I have written a big unittest that proves the point (9 months ago,
LUCENE-4499) making no changes in the way how FST works. What is missing is
the query parser that can take advantage - another JIRA issue.

I'll repeat my claim now: the solution(s) are there, they solve the problem
completely - they are not inside one JIRA issue, but they are there. They
need to be proven wrong, NOT proclaimed incomplete.


roman


On Wed, Jul 17, 2013 at 10:22 AM, Jack Krupansky wrote:

> To the best of my knowledge, there is no patch or collection of patches
> which constitutes a "working solution" - just partial solutions.
>
> Yes, it is true, there is some FST work underway (active??) that shows
> promise depending on query parser implementation, but again, this is all a
> longer-term future, not a "here and now". Maybe in the 5.0 timeframe?
>
> I don't want anyone to get the impression that there are off-the-shelf
> patches that completely solve the synonym phrase problem. Yes, progress is
> being made, but we're not there yet.
>
> -- Jack Krupansky
>
> -Original Message- From: Roman Chyla
> Sent: Wednesday, July 17, 2013 9:58 AM
> To: solr-user@lucene.apache.org
>
> Subject: Re: Searching w/explicit Multi-Word Synonym Expansion
>
> Hi all,
>
> What I find very 'sad' is that Lucene/SOLR contain all the necessary
> components for handling multi-token synonyms; the Finite State Automaton
> works perfectly for matching these items; the biggest problem is IMO the
> old query parser, which splits things on spaces and doesn't know how to
> be smarter.
>
> THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
> none was committed...sigh, we are re-inventing wheel all the time...)
>
> LUCENE-1622
> LUCENE-4381
> LUCENE-4499
>
>
> The problem of synonym expansion is more difficult because of the parsing -
> the default parsers are not flexible and they split on empty space -
> recently I have proposed a solution which also makes multi-token
> synonym expansion simple
>
> this is the ticket:
> https://issues.apache.org/jira/browse/LUCENE-5014
>
> that query parser is able to split on spaces, then look back, do the second
> pass to see whether to expand with synonyms - and even discover different
> parse paths and construct different queries based on that. if you want to
> see some complex examples, look at:
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
> -
> eg. line 373, 483
>
>
> Lucene/SOLR developers are already doing great work and have much to do -
> they need help from everybody who is able to apply patch, test it and
> report back to JIRA.
>
> roman
>
>
>
> On Wed, Jul 17, 2013 at 9:37 AM, dmarini 
> wrote:
>
>  iorixxx,
>>
>> Thanks for pointing me in the direction of the QueryElevation component.
>> If it did not require that the target documents be keyed by the unique key
>> field it would be ideal, but since our Sku field is not the Unique field
>> (we have an internal id which serves as the key while this is the client's
>> key) it doesn't seem like it will match unless I make a larger scope change.
>>
>> Jack,
>>
>> I agree that out of the box there hasn't been a generalized solution for
>> this yet. I guess what I'm looking for is confirmation that I've gone as
>> far as I can properly and from this point need to consider using something
>> like the HON custom query parser component (which we're leery of using
>> because from my reading it solves a specific scenario that may overcompensate
>> what we're attempting to fix). I would personally rather stay IN solr than
>> add custom .jar files from around the web if at all possible.
>>
>> Thanks for the replies.
>>
>> --Dave
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

Re: shards param fails with cores

2013-07-17 Thread Dmitry Kan
All clear. There seems to be a mis-config on my side as the vanilla solr
4.3.1 package works just fine with the described setup.
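
For reference, a working distributed request against that setup lists every
shard, comma-separated (a sketch with the hosts/ports from this thread):

http://localhost:8984/solr/core1/select?q=test&shards=localhost:8983/solr/core0,localhost:8983/solr/core1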


On Wed, Jul 17, 2013 at 4:21 PM, Dmitry Kan  wrote:

> Hi list,
>
> I have set up two cores (=collections):
>
> http://localhost:8983/solr/core0
> http://localhost:8983/solr/core1
>
> In addition the following has been set up:
> http://localhost:8984/solr/core0
> http://localhost:8984/solr/core1
>
> I'm trying to query the first via the second like this:
>
>
> http://localhost:8984/solr/core1/select?q=test&shards=localhost:8983/solr/core0
>
> But an error comes as a response:
>
> "Server at http://localhost:8983/solr returned non ok status:404,
> message:Not Found"
>
> What am I doing wrong?
>
> Thanks,
> Dmitry
>


Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-17 Thread Jack Krupansky
Maybe as a first step, it would be nice to have logging that summarized the 
count of actual inserts, replacements, actual deletions, and even 
atomic/partial updates.


The LogUpdateProcessor outputs some information, like a subset of the 
document IDs, but not the insert vs. replace/update counts.


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Wednesday, July 17, 2013 10:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How can I learn the total count of how many documents indexed 
and how many documents updated?


I will open a Jira for it and apply a patch, thanks.

2013/7/17 Jack Krupansky 


I don't think that breakdown is readily available from Solr.

Sounds like a good Jira request for improvement in the response.

-- Jack Krupansky

-Original Message- From: Furkan KAMACI
Sent: Wednesday, July 17, 2013 10:06 AM
To: solr-user@lucene.apache.org
Subject: How can I learn the total count of how many documents indexed and
how many documents updated?


I have crawled some web pages and indexed them at my SolrCloud (Solr
4.2.1). However, before I indexed them there were already some indexed
documents. I can calculate the difference between the current and previous
document counts. However, it doesn't mean that I have indexed that count
of documents, because URLs of websites are unique ids in my system. So
some of the documents were updated and did not increase the document count.

My question is: how can I learn the total count of how many documents
were indexed and how many documents were updated?





Re: and performance

2013-07-17 Thread Ayman Plaha
Wow! Thanks Shawn. That's great info and helped and thanks for the
SolrPerformance article link, great article, helped a lot :)

I can't use Cloud hosting now since they charge on the basis of memory used
and it would be too expensive, and like you said, RAM and SSD are what I
need for SOLR performance.

In my solrconfig.xml I've got these caching configs by default which I don't
think I will need. Since my index is updated with new documents every 3
minutes, caching anything would be pointless. Am I on the right track?

[the solrconfig.xml cache elements were stripped by the archive; see the
reconstruction quoted in Shawn's reply above]

If yes, say I have an index of 80GB; since I won't be using caches I
won't have to worry about having too much RAM? Maybe just enough RAM for
processing? Say 8GB RAM and 240GB SSD?


On Thu, Jul 18, 2013 at 12:00 AM, Shawn Heisey  wrote:

> On 7/17/2013 1:22 AM, Ayman Plaha wrote:
> >*will this affect the query performance of the client website if the
> >index grew to 10 million records? I mean while the commit is happening
> >does that *affect the performance of queries* and how will this affect
> >the queries if the index grew to 10 million records?
>
> Every time you commit and open a new searcher, any data in the caches
> that Solr itself creates is wiped.  If you have configured autowarming,
> then it will use keys from the old cache to repopulate the new cache, by
> using those keys as queries on the index.  If autowarmCount is high,
> those warming queries can take a long time and put quite a load on the
> index.  While the warming is happening, the old searcher continues to
> process queries.
>
> >- What *hosting specs* should I get ? How much RAM ? Considering my
> >- client application is very simple that just register users to
> database
> >and queries SOLR and displays SOLR results.
>
> This is almost impossible to answer.  Even if you can give us more
> statistics about your setup, the only way to REALLY know is to
> experiment.  I can give you some basic guidelines:
>
> 1) Get as much processing power as you can reasonably afford, but
> understand that I/O and RAM are likely to play a bigger role in Solr
> performance than bleeding-edge CPU power.
>
> 2) Multi-disk RAID10 or SSD performs best for an I/O layer.
>
> 3) For RAM, if Solr is the only thing running on the machine, the ideal
> amount is the size of your index on disk, plus the Solr JVM size, plus a
> little bit (1GB or less) for the OS.  This lets the OS cache the entire
> index in RAM.  Because the OS disk cache is very smart, you may be able
> to run effectively with less RAM, especially if you use SSD.  If the
> available OS disk cache is too small, performance will really suffer.
>
> If Solr is not the only thing running on the machine, then you need to
> add the RAM requirements of the other processes.  Those RAM requirements
> may extend beyond the memory required for the processes themselves,
> because other programs usually benefit from OS disk caching as well.
>
> Running only Solr on the server is recommended.  If you are running in
> SolrCloud mode, it's normal to also run one of the required zookeeper
> instances on the same hardware, because zookeeper requirements are very
> small.
>
> Some basic information about RAM sizing can be found on this wiki page:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>


Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-17 Thread Furkan KAMACI
I will open a Jira for it and apply a patch, thanks.

2013/7/17 Jack Krupansky 

> I don't think that breakdown is readily available from Solr.
>
> Sounds like a good Jira request for improvement in the response.
>
> -- Jack Krupansky
>
> -Original Message- From: Furkan KAMACI
> Sent: Wednesday, July 17, 2013 10:06 AM
> To: solr-user@lucene.apache.org
> Subject: How can I learn the total count of how many documents indexed and
> how many documents updated?
>
>
> I have crawled some web pages and indexed them at my SolrCloud (Solr 4.2.1).
> However, before I indexed them there were already some indexed documents. I
> can calculate the difference between the current and previous document
> counts. However, it doesn't mean that I have indexed that count of
> documents, because URLs of websites are unique ids in my system. So some of
> the documents were updated and did not increase the document count.
>
> My question is: how can I learn the total count of how many documents
> were indexed and how many documents were updated?
>


RE: Config changes in solr.DirectSolrSpellCheck after index is built?

2013-07-17 Thread Dyer, James
DirectSolrSpellChecker does not create a dictionary.  It uses the field you
specify and works off the Lucene term dictionary.  It uses some of the same
code Fuzzy Search uses to calculate the distance between user input and
indexed terms.

If you're wondering about the effect of configuration changes you make, let
us see the "before" and "after" configuration, and we can probably give
more specifics.
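
For anyone following along, a typical DirectSolrSpellChecker definition in
solrconfig.xml looks roughly like the sketch below, adapted from the stock
4.x example (the "spell" field name and the tuning values are assumptions,
not the poster's actual config):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="minQueryLength">4</int>
  </lst>
</searchComponent>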

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, July 17, 2013 7:20 AM
To: solr-user@lucene.apache.org
Subject: Re: Config changes in solr.DirectSolrSpellCheck after index is built?

I don't know the code well, but anything that
mentions "index based spellcheck" would
presumably require re-indexing.

But I'd also guess it depends on the changes.
Any changes to _how_ the index is _used_
shouldn't require re-indexing. But changing
how the tokens are put _into_ the index should.

But like I said, I'm speculating a bit.

Best
Erick

On Tue, Jul 16, 2013 at 10:48 AM, Brendan Grainger wrote:
> Hi All,
>
> Can you change the configuration of a spellchecker
> using solr.DirectSolrSpellCheck after you've built an index? I know that
> this spellchecker doesn't build and index off to the side like
> the IndexBasedSpellChecker so I'm wondering what's happening internally to
> create a spellchecking dictionary.
>
> Thanks
> Brendan
>
> --
> Brendan Grainger
> www.kuripai.com




Re: How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-17 Thread Jack Krupansky

I don't think that breakdown is readily available from Solr.

Sounds like a good Jira request for improvement in the response.

-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Wednesday, July 17, 2013 10:06 AM
To: solr-user@lucene.apache.org
Subject: How can I learn the total count of how many documents indexed and 
how many documents updated?


I have crawled some web pages and indexed them at my SolrCloud (Solr 4.2.1).
However, before I indexed them there were already some indexed documents. I
can calculate the difference between the current and previous document
counts. However, it doesn't mean that I have indexed that count of
documents, because URLs of websites are unique ids in my system. So some of
the documents were updated and did not increase the document count.

My question is: how can I learn the total count of how many documents
were indexed and how many documents were updated?



Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Jack Krupansky
To the best of my knowledge, there is no patch or collection of patches 
which constitutes a "working solution" - just partial solutions.


Yes, it is true, there is some FST work underway (active??) that shows 
promise depending on query parser implementation, but again, this is all a 
longer-term future, not a "here and now". Maybe in the 5.0 timeframe?


I don't want anyone to get the impression that there are off-the-shelf 
patches that completely solve the synonym phrase problem. Yes, progress is 
being made, but we're not there yet.


-- Jack Krupansky

-Original Message- 
From: Roman Chyla

Sent: Wednesday, July 17, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

Hi all,

What I find very 'sad' is that Lucene/SOLR contain all the necessary
components for handling multi-token synonyms; the Finite State Automaton
works perfectly for matching these items; the biggest problem is IMO the
old query parser, which splits things on spaces and doesn't know how to
be smarter.

THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
none was committed...sigh, we are re-inventing wheel all the time...)

LUCENE-1622
LUCENE-4381
LUCENE-4499


The problem of synonym expansion is more difficult because of the parsing -
the default parsers are not flexible and they split on empty space -
recently I have proposed a solution which also makes multi-token
synonym expansion simple

this is the ticket:
https://issues.apache.org/jira/browse/LUCENE-5014

that query parser is able to split on spaces, then look back, do the second
pass to see whether to expand with synonyms - and even discover different
parse paths and construct different queries based on that. if you want to
see some complex examples, look at:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
-
eg. line 373, 483


Lucene/SOLR developers are already doing great work and have much to do -
they need help from everybody who is able to apply patch, test it and
report back to JIRA.

roman



On Wed, Jul 17, 2013 at 9:37 AM, dmarini  wrote:


iorixxx,

Thanks for pointing me in the direction of the QueryElevation component.
If it did not require that the target documents be keyed by the unique key
field it would be ideal, but since our Sku field is not the Unique field
(we have an internal id which serves as the key while this is the client's
key) it doesn't seem like it will match unless I make a larger scope change.

Jack,

I agree that out of the box there hasn't been a generalized solution for
this yet. I guess what I'm looking for is confirmation that I've gone as
far as I can properly and from this point need to consider using something
like the HON custom query parser component (which we're leery of using
because from my reading it solves a specific scenario that may overcompensate
what we're attempting to fix). I would personally rather stay IN solr than
add custom .jar files from around the web if at all possible.

Thanks for the replies.

--Dave





--
View this message in context:
http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Jack Krupansky
I would also note that the LucidWorks Search query parser implements
query-time synonym phrases. I don't know if anybody has anything better
than that. Unfortunately, it is proprietary and is more of a workaround
for current Lucene/Solr limitations than a long-term solution.


-- Jack Krupansky

-Original Message- 
From: dmarini

Sent: Wednesday, July 17, 2013 9:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching w/explicit Multi-Word Synonym Expansion

iorixxx,

Thanks for pointing me in the direction of the QueryElevation component. If
it did not require that the target documents be keyed by the unique key
field it would be ideal, but since our Sku field is not the Unique field (we
have an internal id which serves as the key while this is the client's key)
it doesn't seem like it will match unless I make a larger scope change.

Jack,

I agree that out of the box there hasn't been a generalized solution for
this yet. I guess what I'm looking for is confirmation that I've gone as far
as I can properly and from this point need to consider using something like
the HON custom query parser component (which we're leery of using because
from my reading it solves a specific scenario that may overcompensate what
we're attempting to fix). I would personally rather stay IN solr than add
custom .jar files from around the web if at all possible.

Thanks for the replies.

--Dave





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Does early EOF Results With Document Loss To Index?

2013-07-17 Thread Shawn Heisey
On 7/17/2013 8:02 AM, Furkan KAMACI wrote:
> At my indexing process to my SolrCloud (Solr 4.2.1) from Hadoop I got an
> error. What is the reason, and does it result in document loss for indexing?
> 
> ERROR - 2013-07-17 16:30:01.453; org.apache.solr.common.SolrException;
> java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
> early EOF

Every time I've seen this particular exception, it has been caused by
the client disconnecting before the request completes.  If you are
seeing it when doing index updates, then your client is behaving badly
and not following the HTTP and/or TCP specifications, or it has been
defined with a socket timeout that is not long enough for the update to
complete.  It's generally not a good idea to define a socket timeout for
any kind of request other than a query.
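
With SolrJ that advice translates to something like the sketch below (the
URL is hypothetical; HttpSolrServer is the 4.x client class):

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
server.setConnectionTimeout(5000); // fail fast if the server is unreachable
server.setSoTimeout(0);            // 0 = no socket read timeout, so long updates aren't cut off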

As for whether it is resulting in data loss, only you can answer that.
Is the stacktrace preceded or followed in the log by an update request
that has a list of ID values?  If so, then the updates are probably
working, but your client is not getting the response.

Thanks,
Shawn



Re: Solr 4.3.1: Errors When Attempting to Index LatLon Fields

2013-07-17 Thread Smiley, David W.
Another problem, in addition to dynamicField being declared in the wrong
place, is that you've declared your geoFindspot field as multi-valued.
LatLonType can't handle that.  Use location_rpt in the example schema to
get a multi-value capable geo field.
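
In schema.xml that would look roughly like the following (mirroring the
Solr 4.x example schema declarations - a sketch, not your actual config):

<field name="geoFindspot" type="location_rpt" indexed="true" stored="true" multiValued="true"/>

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>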

~ David

On 7/15/13 5:10 PM, "Scott Vanderbilt"  wrote:

>I'm trying to index documents containing geo-spatial coordinates using
>Solr 4.3.1 and am running into some difficulties. Whenever I attempt to
>index a particular document containing a geospatial coordinate pair
>(using post.jar), the operation fails as follows:
>
>   SimplePostTool version 1.5
>   Posting files to base url http://localhost:8080/solr/update using
>   content-type application/xml..
>   POSTing file rib1.xml
>   SimplePostTool: WARNING: Solr returned an error #400 Bad Request
>   SimplePostTool: WARNING: IOException while reading response:
>  java.io.IOException: Server returned HTTP response code: 400 for
>  URL: http://localhost:8080/solr/update
>   1 files indexed.
>   COMMITting Solr index changes to http://localhost:8080/solr/update..
>   Time spent: 0:00:00.063
>
>The solr log shows the following:
>
>   08:30:39 ERROR SolrCore org.apache.solr.common.SolrException:
> undefined field: "geoFindspot_0_coordinate"
>
>The relevant parts of my schema.xml are:
>
>   <field name="geoFindspot" type="location" indexed="true"
> stored="true" multiValued="true"/>
>   ...
>   <fieldType name="location" class="solr.LatLonType"
> subFieldSuffix="_coordinate"/>
>   <dynamicField name="*_coordinate" type="tdouble" indexed="true"
> stored="false" />
>
>The document I am attempting to index has this field:
>
>   <field name="geoFindspot">51.512332,-0.090588</field>
>
>As far as I can tell, my configuration complies with the instructions on
>the relevant Wiki page (http://wiki.apache.org/solr/SpatialSearch) and I
>can see nothing amiss.
>
>Any suggestions as to why this is failing would be greatly appreciated.
>Thank you!
>



How can I learn the total count of how many documents indexed and how many documents updated?

2013-07-17 Thread Furkan KAMACI
I have crawled some web pages and indexed them at my SolrCloud (Solr 4.2.1).
However, before I indexed them there were already some indexed documents. I
can calculate the difference between the current and previous document
counts. However, it doesn't mean that I have indexed that count of
documents, because URLs of websites are unique ids in my system. So some of
the documents were updated and did not increase the document count.

My question is: how can I learn the total count of how many documents
were indexed and how many documents were updated?


Does early EOF Results With Document Loss To Index?

2013-07-17 Thread Furkan KAMACI
At my indexing process to my SolrCloud (Solr 4.2.1) from Hadoop I got an
error. What is the reason, and does it result in document loss for indexing?

ERROR - 2013-07-17 16:30:01.453; org.apache.solr.common.SolrException;
java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
early EOF
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.eclipse.jetty.io.EofException: early EOF
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
at java.io.InputStream.read(InputStream.java:101)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
at
com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
at
com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at
com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 36 more

ERROR - 2013-07-17 16:30:01.455; org.apache.solr.common.SolrException;
null:java.lang.RuntimeException: [was class
org.eclipse.jetty.io.EofException] early EOF
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
at
org.apache.solr.handler.loader.XMLLoader.processUpda

Re: and performance

2013-07-17 Thread Shawn Heisey
On 7/17/2013 1:22 AM, Ayman Plaha wrote:
>*will this affect the query performance of the client website if the
>index grew to 10 million records? I mean while the commit is happening
>does that *affect the performance of queries* and how will this affect
>the queries if the index grew to 10 million records?

Every time you commit and open a new searcher, any data in the caches
that Solr itself creates is wiped.  If you have configured autowarming,
then it will use keys from the old cache to repopulate the new cache, by
using those keys as queries on the index.  If autowarmCount is high,
those warming queries can take a long time and put quite a load on the
index.  While the warming is happening, the old searcher continues to
process queries.
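
(Explicit warming queries can also be configured with a newSearcher
listener in solrconfig.xml; the sketch below uses the stock
QuerySenderListener with a made-up warming query:)

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical warming query: pre-populates caches and sort fields -->
    <lst><str name="q">*:*</str><str name="sort">price asc</str></lst>
  </arr>
</listener>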

>- What *hosting specs* should I get ? How much RAM ? Considering my
>- client application is very simple that just register users to database
>and queries SOLR and displays SOLR results.

This is almost impossible to answer.  Even if you can give us more
statistics about your setup, the only way to REALLY know is to
experiment.  I can give you some basic guidelines:

1) Get as much processing power as you can reasonably afford, but
understand that I/O and RAM are likely to play a bigger role in Solr
performance than bleeding-edge CPU power.

2) Multi-disk RAID10 or SSD performs best for an I/O layer.

3) For RAM, if Solr is the only thing running on the machine, the ideal
amount is the size of your index on disk, plus the Solr JVM size, plus a
little bit (1GB or less) for the OS.  This lets the OS cache the entire
index in RAM.  Because the OS disk cache is very smart, you may be able
to run effectively with less RAM, especially if you use SSD.  If the
available OS disk cache is too small, performance will really suffer.

If Solr is not the only thing running on the machine, then you need to
add the RAM requirements of the other processes.  Those RAM requirements
may extend beyond the memory required for the processes themselves,
because other programs usually benefit from OS disk caching as well.

Running only Solr on the server is recommended.  If you are running in
SolrCloud mode, it's normal to also run one of the required zookeeper
instances on the same hardware, because zookeeper requirements are very
small.

Some basic information about RAM sizing can be found on this wiki page:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread Roman Chyla
Hi all,

What I find very 'sad' is that Lucene/SOLR contain all the necessary
components for handling multi-token synonyms; the Finite State Automaton
works perfectly for matching these items; the biggest problem is IMO the
old query parser, which splits things on spaces and doesn't know how to
be smarter.

THIS IS A LONG-TIME PROBLEM - THERE EXIST SEVERAL WORKING SOLUTIONS (but
none was committed...sigh, we are re-inventing wheel all the time...)

LUCENE-1622
LUCENE-4381
LUCENE-4499


The problem of synonym expansion is more difficult because of the parsing -
the default parsers are not flexible and they split on empty space -
recently I have proposed a solution which also makes multi-token
synonym expansion simple

this is the ticket:
https://issues.apache.org/jira/browse/LUCENE-5014

that query parser is able to split on spaces, then look back, do the second
pass to see whether to expand with synonyms - and even discover different
parse paths and construct different queries based on that. if you want to
see some complex examples, look at:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/analysis/TestAdsabsTypeFulltextParsing.java
-
eg. line 373, 483


Lucene/SOLR developers are already doing great work and have much to do -
they need help from everybody who is able to apply patch, test it and
report back to JIRA.

roman



On Wed, Jul 17, 2013 at 9:37 AM, dmarini  wrote:

> iorixxx,
>
> Thanks for pointing me in the direction of the QueryElevation component. If
> it did not require that the target documents be keyed by the unique key
> field it would be ideal, but since our Sku field is not the Unique field
> (we have an internal id which serves as the key while this is the
> client's key)
> it doesn't seem like it will match unless I make a larger scope change.
>
> Jack,
>
> I agree that out of the box there hasn't been a generalized solution for
> this yet. I guess what I'm looking for is confirmation that I've gone as far
> as I can properly and from this point need to consider using something like
> the HON custom query parser component (which we're leery of using because
> from my reading it solves a specific scenario that may overcompensate what
> we're attempting to fix). I would personally rather stay IN solr than add
> custom .jar files from around the web if at all possible.
>
> Thanks for the replies.
>
> --Dave
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Searching w/explicit Multi-Word Synonym Expansion

2013-07-17 Thread dmarini
iorixxx,

Thanks for pointing me in the direction of the QueryElevation component. If
it did not require that the target documents be keyed by the unique key
field it would be ideal, but since our Sku field is not the Unique field (we
have an internal id which serves as the key while this is the client's key)
it doesn't seem like it will match unless I make a larger scope change.

Jack,

I agree that out of the box there hasn't been a generalized solution for
this yet. I guess what I'm looking for is confirmation that I've gone as far
as I can properly and from this point need to consider using something like
the HON custom query parser component (which we're leery of using because
from my reading it solves a specific scenario that may overcompensate what
we're attempting to fix). I would personally rather stay IN solr than add
custom .jar files from around the web if at all possible.

Thanks for the replies.

--Dave





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-w-explicit-Multi-Word-Synonym-Expansion-tp4078469p4078610.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why "Sort" Doesn't Work?

2013-07-17 Thread Jack Krupansky
In general, sorting doesn't work well for multivalued and tokenized fields. 
You need to copy your tokenized url to a "url_str" string field and then
sort on that field.
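
In schema.xml that is something like the following sketch (the "url_str"
name is just an example):

<field name="url_str" type="string" indexed="true" stored="false"/>
<copyField source="url" dest="url_str"/>

...and then sort with &sort=url_str asc.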


-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Wednesday, July 17, 2013 5:54 AM
To: solr-user@lucene.apache.org
Subject: Why "Sort" Doesn't Work?

I run a query at my Solr 4.2.1 SolrCloud:

/solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc

result is as follows:

http://goethetc.blogspot.com/
http://about.deviantart.com/contact/
http://browse.deviantart.com/designbattle/
http://browse.deviantart.com/digitalart/
http://hayathepbahar.blogspot.com/
http://corporateoutfitter.cabelas.com/
http://german.alibaba.com/
...

url is defined as follows in my schema:

[the field definition was stripped by the archive; per the reply above,
url is a tokenized field]
...


*Why is it not sorted?*



shards param fails with cores

2013-07-17 Thread Dmitry Kan
Hi list,

I have set up two cores (=collections):

http://localhost:8983/solr/core0
http://localhost:8983/solr/core1

In addition the following has been set up:
http://localhost:8984/solr/core0
http://localhost:8984/solr/core1

I'm trying to query the first via the second like this:

http://localhost:8984/solr/core1/select?q=test&shards=localhost:8983/solr/core0

But an error comes as a response:

"Server at http://localhost:8983/solr returned non ok status:404,
message:Not Found"

What am I doing wrong?

Thanks,
Dmitry


Re: Search with punctuations

2013-07-17 Thread Jack Krupansky
Yes, the Word Delimiter filter does in fact break up a token into discrete 
words. In fact it seems antithetical that you are combining the keyword 
tokenizer that doesn't break up a string into words with the WDF that does.


Maybe you should drop back to standard tokenization coupled with the Edge 
n-gram token filter with a min and max of 3 so that it will index
"INTERNATIONAL" as itself plus "INT".


And then maybe add a regex char filter to combine "INT'L" into "INTL".
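
Concretely, that suggestion might look something like this sketch (the
type name, gram sizes and char filter pattern are illustrative assumptions,
not a drop-in config):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- collapse apostrophes so INT'L indexes as INTL (simplistic pattern) -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w+)'(\w+)" replacement="$1$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w+)'(\w+)" replacement="$1$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>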

-- Jack Krupansky

-Original Message- 
From: kobe.free.wo...@gmail.com

Sent: Wednesday, July 17, 2013 8:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Search with punctuations

Hi Erick,

I modified the SOLR schema file for the field as follows and re-indexed:

[the fieldType XML was stripped by the archive; per the discussion, it
used solr.KeywordTokenizerFactory plus solr.WordDelimiterFilterFactory]

My previous scenario seems to be working fine, i.e., when I search for
"INTL", I get both the records containing strings like "INTL" and "INT'L".
But I am not able to perform a STARTS WITH search, i.e., my schema field has
values like "INTERNATIONAL XYZ LOCAL" and "PLAY OF INTERNATIONAL XYZ"; when
I perform a STARTS WITH search for the keyword "INTERNATIONAL" it
returns both values, but ideally it should return only "INTERNATIONAL
XYZ LOCAL". To perform the STARTS WITH search I append "*" to the keyword,
i.e., the keyword in my case becomes "INTERNATIONAL*".

It seems that the STARTS WITH search has started behaving like CONTAINS
search. Please suggest me how should I achieve this scenario of performing
the STARTS WITH search on the same field type.

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510p4078591.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: How to optimize a search?

2013-07-17 Thread Erick Erickson
bq: "Rocket Banana (Single)" should be first because its the closest to "Rocket
Banana".

OK, you've given us nothing to go on here. "it's closest" doesn't mean
anything, it's just someone waving their hands and saying "because I
like it better".

I'm being deliberately obtuse here and trying to think like a
computer. Unless and until you can provide some measurable way to
define "closest", you have no actionable items.

It's actually quite difficult to define "better" query results. Walter
mentions A/B testing which is what lots of people fall back on. But
you'll _never_ get "ideal" results, the best I hope for is "good
enough".

Best
Erick

On Tue, Jul 16, 2013 at 5:25 PM, padcoe  wrote:
> "Rocket Banana (Single)" should be first because its the closest to "Rocket
> Banana".
>
> How can i get a "ideal" rank to return closests words in firsts position?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531p4078470.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: ACL implementation: Pseudo-join performance & Atomic Updates

2013-07-17 Thread Erick Erickson
Roman:

I think that SOLR-1913 is completely different. It's
about having a field in a document and being able
to do bitwise operations on it. So say I have a
field in a Solr doc with the value 6 in it. I can then
form a query like
{!bitwise field=myfield op=AND source=2}
and it would match.

You're talking about a much different operation as I
understand it.

In which case, go ahead and open up a JIRA, there's
no harm in it.

Best
Erick

On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla  wrote:
> Erick,
>
> I wasn't sure this issue is important, so I wanted first solicit some
> feedback. You and Otis expressed interest, and I could create the JIRA -
> however, as Alexandre, points out, the SOLR-1913 seems similar (actually,
> closer to the Otis request to have the elasticsearch named filter) but the
> SOLR-1913 was created in 2010 and is not integrated yet, so I am wondering
> whether this new feature (somewhat overlapping, but still different from
> SOLR-1913) is something people would really want and the effort on the JIRA
> is well spent. What's your view?
>
> Thanks,
>
>   roman
>
>
>
>
> On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch
> wrote:
>
>> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
>>
>> Regards,
>>Alex.
>>
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>
>>
>> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson > >wrote:
>>
>> > Roman:
>> >
>> > Did this ever make into a JIRA? Somehow I missed it if it did, and this
>> > would
>> > be pretty cool
>> >
>> > Erick
>> >
>> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla 
>> > wrote:
>> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca 
>> > wrote:
>> > >
>> > >> Hello Erick,
>> > >>
>> > >> > Join performance is most sensitive to the number of values
>> > >> > in the field being joined on. So if you have lots and lots of
>> > >> > distinct values in the corpus, join performance will be affected.
>> > >> Yep, we have a list of unique Id's that we get by first searching for
>> > >> records
>> > >> where loggedInUser IS IN (userIDs)
>> > >> This corpus is stored in memory I suppose? (not a problem) and then
>> the
>> > >> bottleneck is to match this huge set with the core where I'm
>> searching?
>> > >>
>> > >> Somewhere in maillist archive people were talking about "external list
>> > of
>> > >> Solr unique IDs"
>> > >> but didn't find if there is a solution.
>> > >> Back in 2010 Yonik posted a comment:
>> > >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
>> > >>
>> > >
>> > > sorry, haven't the previous thread in its entirety, but few weeks back
>> > that
>> > > Yonik's proposal got implemented, it seems ;)
>> > >
>> > >
>> >
>> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
>> > >
>> > > You could use this to send very large bitset filter (which can be
>> > > translated into any integers, if you can come up with a mapping
>> > function).
>> > >
>> > > roman
>> > >
>> > >
>> > >>
>> > >> > bq: I suppose the delete/reindex approach will not change soon
>> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
>> > >> Ah, ok, I was feeling it affects the architecture, ok, now the only
>> > hope is
>> > >> Pseudo-Joins ))
>> > >>
>> > >> > One way to deal with this is to implement a "post filter", sometimes
>> > >> called
>> > >> > a "no cache" filter.
>> > >> thanks, will have a look, but as you describe it, it's not the best
>> > option.
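For reference, a rough sketch of such a post filter against the Solr 4.x API
(the ACL set and its contents are hypothetical; the contract itself is real:
caching must be off and cost must be >= 100 for it to run after the other
filters):

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;

    public class AclPostFilter extends ExtendedQueryBase implements PostFilter {
        private final Set<Integer> allowedDocs;

        public AclPostFilter(Set<Integer> allowedDocs) {
            this.allowedDocs = allowedDocs;
            setCache(false);   // post filters must not be cached
            setCost(100);      // cost >= 100 => evaluated after cheaper filters
        }

        @Override
        public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
            return new DelegatingCollector() {
                @Override
                public void collect(int doc) throws IOException {
                    // docBase is the current segment's offset into the
                    // global docid space
                    if (allowedDocs.contains(docBase + doc)) {
                        super.collect(doc);
                    }
                }
            };
        }
    }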
>> > >>
>> > >> The approach
>> > >> "too many documents, man. Please refine your query. Partial results
>> > below"
>> > >> means faceting will not work correctly?
>> > >>
>> > >> ... I have in mind a hybrid approach, comments welcome:
>> > >> Most of the time users are not searching, but browsing content, so our
>> > >> "virtual filesystem" stored in SOLR will use only the index with the
>> Id
>> > of
>> > >> the file and the list of users that have access to it. i.e. not
>> touching
>> > >> the fulltext index at all.
>> > >>
>> > >> Files may have metadata (EXIF info for images for ex) that we'd like
>> to
>> > >> filter by, calculate facets.
>> > >> Meta will be stored in both indexes.
>> > >>
>> > >> In case of a fulltext query:
>> > >> 1. search FT index (the fulltext index), get only the number of search
>> > >> results, let it be Rf
>> > >> 2. search DAC index (the index with permissions), get number of search
>> > >> results, let it be Rd
>> > >>
>> > >> let maxR be the maximum size of the corpus for the pseudo-join.
>> > >> That was actually my question: what is a reasonable number? 10, 100,
>> > >> 1000?
>> > >>
>> > >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto
>> > the
>> > >> second one.
>> > >> this happens when (only a few documents contain

Re: Doc's FunctionQuery result field in my custom SearchComponent class ?

2013-07-17 Thread Erick Erickson
Where are you getting the syntax
freq:termfreq(product,'spider')
? Try just

termfreq(product,'spider')
you'll get an element in the doc labeled 'termfreq', at least
I do.

Best
Erick

On Tue, Jul 16, 2013 at 1:03 PM, Tony Mullins  wrote:
> OK, So thats why I cannot see the FunctionQuery fields in my
> SearchComponent class.
> So then question would be how can I apply my custom processing/logic to
> these FunctionQuery ? Whats the ExtensionPoint in Solr for such scenarios ?
>
> Basically I want to call termfreq() for each document and then apply the
> sum to all doc's termfreq() results and show in one aggregated TermFreq
> field in my query response.
>
> Thanks.
> Tony
>
>
>
> On Tue, Jul 16, 2013 at 6:01 PM, Jack Krupansky 
> wrote:
>
>> Basically, the evaluation of function queries in the "fl" parameter occurs
>> when the response writer is composing the document results. That's AFTER
>> all of the search components are done.
>>
>> SolrReturnFields.getTransformer() gets the DocTransformer, which is
>> really a DocTransformers, and then a call to DocTransformers.transform() in
>> each response writer will evaluate the embedded function queries and insert
>> their values in the results as they are being written.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Tony Mullins
>> Sent: Tuesday, July 16, 2013 1:37 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Doc's FunctionQuery result field in my custom SearchComponent
>> class ?
>>
>>
>> No sorry, I am still not getting the termfreq() field in my 'doc' object.
>> I do get the _version_ field in my 'doc' object which I think is
>> realValue=StoredField.
>>
>> At which point termfreq() or any other FunctionQuery field becomes the part
>> of doc object in Solr ? And at that point can I perform some custom logic
>> and append the response ?
>>
>> Thanks.
>> Tony
>>
>>
>>
>>
>>
>> On Tue, Jul 16, 2013 at 1:34 AM, Patanachai Tangchaisin <
>> patanachai.tangchaisin@wizecommerce.com>
>> wrote:
>>
>>  Hi,
>>>
>>> I think the process of retrieving a stored field (through fl) is happens
>>> after SearchComponent.
>>>
>>> One solution: If you wrap a q params with function your score will be a
>>> result of the function.
>>> For example,
>>>
>>> http://localhost:8080/solr/collection2/demoendpoint?q=termfreq%28product,%27spider%27%29&wt=xml&indent=true&fl=*,score
>>>
>>>
>>>
>>> Now your score is going to be a result of termfreq(product,'spider')
>>>
>>>
>>> --
>>> Patanachai Tangchaisin
>>>
>>>
>>>
>>> On 07/15/2013 12:01 PM, Tony Mullins wrote:
>>>
>>>  any help plz !!!


 On Mon, Jul 15, 2013 at 4:13 PM, Tony Mullins wrote:


  Please any help on how to get the value of 'freq' field in my custom

> SearchComponent ?
>
>
> http://localhost:8080/solr/collection2/demoendpoint?q=spider&wt=xml&indent=true&fl=*,freq:termfreq%28product,%27spider%27%29
>
>
> <doc>
>   <str name="id">11</str>
>   <str name="type">Video Games</str>
>   <str name="format">xbox 360</str>
>   <str name="product">The Amazing Spider-Man</str>
>   <long name="_version_">1439994081345273856</long>
>   <int name="freq">1</int>
> </doc>
>
>
>
> Here is my code
>
> DocList docs = rb.getResults().docList;
>  DocIterator iterator = docs.iterator();
>  int sumFreq = 0;
>  String id = null;
>
>  for (int i = 0; i < docs.size(); i++) {
>  try {
>  int docId = iterator.nextDoc();
>
> // Document doc = searcher.doc(docId, fieldSet);
>  Document doc = searcher.doc(docId);
>
> In doc object I can see the schema fields like 'id', 'type','format'
> etc.
> but I cannot find the field 'freq' which I needed. Is there any way to
> get
> the FunctionQuery fields in doc object ?
>
> Thanks,
> Tony
>
>
>
> On Mon, Jul 15, 2013 at 1:16 PM, Tony Mullins wrote:
>
>  Hi,
>
>>
>> I have extended Solr's SearchComponent class and I am iterating through
>> all the docs in ResponseBuilder in the @Override process() method.
>>
>> Here I want to get the value of the FunctionQuery result, but in the Document
>> object I am only seeing the standard fields of the document, not the
>> FunctionQuery result.

Re: Config changes in solr.DirectSolrSpellCheck after index is built?

2013-07-17 Thread Erick Erickson
I don't know the code well, but anything that
mentions "index based spellcheck" would
presumably require re-indexing.

But I'd also guess it depends on the changes.
Any changes to _how_ the index is _used_
shouldn't require re-indexing. But changing
how the tokens are put _into_ the index should.

But like I said, I'm speculating a bit.

Best
Erick
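For what it's worth, DirectSolrSpellChecker reads terms straight from the
main index at query time instead of building a sidecar index, so query-side
settings like these can usually be changed without a rebuild. A sketch based
on the stock example solrconfig.xml (the values here are illustrative):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">direct</str>
        <str name="field">text</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="distanceMeasure">internal</str>
        <float name="accuracy">0.5</float>
        <int name="maxEdits">2</int>
        <int name="minPrefix">1</int>
      </lst>
    </searchComponent>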

On Tue, Jul 16, 2013 at 10:48 AM, Brendan Grainger
 wrote:
> Hi All,
>
> Can you change the configuration of a spellchecker
> using solr.DirectSolrSpellCheck after you've built an index? I know that
> this spellchecker doesn't build an index off to the side like
> the IndexBasedSpellChecker so I'm wondering what's happening internally to
> create a spellchecking dictionary.
>
> Thanks
> Brendan
>
> --
> Brendan Grainger
> www.kuripai.com


Re: Search with punctuations

2013-07-17 Thread kobe.free.wo...@gmail.com
Hi Erick,

I modified the SOLR schema file for the field and re-indexed. (The schema
snippet itself was stripped from the archived message.)

My previous scenario seems to be working fine i.e., when I search for
"INTL", I get both the records containing string like "INTL" and "INT'L".
But I am not able to perform a STARTS WITH search, i.e., my schema field has
values like "INTERNATIONAL XYZ LOCAL" and "PLAY OF INTERNATIONAL XYZ"; when
I perform a STARTS WITH search for the keyword "INTERNATIONAL" it returns
both values, but ideally it should return only "INTERNATIONAL XYZ LOCAL".
To perform the STARTS WITH search I append "*" to the keyword, i.e., the
keyword in my case becomes "INTERNATIONAL*".

It seems that the STARTS WITH search has started behaving like a CONTAINS
search. Please suggest how I should achieve this scenario of performing
the STARTS WITH search on the same field type.
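(Since the original snippet is gone from the archive: a chain producing the
described matching, "INTL" hitting both "INTL" and "INT'L", might look
roughly like this; the type name and attribute values are my assumptions,
not the poster's actual config:

    <fieldType name="text_punct" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- splits INT'L into INT + L and also catenates them back to INTL -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Note that on a tokenized field a trailing-wildcard query matches any token
with that prefix anywhere in the value, which is exactly the CONTAINS-like
behavior described; a true STARTS WITH usually needs a second field based on
KeywordTokenizerFactory that keeps the whole value as a single token.)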

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510p4078591.html
Sent from the Solr - User mailing list archive at Nabble.com.


Avoid Solr Pivot Faceting Out of Memory / Shorter result for pivot faceting requests with facet.pivot.ngroup=true and facet.pivot.showLastList=false

2013-07-17 Thread Sandro Zbinden
Dear Usergroup


I am getting an out of memory exception in the following scenario.
I have 4 sql tables: patient, visit, study and image that will be denormalized 
for the solr index
The solr index looks like the following



| p_id | p_lastname | v_id | v_name  | ...
|------|------------|------|---------|----
|  1   | Miller     |  10  | Study 1 | ...
|  2   | Miller     |  11  | Study 2 | ...
|  2   | Miller     |  12  | Study 3 | ...  <-- duplication because of denormalization
|  3   | Smith      |  13  | Study 4 | ...

Now I am executing a facet query

q=*:*&facet=true&facet.pivot=p_lastname,p_id&facet.limit=-1

And I get the following result


<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">1</int>
      <int name="count">1</int>
    </lst>
    <lst>
      <str name="field">p_id</str>
      <int name="value">2</int>
      <int name="count">2</int>
    </lst>
  </arr>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <arr name="pivot">
    <lst>
      <str name="field">p_id</str>
      <int name="value">3</int>
      <int name="count">1</int>
    </lst>
  </arr>
</lst>




The goal is to show our clients a list of the group values and, in parentheses,
how many patients each group contains:
 - Miller (2)
 - Smith (1)

This is why we need to use the facet.pivot method with facet.limit=-1. It is,
as far as I know, the only way to get a grouping on two criteria.
And we need the pivot list to count how many patients are in a group.


Currently this works well on smaller indexes, but if we have around 1'000'000
patients and we execute a query like the one above we run into an out-of-memory
error. I figured out that the problem is not the calculation of the pivot but
the presentation of the result.
Because we load all fields (we cannot use facet.offset because we need to order
the results ascending and descending) the result can get really big.

To avoid this overload I created a change in the solr-core 
PivotFacetHandler.java class.
In the method doPivots i added the following code

   NamedList<Integer> nl = this.getTermCounts(subField);
   pivot.add("ngroups", nl.size());

This will give me the group size of the list.
Then I removed the recursion call pivot.add( "pivot", doPivots( nl, subField, 
nextField, fnames, subset) );
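Putting the described change together (a sketch against Solr 4.x's
PivotFacetHelper.doPivots, heavily abbreviated, not the full method):

    NamedList<Integer> nl = this.getTermCounts(subField);
    pivot.add("ngroups", nl.size());   // just the number of sub-groups
    // recursion removed, so the potentially huge sub-pivot list is never built:
    // pivot.add("pivot", doPivots(nl, subField, nextField, fnames, subset));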
Like this, my result looks like the following:


<lst>
  <str name="field">p_lastname</str>
  <str name="value">Miller</str>
  <int name="count">3</int>
  <int name="ngroups">2</int>
</lst>
<lst>
  <str name="field">p_lastname</str>
  <str name="value">Smith</str>
  <int name="count">1</int>
  <int name="ngroups">1</int>
</lst>



My question now is whether there is already something planned like
facet.pivot.ngroup=true and facet.pivot.showLastList=false to improve the
performance of pivot faceting.

Is there a chance we could get this into the Solr code? I think it's a really
small change to the code, but it could improve the product enormously.

Best Regards

Sandro Zbinden



RE: Why "Sort" Doesn't Work?

2013-07-17 Thread Markus Jelsma
No, just the usual score calculated by Lucene's Similarity impl. 
 
-Original message-
> From:Furkan KAMACI 
> Sent: Wednesday 17th July 2013 13:39
> To: solr-user@lucene.apache.org
> Subject: Re: Why "Sort" Doesn't Work?
> 
> Hi Markus;
> 
> What is that score? It is not listed at schema. Is it document boost?
> 
> 2013/7/17 Markus Jelsma 
> 
> > No, there is no bug in the schema, it is just an example and provides the
> > most common usage only; sort by score.
> >
> > -Original message-
> > > From:Furkan KAMACI 
> > > Sent: Wednesday 17th July 2013 12:10
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Why "Sort" Doesn't Work?
> > >
> > > Hi Markus;
> > >
> > > This is default schema at Nutch. Do you mean there is a bug with schema?
> > >
> > >
> > >
> > > 2013/7/17 Markus Jelsma 
> > >
> > > > Remove the WDF from the analysis chain, it's not going to work with
> > > > multiple tokens.
> > > >
> > > > -Original message-
> > > > > From:Furkan KAMACI 
> > > > > Sent: Wednesday 17th July 2013 11:55
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Why "Sort" Doesn't Work?
> > > > >
> > > > > I run a query at my Solr 4.2.1 SolrCloud:
> > > > >
> > > > > /solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc
> > > > >
> > > > > result is as follows:
> > > > >
> > > > > http://goethetc.blogspot.com/
> > > > > http://about.deviantart.com/contact/
> > > > > http://browse.deviantart.com/designbattle/
> > > > > http://browse.deviantart.com/digitalart/
> > > > > http://hayathepbahar.blogspot.com/
> > > > > http://corporateoutfitter.cabelas.com/
> > > > > http://german.alibaba.com/
> > > > > ...
> > > > >
> > > > > url if defined as follows at my schema:
> > > > >
> > > > > <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
> > > > >   <analyzer>
> > > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
> > > > >   </analyzer>
> > > > > </fieldType>
> > > > > ...
> > > > > <field name="url" type="url" stored="true" indexed="true" required="true"/>
> > > > >
> > > > > *Why it is not sorted?*
> > > > >
> > > >
> > >
> >
> 


Re: Why "Sort" Doesn't Work?

2013-07-17 Thread Furkan KAMACI
Hi Markus;

What is that score? It is not listed at schema. Is it document boost?

2013/7/17 Markus Jelsma 

> No, there is no bug in the schema, it is just an example and provides the
> most common usage only; sort by score.
>
> -Original message-
> > From:Furkan KAMACI 
> > Sent: Wednesday 17th July 2013 12:10
> > To: solr-user@lucene.apache.org
> > Subject: Re: Why "Sort" Doesn't Work?
> >
> > Hi Markus;
> >
> > This is default schema at Nutch. Do you mean there is a bug with schema?
> >
> >
> >
> > 2013/7/17 Markus Jelsma 
> >
> > > Remove the WDF from the analysis chain, it's not going to work with
> > > multiple tokens.
> > >
> > > -Original message-
> > > > From:Furkan KAMACI 
> > > > Sent: Wednesday 17th July 2013 11:55
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Why "Sort" Doesn't Work?
> > > >
> > > > I run a query at my Solr 4.2.1 SolrCloud:
> > > >
> > > > /solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc
> > > >
> > > > result is as follows:
> > > >
> > > > http://goethetc.blogspot.com/
> > > > http://about.deviantart.com/contact/
> > > > http://browse.deviantart.com/designbattle/
> > > > http://browse.deviantart.com/digitalart/
> > > > http://hayathepbahar.blogspot.com/
> > > > http://corporateoutfitter.cabelas.com/
> > > > http://german.alibaba.com/
> > > > ...
> > > >
> > > > url if defined as follows at my schema:
> > > >
> > > > <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer>
> > > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > > >     <filter class="solr.LowerCaseFilterFactory"/>
> > > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > > ...
> > > > <field name="url" type="url" stored="true" indexed="true" required="true"/>
> > > >
> > > > *Why it is not sorted?*
> > > >
> > >
> >
>


Re: <autoCommit> and performance

2013-07-17 Thread Ayman Plaha
Thanks Aditya, can I also please get some advice on hosting.

   - What *hosting specs* should I get? How much RAM? Consider that my
   client application is very simple: it just registers users to a database,
   queries SOLR and displays SOLR results.
   - A simple batch program adds the 1000 OR 2000 documents to SOLR every
   second.

I'm hoping to deploy the code next week, if you guys can give me any other
advice I'd really appreciate that.


On Wed, Jul 17, 2013 at 7:07 PM, Aditya wrote:

> Hi
>
> It will not affect the performance. We are doing this  regularly. If you do
> optimize and search then there may be some impact.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
> On Wed, Jul 17, 2013 at 12:52 PM, Ayman Plaha 
> wrote:
>
> > Hey Guys,
> >
> > I've finally finished my Spring Java application that uses SOLR for
> > searches and just had performance related question about SOLR. I'm
> indexing
> > exactly 1000 *OR* 2000 records every second. Every record having 13
> fields
> > including 'id'. Majority of the fields are solr.StrField (no filters)
> with
> > characters ranging from 5 - 50 in length and one field which is text_t
> > (solr.TextField) which can be of length 100 characters to 2000 characters
> > and has the following tokenizer and filters
> >
> >- PatternTokenizerFactory
> >- LowerCaseFilterFactory
> >- SynonymFilterFactory
> >- SnowballPorterFilterFactory.
> >
> >
> > I'm not using shards. I was hoping when searches get slow I will consider
> > this or should I consider this now ?
> >
> > *Questions:*
> >
> >- I'm using SOLR autoCommit (every 15 minutes) with openSearcher set
> as
> >true. I'm not using autoSoftCommit because instant availability of the
> >documents for search is not necessary and I don't want to chew up too
> > much
> >memory because I'm consider Cloud hosting.
>    <autoCommit>
>      <maxTime>900000</maxTime>
>      <openSearcher>true</openSearcher>
>    </autoCommit>
> >*will this effect the query performance of the client website if the
> >index grew to 10 million records ? I mean while the commit is
> happening
> >does that *effect the performance of queries* and how will this effect
> >the queries if the index grew to 10 million records ?
> >- What *hosting specs* should I get ? How much RAM ? Considering my
> >- client application is very simple that just register users to
> database
> >and queries SOLR and displays SOLR results.
> >- simple batch program adds the 1000 OR 2000 documents to SOLR every
> >second.
> >
> >
> > I'm hoping to deploy the code next week, if you guys can give me any
> other
> > advice I'd really appreciate that.
> >
> > Thanks
> > Ayman
> >
>


Re: HTTP Status 503 - Server is shutting down

2013-07-17 Thread Sandeep Gupta
Hi,

I think I will also wait for other people's replies, as I do not have much
idea about this now.
I suggested those things because I did them recently, but I have only one
collection (the default one).

As you said, and as I can guess, you have multiple collections like tt, shop
and homes in one solr instance.
By default all the collections should go inside the solr dir (tomcat\solr),
and you may need to modify the solr.xml file (tomcat\solr\solr.xml).
See below.

There is another xml file, which I have also named solr.xml
(\tomcat\conf\localhost\solr.xml), which holds the solr home path;
on startup, tomcat reads this file after host-manager.xml.
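Roughly, and only as an assumption about this particular setup (core names
and Windows paths invented; the original snippets were stripped by the
archive), the two files could look like:

    <!-- tomcat\solr\solr.xml : pre-4.4 style core registry -->
    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="tt" instanceDir="tt"/>
        <core name="shop" instanceDir="shop"/>
        <core name="homes" instanceDir="homes"/>
      </cores>
    </solr>

    <!-- tomcat\conf\localhost\solr.xml : context descriptor giving Tomcat
         the war location and the solr home -->
    <Context docBase="C:\path\to\solr-4.3.1.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="C:\path\to\tomcat\solr" override="true"/>
    </Context>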

Thanks
-Sandeep



On Wed, Jul 17, 2013 at 3:40 PM, PeterKerk  wrote:

> I can now approach http://localhost:8080/solr-4.3.1/#/, thanks!!
>
> I also noticed you mentioning something about a data import handler.
>
> Now, what I will be requiring after I've completed the basic setup of
> Tomcat6 and Solr431 I want to migrate my Solr350 (now running on Cygwin)
> cores to that environment.
>
> C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\tt
> C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\shop
> C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\homes
>
> Where do I need to copy the above cores for this all to work?
> What I don't understand is how Tomcat knows where it can find my Solr 4.3.1
> folder, in my case C:\Dropbox\Databases\solr-4.3.1, is that folder even any
> longer required?
>
> Many thanks again! :)
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/HTTP-Status-503-Server-is-shutting-down-tp4065958p4078567.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>





RE: Why "Sort" Doesn't Work?

2013-07-17 Thread Markus Jelsma
No, there is no bug in the schema, it is just an example and provides the most 
common usage only; sort by score.
 
-Original message-
> From:Furkan KAMACI 
> Sent: Wednesday 17th July 2013 12:10
> To: solr-user@lucene.apache.org
> Subject: Re: Why "Sort" Doesn't Work?
> 
> Hi Markus;
> 
> This is default schema at Nutch. Do you mean there is a bug with schema?
> 
> 
> 
> 2013/7/17 Markus Jelsma 
> 
> > Remove the WDF from the analysis chain, it's not going to work with
> > multiple tokens.
> >
> > -Original message-
> > > From:Furkan KAMACI 
> > > Sent: Wednesday 17th July 2013 11:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Why "Sort" Doesn't Work?
> > >
> > > I run a query at my Solr 4.2.1 SolrCloud:
> > >
> > > /solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc
> > >
> > > result is as follows:
> > >
> > > http://goethetc.blogspot.com/
> > > http://about.deviantart.com/contact/
> > > http://browse.deviantart.com/designbattle/
> > > http://browse.deviantart.com/digitalart/
> > > http://hayathepbahar.blogspot.com/
> > > http://corporateoutfitter.cabelas.com/
> > > http://german.alibaba.com/
> > > ...
> > >
> > > url if defined as follows at my schema:
> > >
> > > <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer>
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
> > >   </analyzer>
> > > </fieldType>
> > > ...
> > > <field name="url" type="url" stored="true" indexed="true" required="true"/>
> > >
> > > *Why it is not sorted?*
> > >
> >
> 


RE: Where to specify numShards when startup up a cloud setup

2013-07-17 Thread Robert Stewart
Yes, thanks Shawn.  I know I can use the collections HTTP API to set the number
of shards, but the problem with that is that it is not easily scriptable so that
the entire cluster can be set up in an automated fashion - the script(s) will
need to wait until the SOLR nodes are up and running before using the
collections API.  The information I want is: is there some "configuration" way
to set numShards (such as in solr.xml, etc. - or by sending some data to the
zookeeper API)?  I am guessing the answer is still no.

Thanks.


From: Shawn Heisey [s...@elyograg.org]
Sent: Tuesday, July 16, 2013 6:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Where to specify numShards when startup up a cloud setup

On 7/16/2013 3:36 PM, Robert Stewart wrote:
> I want to script the creation of N solr cloud instances (on ec2).
>
> But its not clear to me where I would specify numShards setting.
>  From documentation, I see you can specify on the "first node" you start up, 
> OR alternatively, use the "collections" API to create a new collection - but 
> in that case you need first at least one running SOLR instance.  I want to 
> push all solr instances with similar configuration onto N instances and just 
> run them with some number of shards pre-set somehow.  Where can I put 
> numShards configuration setting?
>
> What I want to do:
>
> 1) push solr configuration to zookeeper ensemble using zkCli command-line 
> tool.
> 2) create N instances of SOLR running on Ec2, pointing to the same zookeeper
> 3) start all SOLR instances which will become a cloud setup with M shards 
> (where Mhttp://zookeeper.apache.org

2) Construct a zkHost parameter for your ZK ensemble.  An example is
below using the default zookeeper port of 2181.  You'd need to use the
proper port numbers, names, etc.  The /chroot part is optional, but
highly recommended.  Use a name that has meaning for your SolrCloud
cluster rather than chroot:

-DzkHost=server1:2181,server2:2181,server3:2181/chroot

By using the /chroot syntax, you can run more than one SolrCloud cluster
on your zookeeper ensemble.  Just use a different value for each cluster.
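For instance (hypothetical names), two independent clusters could share one
ensemble like this:

    -DzkHost=server1:2181,server2:2181,server3:2181/prodcloud
    -DzkHost=server1:2181,server2:2181,server3:2181/testcloud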

3) Start Solr with the same zkHost parameter on every Solr host,
referring to the three zookeeper hosts already set up.  You can use the
same hosts for Solr as you did for zookeeper.

4) Use the zkcli script in example/cloud-scripts to upload a
configuration set to zookeeper using the "upconfig" command.  If you
aren't using the Solr example or a custom install based on the example,
then you'll need to examine the script to figure out how to run the java
command manually and have it find the solr and zookeeper jars.

5) Use the Collections API to create a collection, referencing the
uploaded config set and including additional parameters like numShards.
  If you have four Solr hosts, the following API call would work perfectly:

http://server:port/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=mycfg

Thanks,
Shawn



Re: HTTP Status 503 - Server is shutting down

2013-07-17 Thread PeterKerk
I can now approach http://localhost:8080/solr-4.3.1/#/, thanks!!

I also noticed you mentioning something about a data import handler. 

Now, what I will be requiring after I've completed the basic setup of
Tomcat6 and Solr431 I want to migrate my Solr350 (now running on Cygwin)
cores to that environment. 

C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\tt 
C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\shop 
C:\Dropbox\Databases\apache-solr-3.5.0\example\example-DIH\solr\homes 

Where do I need to copy the above cores for this all to work?
What I don't understand is how Tomcat knows where it can find my Solr 4.3.1
folder, in my case C:\Dropbox\Databases\solr-4.3.1, is that folder even any
longer required?

Many thanks again! :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/HTTP-Status-503-Server-is-shutting-down-tp4065958p4078567.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why "Sort" Doesn't Work?

2013-07-17 Thread Furkan KAMACI
Hi Markus;

This is default schema at Nutch. Do you mean there is a bug with schema?



2013/7/17 Markus Jelsma 

> Remove the WDF from the analysis chain, it's not going to work with
> multiple tokens.
>
> -Original message-
> > From:Furkan KAMACI 
> > Sent: Wednesday 17th July 2013 11:55
> > To: solr-user@lucene.apache.org
> > Subject: Why "Sort" Doesn't Work?
> >
> > I run a query at my Solr 4.2.1 SolrCloud:
> >
> > /solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc
> >
> > result is as follows:
> >
> > http://goethetc.blogspot.com/
> > http://about.deviantart.com/contact/
> > http://browse.deviantart.com/designbattle/
> > http://browse.deviantart.com/digitalart/
> > http://hayathepbahar.blogspot.com/
> > http://corporateoutfitter.cabelas.com/
> > http://german.alibaba.com/
> > ...
> >
> > url if defined as follows at my schema:
> >
> > <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
> >   </analyzer>
> > </fieldType>
> > ...
> > <field name="url" type="url" stored="true" indexed="true" required="true"/>
> >
> > *Why it is not sorted?*
> >
>


RE: Why "Sort" Doesn't Work?

2013-07-17 Thread Markus Jelsma
Remove the WDF from the analysis chain, it's not going to work with multiple 
tokens.
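A common fix, sketched under my own assumptions (the extra field names are
invented, not from the Nutch schema): keep the tokenized url field for
searching and sort on an untokenized copy instead.

    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <field name="url_sort" type="string" stored="false" indexed="true"/>
    <copyField source="url" dest="url_sort"/>

Then query with &sort=url_sort asc.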
 
-Original message-
> From:Furkan KAMACI 
> Sent: Wednesday 17th July 2013 11:55
> To: solr-user@lucene.apache.org
> Subject: Why "Sort" Doesn't Work?
> 
> I run a query at my Solr 4.2.1 SolrCloud:
> 
> /solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc
> 
> result is as follows:
> 
> http://goethetc.blogspot.com/
> http://about.deviantart.com/contact/
> http://browse.deviantart.com/designbattle/
> http://browse.deviantart.com/digitalart/
> http://hayathepbahar.blogspot.com/
> http://corporateoutfitter.cabelas.com/
> http://german.alibaba.com/
> ...
> 
> url if defined as follows at my schema:
> 
> <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
>   </analyzer>
> </fieldType>
> ...
> <field name="url" type="url" stored="true" indexed="true" required="true"/>
> 
> *Why it is not sorted?*
> 


Why "Sort" Doesn't Work?

2013-07-17 Thread Furkan KAMACI
I run a query at my Solr 4.2.1 SolrCloud:

/solr/select?q=*:*&rows=300&wt=csv&fl=url&sort=url asc

result is as follows:

http://goethetc.blogspot.com/
http://about.deviantart.com/contact/
http://browse.deviantart.com/designbattle/
http://browse.deviantart.com/digitalart/
http://hayathepbahar.blogspot.com/
http://corporateoutfitter.cabelas.com/
http://german.alibaba.com/
...

url is defined as follows in my schema:

<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>
...
<field name="url" type="url" stored="true" indexed="true" required="true"/>

*Why it is not sorted?*


Re: <autoCommit> and performance

2013-07-17 Thread Aditya
Hi

It will not affect the performance. We are doing this  regularly. If you do
optimize and search then there may be some impact.

Regards
Aditya
www.findbestopensource.com



On Wed, Jul 17, 2013 at 12:52 PM, Ayman Plaha  wrote:

> Hey Guys,
>
> I've finally finished my Spring Java application that uses SOLR for
> searches and just had performance related question about SOLR. I'm indexing
> exactly 1000 *OR* 2000 records every second. Every record having 13 fields
> including 'id'. Majority of the fields are solr.StrField (no filters) with
> characters ranging from 5 - 50 in length and one field which is text_t
> (solr.TextField) which can be of length 100 characters to 2000 characters
> and has the following tokenizer and filters
>
>- PatternTokenizerFactory
>- LowerCaseFilterFactory
>- SynonymFilterFactory
>- SnowballPorterFilterFactory.
>
>
> I'm not using shards. I was hoping when searches get slow I will consider
> this or should I consider this now ?
>
> *Questions:*
>
>- I'm using SOLR autoCommit (every 15 minutes) with openSearcher set as
>true. I'm not using autoSoftCommit because instant availability of the
>documents for search is not necessary and I don't want to chew up too
> much
>memory because I'm consider Cloud hosting.
>    <autoCommit>
>      <maxTime>900000</maxTime>
>      <openSearcher>true</openSearcher>
>    </autoCommit>
>*will this effect the query performance of the client website if the
>index grew to 10 million records ? I mean while the commit is happening
>does that *effect the performance of queries* and how will this effect
>the queries if the index grew to 10 million records ?
>- What *hosting specs* should I get ? How much RAM ? Considering my
>- client application is very simple that just register users to database
>and queries SOLR and displays SOLR results.
>- simple batch program adds the 1000 OR 2000 documents to SOLR every
>second.
>
>
> I'm hoping to deploy the code next week, if you guys can give me any other
> advice I'd really appreciate that.
>
> Thanks
> Ayman
>


Re: Switching to using SolrCloud with tomcat7 and embedded zookeeper

2013-07-17 Thread Furkan KAMACI
If you are not defining global Java startup parameters, do not include them
in setenv.sh. Pass those arguments as parameters when you start up your jar.


2013/7/17 smanad 

> Originally i was running a single solr 4.3 instance with 4 cores ... and
> now
> starting to learn about solrCloud and thought I could setup number of
> shards=1 (since its a single instance) and same 4 cores can be converted to
> 4 collections on the same single shard same single instance.
>
> How do I define each -Dcollection.configName as a part of setenv.sh?
> should I point my -Dbootstrap_confdir to parent directory instead?
>
> -Manasi
>
>
> Daniel Collins wrote
> > You've specified bootstrap_confdir and the same collection.configName on
> > all your cores, so as each of them start, each will be uploading its own
> > configuration to the collection1_conf area of ZK, so they will all be
> > overwriting each other.
> >
> > Are your 4 cores replicas of the same collection or are they distinct
> > collections?  If the latter, then why put them all in SolrCloud (there
> > really isn't any benefit?) but assuming you wanted to do it for expansion
> > reasons (to add replicas later on), then each one will need to have a
> > distinct collection.configName parameter, so that ZK knows to keep the
> > configs separate.
> >
> >
> >
> > On 17 July 2013 07:44, smanad <
>
> > smanad@
>
> > > wrote:
> >
> >> I am running solr 4.3 with tomcat 7 (with non SolrCloud) and have 4 solr
> >> cores running.
> >>
> >> Switching to start using SolrCloud with tomcat7 and embedded zookeeper I
> >> updated JAVA_OPTS in this file tomcat7/bin/setenv.sh to following,
> >>
> >> JAVA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx4096m
> >> -XX:+UseConcMarkSweepGC -Dbootstrap_confdir=/
> > 
> > /solr/collection1/conf/
> >> -Dcollection.configName=collection1_conf -DnumShards=1 -DzkRun"
> >>
> >> Now, all my cores (collections) are set to the default collection1
> >> schema.
> >> Not sure why?
> >>
> >> solr.xml is set to correct instanceDir settings.
> >>
> >> Any pointers?
> >> Thanks,
> >> -Manasi
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Switching-to-using-SolrCloud-with-tomcat7-and-embedded-zookeeper-tp4078524.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Switching-to-using-SolrCloud-with-tomcat7-and-embedded-zookeeper-tp4078524p4078538.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Switching to using SolrCloud with tomcat7 and embedded zookeeper

2013-07-17 Thread smanad
Originally I was running a single solr 4.3 instance with 4 cores ... and now,
starting to learn about solrCloud, I thought I could set up number of
shards=1 (since it's a single instance) and the same 4 cores could be converted
to 4 collections on the same single shard, same single instance.

How do I define each -Dcollection.configName as a part of setenv.sh?
Should I point my -Dbootstrap_confdir to the parent directory instead?

-Manasi


Daniel Collins wrote
> You've specified bootstrap_confdir and the same collection.configName on
> all your cores, so as each of them start, each will be uploading its own
> configuration to the collection1_conf area of ZK, so they will all be
> overwriting each other.
> 
> Are your 4 cores replicas of the same collection or are they distinct
> collections?  If the latter, then why put them all in SolrCloud (there
> really isn't any benefit?) but assuming you wanted to do it for expansion
> reasons (to add replicas later on), then each one will need to have a
> distinct collection.configName parameter, so that ZK knows to keep the
> configs separate.
> 
> 
> 
> On 17 July 2013 07:44, smanad <

> smanad@

> > wrote:
> 
>> I am running solr 4.3 with tomcat 7 (with non SolrCloud) and have 4 solr
>> cores running.
>>
>> Switching to start using SolrCloud with tomcat7 and embedded zookeeper I
>> updated JAVA_OPTS in this file tomcat7/bin/setenv.sh to following,
>>
>> JAVA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx4096m
>> -XX:+UseConcMarkSweepGC -Dbootstrap_confdir=/
> 
> /solr/collection1/conf/
>> -Dcollection.configName=collection1_conf -DnumShards=1 -DzkRun"
>>
>> Now, all my cores (collections) are set to the default collection1
>> schema.
>> Not sure why?
>>
>> solr.xml is set to correct instanceDir settings.
>>
>> Any pointers?
>> Thanks,
>> -Manasi
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Switching-to-using-SolrCloud-with-tomcat7-and-embedded-zookeeper-tp4078524.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Switching-to-using-SolrCloud-with-tomcat7-and-embedded-zookeeper-tp4078524p4078538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Clearing old nodes from zookeper without restarting solrcloud cluster

2013-07-17 Thread Andre Bois-Crettez

Indeed we are using UNLOAD of cores before shutting down extra replica
nodes; it works well but, as already said, it needs such nodes to be up.
Once UNLOADed it is possible to stop them, which works well for our use case.

But if nodes are already down, maybe it is possible to manually create
and upload a cleaned /clusterstate.json to Zookeeper ?
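A sketch of that manual route with Solr's zkcli script (the zkhost and file
paths are examples; back the file up before hand-editing it):

    cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd getfile /clusterstate.json /tmp/clusterstate.json
    # remove the entries for the dead nodes from /tmp/clusterstate.json by hand
    cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd putfile /clusterstate.json /tmp/clusterstate.json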


André

On 07/16/2013 11:18 PM, Marcin Rzewucki wrote:

Unloading a core is the known way to unregister a solr node in zookeeper
(and not use for further querying). It works for me. If you didn't do that
like this, unused nodes may remain in the cluster state and Solr may try to
use them without a success. I'd suggest to start some machine with the old
name, run solr, join the cluster for a while, unload a core to unregister
it from the cluster and shutdown host at the end. This way you could have
clear cluster state.



On 16 July 2013 14:41, Luis Carlos Guerrero Covo
wrote:


Thanks, I was actually asking about deleting nodes from the cluster state
not cores, unless you can unload cores specific to an already offline node
from zookeeper.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively
for their addressees. If you are not the addressee of this message, please
destroy it and notify the sender.


<autoCommit> and performance

2013-07-17 Thread Ayman Plaha
Hey Guys,

I've finally finished my Spring Java application that uses SOLR for
searches and just had performance related question about SOLR. I'm indexing
exactly 1000 *OR* 2000 records every second. Every record having 13 fields
including 'id'. Majority of the fields are solr.StrField (no filters) with
characters ranging from 5 - 50 in length and one field which is text_t
(solr.TextField) which can be of length 100 characters to 2000 characters
and has the following tokenizer and filters

   - PatternTokenizerFactory
   - LowerCaseFilterFactory
   - SynonymFilterFactory
   - SnowballPorterFilterFactory.
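
A field type wiring up that chain might look roughly like this (my sketch;
the type name, pattern and filter options are assumptions, not the actual
config):

    <fieldType name="text_t" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>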


I'm not using shards. I was planning to consider sharding once searches get
slow. Or should I consider it now?

*Questions:*

   - I'm using SOLR autoCommit (every 15 minutes) with openSearcher set as
   true. I'm not using autoSoftCommit because instant availability of the
   documents for search is not necessary and I don't want to chew up too much
   memory because I'm considering Cloud hosting.
   <autoCommit>
     <maxTime>900000</maxTime>
     <openSearcher>true</openSearcher>
   </autoCommit>
   Will this affect the query performance of the client website if the
   index grew to 10 million records? I mean, while the commit is happening,
   does that *affect the performance of queries*, and how will this affect
   the queries if the index grew to 10 million records?
   - What *hosting specs* should I get? How much RAM? Consider that my
   client application is very simple: it just registers users to a database,
   queries SOLR and displays SOLR results.
   - A simple batch program adds the 1000 OR 2000 documents to SOLR every
   second.


I'm hoping to deploy the code next week, if you guys can give me any other
advice I'd really appreciate that.

Thanks
Ayman


Re: Switching to using SolrCloud with tomcat7 and embedded zookeeper

2013-07-17 Thread Daniel Collins
You've specified bootstrap_confdir and the same collection.configName on
all your cores, so as each of them start, each will be uploading its own
configuration to the collection1_conf area of ZK, so they will all be
overwriting each other.

Are your 4 cores replicas of the same collection or are they distinct
collections?  If the latter, then why put them all in SolrCloud (there
really isn't any benefit?) but assuming you wanted to do it for expansion
reasons (to add replicas later on), then each one will need to have a
distinct collection.configName parameter, so that ZK knows to keep the
configs separate.
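One way to keep them separate without bootstrap parameters (a sketch; paths
and config names are examples) is to upload each configuration explicitly
with Solr's zkcli and reference it per collection:

    cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig \
        -confdir /path/to/solr/tt/conf -confname tt_conf
    cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig \
        -confdir /path/to/solr/shop/conf -confname shop_conf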



On 17 July 2013 07:44, smanad  wrote:

> I am running solr 4.3 with tomcat 7 (with non SolrCloud) and have 4 solr
> cores running.
>
> Switching to start using SolrCloud with tomcat7 and embedded zookeeper I
> updated JAVA_OPTS in this file tomcat7/bin/setenv.sh to following,
>
> JAVA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx4096m
> -XX:+UseConcMarkSweepGC -Dbootstrap_confdir=//solr/collection1/conf/
> -Dcollection.configName=collection1_conf -DnumShards=1 -DzkRun"
>
> Now, all my cores (collections) are set to the default collection1 schema.
> Not sure why?
>
> solr.xml is set to correct instanceDir settings.
>
> Any pointers?
> Thanks,
> -Manasi
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Switching-to-using-SolrCloud-with-tomcat7-and-embedded-zookeeper-tp4078524.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Getting numDocs and pendingDocs in Solr4.3

2013-07-17 Thread Federico Ragona
Thank you very much, I found all the information I needed!

Kind regards,
Federico

On 07/15/2013 04:53 PM, Shawn Heisey wrote:

On 7/15/2013 3:08 AM, Federico Ragona wrote:
> Hi,
>
> I'm trying to write a validation test that reads some statistics by
> querying
> Solr 4.3 via HTTP, namely the number of indexed documents (`numDocs`)
> and the
> number of pending documents (`pendingDocs`) from the Solr4 cluster. I
> believe
> that in Solr3 there was a `stats.jsp` page that offered both numbers.
>
> Is there a way to get both fields in Solr4?

Solr4 should have all the stats that Solr3 has and then some.

If you select your core from the core selector, then click on Plugins /
Stats, click on UPDATEHANDLER, then open updateHandler on the right, I
think you'll find at least some of what you were looking for.  Other
parts of what you were looking for might be found on the Overview for
the core.

If you have the default core named "collection1" then a URL like this
one will get you there.  You can replace "collection1" with the name of
your core.  The "/#/" in this URL indicates that it is part of the admin
UI, not something you'd want to query in a program:

http://server:port/solr/#/collection1/plugins/updatehandler?entry=updateHandler

The admin UI gathers most of its core-level information from the mbeans
handler found in the core itself.  The following URL is suitable for
querying in a program.  Note the "collection1" in this URL as well:

http://server:port/solr/collection1/admin/mbeans?stats=true

This will default to XML output.  Like most things in Solr, if you add
&wt=json to the URL, you'll get JSON format.  You can also add
&indent=true for human readability.
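Putting the pieces together, a program could poll, for example:

    http://server:port/solr/collection1/admin/mbeans?stats=true&wt=json&indent=true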

Thanks,
Shawn