Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-08 Thread Bernd Fehling
Hi Greg,

thanks a lot, that's it.
After setting q.op to OR it works _nearly_ as before with 4.10.4.

But how stupid is this?
I have <solrQueryParser defaultOperator="AND"/> in my schema
and also had q.op set to AND to make sure my default _is_ AND,
meant as the conjunction between terms.
But now I have q.op set to OR and defaultOperator in the schema set to AND
just to get _nearly_ my old behavior back.

The schema has the following comment:
"... The default is OR, which is generally assumed so it is
not a good idea to change it globally here.  The "q.op" request
parameter takes precedence over this. ..."

What I don't understand is why they changed such major internals
without giving any notice about how to keep the old parsing behavior.

From my point of view the old parsing behavior was correct.
When searching for terms without an operator it is always OR; otherwise
you can add "+" or "-" to modify that. Now, with q.op=AND, each term is
modified to "+", i.e. a MUST.

I still get some differences in search results between 4.10.4 and 5.5.3.
What other side effects does this change of q.op from AND to OR have in
other parts of query handling, parsing and searching?

Regards
Bernd

Am 09.09.2016 um 05:43 schrieb Greg Pendlebury:
> I forgot to mention the tickets:
> SOLR-2649 and SOLR-8812
> 
> On 9 September 2016 at 13:38, Greg Pendlebury 
> wrote:
> 
>> Under 4.10 q.op was ignored by the edismax parser and always forced to OR.
>> 5.5 is looking at the q.op=AND you requested.
>>
>> There are also some changes to the default values selected for mm, but I
>> doubt those apply here since you are setting it explicitly.
>>
>> On 8 September 2016 at 00:35, Mikhail Khludnev  wrote:
>>
>>> I suppose
>>>+((text:star text:trek)~2)
>>> and
>>>   +(+text:star +text:trek)
>>> are equal. mm=2 is equal to +foo +bar
>>>
>>> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
 Hi list,

 while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
>>> parsing.
 4.10.4
 text:star text:trek
   text:star text:trek
   (+((text:star text:trek)~2))/no_coord
   +((text:star text:trek)~2)

 5.5.3
 text:star text:trek
   text:star text:trek
   (+(+text:star +text:trek))/no_coord
   +(+text:star +text:trek)

 There are many new features and changes between these two versions.
 It looks like a change in query parsing.
 Can someone point me to the solr or lucene jira about the changes?
 Or even give a hint how to get my "old" query parsing back?

 Regards
 Bernd

>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>>


Solr Collection Create API queries

2016-09-08 Thread Swathi Singamsetty
Hi Team,

To implement the feature "Persist and use the
replicationFactor, maxShardsPerNode at Collection & Shard level" I am following
the steps mentioned in the jira ticket
https://issues.apache.org/jira/browse/SOLR-4808.

I used the "smartCloud" and "autoManageCluster" properties in the create
collection API to allow the overseer to bring up the minimum number of
replicas for each shard as per the replicationFactor set. But these 2
properties did not persist in the cluster state. Could someone let me know
how to use these properties in this feature?



Thanks & Regards,
Swathi.


Re: High load, frequent updates, low latency requirement use case

2016-09-08 Thread Erick Erickson
Use the SolrJ CloudSolrClient class and use the client.add(doclist) form.

Best,
Erick
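
"Bulking" here just means batching: collect documents into a list and send the whole list in one client.add(doclist) call instead of one HTTP request per document. A minimal, hedged sketch of only the batching logic in plain Java (the batch size of 1000 is an arbitrary choice; in real code each batch would be a List of SolrInputDocument passed to CloudSolrClient.add, with a commit or commitWithin at the end):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    // Chunk a document list into fixed-size batches. In real indexing code,
    // each returned batch would go to CloudSolrClient.add(batch) as Erick
    // suggests, rather than calling add() once per document.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(new ArrayList<>(docs.subList(i, Math.min(i + batchSize, docs.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(i);
        List<List<Integer>> batches = partition(ids, 1000);
        System.out.println(batches.size()); // 3 batches (1000 + 1000 + 500 docs)
    }
}
```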

On Thu, Sep 8, 2016 at 8:56 PM, Brent  wrote:
> Emir Arnautovic wrote
>> There should be no problems with ingestion on 24 machines. Assuming 1
>> replication, that is roughly 40 doc/sec/server. Make sure you bulk docs
>> when ingesting.
>
> What is bulking docs, and how do I do it? I'm guessing this is some sort of
> batch loading of documents?
>
> Thanks for the reply.
> -Brent
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/High-load-frequent-updates-low-latency-requirement-use-case-tp4293383p4295225.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: High load, frequent updates, low latency requirement use case

2016-09-08 Thread Brent
Emir Arnautovic wrote
> There should be no problems with ingestion on 24 machines. Assuming 1 
> replication, that is roughly 40 doc/sec/server. Make sure you bulk docs 
> when ingesting.

What is bulking docs, and how do I do it? I'm guessing this is some sort of
batch loading of documents?

Thanks for the reply.
-Brent



--
View this message in context: 
http://lucene.472066.n3.nabble.com/High-load-frequent-updates-low-latency-requirement-use-case-tp4293383p4295225.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-08 Thread Greg Pendlebury
I forgot to mention the tickets:
SOLR-2649 and SOLR-8812

On 9 September 2016 at 13:38, Greg Pendlebury 
wrote:

> Under 4.10 q.op was ignored by the edismax parser and always forced to OR.
> 5.5 is looking at the q.op=AND you requested.
>
> There are also some changes to the default values selected for mm, but I
> doubt those apply here since you are setting it explicitly.
>
> On 8 September 2016 at 00:35, Mikhail Khludnev  wrote:
>
>> I suppose
>>+((text:star text:trek)~2)
>> and
>>   +(+text:star +text:trek)
>> are equal. mm=2 is equal to +foo +bar
>>
>> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
>> bernd.fehl...@uni-bielefeld.de> wrote:
>>
>> > Hi list,
>> >
>> > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
>> parsing.
>> > 4.10.4
>> > text:star text:trek
>> >   text:star text:trek
>> >   (+((text:star text:trek)~2))/no_coord
>> >   +((text:star text:trek)~2)
>> >
>> > 5.5.3
>> > text:star text:trek
>> >   text:star text:trek
>> >   (+(+text:star +text:trek))/no_coord
>> >   +(+text:star +text:trek)
>> >
>> > There are many new features and changes between these two versions.
>> > It looks like a change in query parsing.
>> > Can someone point me to the solr or lucene jira about the changes?
>> > Or even give a hint how to get my "old" query parsing back?
>> >
>> > Regards
>> > Bernd
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>


Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-08 Thread Greg Pendlebury
Under 4.10 q.op was ignored by the edismax parser and always forced to OR.
5.5 is looking at the q.op=AND you requested.

There are also some changes to the default values selected for mm, but I
doubt those apply here since you are setting it explicitly.
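
Putting this together with Mikhail's note below that mm=2 equals +foo +bar, a sketch of edismax request parameters that should approximate the old 4.10 parse for a two-term query (the field and query values are illustrative; mm=2 only fits a two-term query, and in practice mm is usually given as a percentage):

```
defType=edismax
qf=text
q=star trek
q.op=OR
mm=2
debugQuery=true
```

With debugQuery=true the parsed query can be compared directly against the 4.10.4 output quoted in this thread.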

On 8 September 2016 at 00:35, Mikhail Khludnev  wrote:

> I suppose
>+((text:star text:trek)~2)
> and
>   +(+text:star +text:trek)
> are equal. mm=2 is equal to +foo +bar
>
> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
> > Hi list,
> >
> > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
> parsing.
> > 4.10.4
> > text:star text:trek
> >   text:star text:trek
> >   (+((text:star text:trek)~2))/no_coord
> >   +((text:star text:trek)~2)
> >
> > 5.5.3
> > text:star text:trek
> >   text:star text:trek
> >   (+(+text:star +text:trek))/no_coord
> >   +(+text:star +text:trek)
> >
> > There are many new features and changes between these two versions.
> > It looks like a change in query parsing.
> > Can someone point me to the solr or lucene jira about the changes?
> > Or even give a hint how to get my "old" query parsing back?
> >
> > Regards
> > Bernd
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


shingle query matching keyword tokenized field

2016-09-08 Thread Gandham, Satya
Can anyone help with this question that I posted on stackOverflow.

http://stackoverflow.com/questions/39399321/solr-shingle-query-matching-keyword-tokenized-field

Thanks in advance.


Re: Solr [Streaming Expressions/Parallel SQL Interface] Not supporting Multi Value using mapReduce option

2016-09-08 Thread Erick Erickson
It's the same problem. The "GROUP BY" has to sort the returned rows in order
to partition the result docs (and thus do aggregations by group). We kind of
skipped explaining that ;)

Best,
Erick
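
The sort dependency is visible in the Streaming Expression form of a GROUP BY: rollup() assumes its underlying stream arrives sorted on the `over` fields, which is exactly what a multivalued field cannot guarantee. A sketch under assumed collection and field names (app_name would have to be single-valued for this to run):

```
rollup(
  search(collection1,
         q="*:*",
         fl="app_name",
         sort="app_name asc",
         qt="/export"),
  over="app_name",
  count(*)
)
```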

On Thu, Sep 8, 2016 at 10:08 AM, Praveen Babu  wrote:
> Hi Erik/Joel,
>
>  I am not sure , did I confused you guys.
> I was talking about  "GROUP BY" on multivalued field .Not sorting
>
> Example : I have 1 TB data , I want to agg a multiValued field using stream
> api.
> aggmode=map_reduce
>
> Regards,
> S.Praveen
> Technical Architech
> LinkedIn:
> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_responsive_tab_profile
>
>
>
>
> On Thu, Sep 8, 2016 at 8:35 PM, Erick Erickson 
> wrote:
>
>> The basic problem is "what does sorting on a multi-valued field mean"? If
>> you have a numeric field with values 1, 5, 7 how should sorting rank that
>> doc? Use 1? 7? the average? Median? Sum?
>>
>> There is some limited ability in the rest of Solr to sort by min/max but
>> that's
>> it.
>>
>> Best,
>> Erick
>>
>> On Thu, Sep 8, 2016 at 4:57 AM, Joel Bernstein  wrote:
>> > Yes, sorting on multi-valued value fields isn't supported with Streaming
>> > Expressions.
>> >
>> > Multi-value fields can be exported but not used for sorting.
>> >
>> > There currently isn't a plan to add sorting on multi-value fields, but if
>> > other areas in Solr are supporting this perhaps we could use the same
>> > technique.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Thu, Sep 8, 2016 at 2:24 AM, Praveen Babu 
>> > wrote:
>> >
>> >> Hi ,
>> >>
>> >>
>> >> Does parallel sql support Multi valued field?
>> >>
>> >> I am unable to group by on Multi valued field  when I choose
>> >>
>> >> /sql?aggregationMode=map_reduce
>> >>
>> >> "can not sort on multivalued field"
>> >>
>> >>
>> >>
>> >> input:
>> >>
>> >> {
>> >>
>> >> id: 1
>> >>
>> >> field1:[1,2,3],
>> >>
>> >> app.name:[watsapp,facebook,... ]
>> >>
>> >> }
>> >>
>> >> {
>> >>
>> >> id: 2
>> >>
>> >> field1:[1,2,3],
>> >>
>> >> app.name:[watsapp,facebook,... ]
>> >>
>> >> }
>> >>
>> >>
>> >>
>> >> Expected result :
>> >>
>> >> watsapp: 2
>> >>
>> >> facebook : 2
>> >>
>> >>
>> >> I have 2 TB data . I wanted to execute in aggmode=map_reduce. Any
>> >> suggestion?
>> >>
>> >>
>> >>
>> >> Need your valuable suggestion.
>> >>
>> >>
>> >>
>> >> Regards,
>> >> S.Praveen
>> >> Technical Architech
>> >> LinkedIn:
>> >> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
>> >> responsive_tab_profile
>> >>
>>


Re: Solr [Streaming Expressions/Parallel SQL Interface] Not supporting Multi Value using mapReduce option

2016-09-08 Thread Praveen Babu
Hi Erik/Joel,

 I am not sure, did I confuse you guys?
I was talking about "GROUP BY" on a multivalued field, not sorting.

Example: I have 1 TB of data, and I want to aggregate a multiValued field
using the stream API with aggmode=map_reduce.

Regards,
S.Praveen
Technical Architect
LinkedIn:
https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_responsive_tab_profile




On Thu, Sep 8, 2016 at 8:35 PM, Erick Erickson 
wrote:

> The basic problem is "what does sorting on a multi-valued field mean"? If
> you have a numeric field with values 1, 5, 7 how should sorting rank that
> doc? Use 1? 7? the average? Median? Sum?
>
> There is some limited ability in the rest of Solr to sort by min/max but
> that's
> it.
>
> Best,
> Erick
>
> On Thu, Sep 8, 2016 at 4:57 AM, Joel Bernstein  wrote:
> > Yes, sorting on multi-valued value fields isn't supported with Streaming
> > Expressions.
> >
> > Multi-value fields can be exported but not used for sorting.
> >
> > There currently isn't a plan to add sorting on multi-value fields, but if
> > other areas in Solr are supporting this perhaps we could use the same
> > technique.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Sep 8, 2016 at 2:24 AM, Praveen Babu 
> > wrote:
> >
> >> Hi ,
> >>
> >>
> >> Does parallel sql support Multi valued field?
> >>
> >> I am unable to group by on Multi valued field  when I choose
> >>
> >> /sql?aggregationMode=map_reduce
> >>
> >> "can not sort on multivalued field"
> >>
> >>
> >>
> >> input:
> >>
> >> {
> >>
> >> id: 1
> >>
> >> field1:[1,2,3],
> >>
> >> app.name:[watsapp,facebook,... ]
> >>
> >> }
> >>
> >> {
> >>
> >> id: 2
> >>
> >> field1:[1,2,3],
> >>
> >> app.name:[watsapp,facebook,... ]
> >>
> >> }
> >>
> >>
> >>
> >> Expected result :
> >>
> >> watsapp: 2
> >>
> >> facebook : 2
> >>
> >>
> >> I have 2 TB data . I wanted to execute in aggmode=map_reduce. Any
> >> suggestion?
> >>
> >>
> >>
> >> Need your valuable suggestion.
> >>
> >>
> >>
> >> Regards,
> >> S.Praveen
> >> Technical Architech
> >> LinkedIn:
> >> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
> >> responsive_tab_profile
> >>
>


Re: Solr Grouping, Aggregations and Custom Functions

2016-09-08 Thread Roshni
Hi Joel, Thanks for responding.

   For full-fledged data analytics powered by Solr, group by and
aggregations are needed. The basic aggregations are available - but we often
have calculated fields like the one I mentioned, sum(a)/sum(b). It would be
cool to have these in Solr. Such calculations cannot be persisted in raw
data because they depend on the filters and the first set of sum aggregations.

1. When do you think this support may be available... in 6.x?
2. As of today what are my options if I still want to use Solr - would using
Spark or Zeppelin over Solr help me with this custom calculation? Or perhaps
I can use Java to retrieve the grouped sums from the Solr API and then do the
custom calculation on the fly. That may slow it down. Which approach would
you recommend?




Parallel SQL only supports the following functions currently: (SUM, AVG,
MIN, MAX, COUNT).

More functions and compound functions are on the roadmap.

Joel Bernstein
http://joelsolr.blogspot.com/





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Grouping-Aggregations-and-Custom-Functions-tp4295093p4295181.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Default stop word list

2016-09-08 Thread Walter Underwood
I recommend that you remove StopFilterFactory from every analysis chain.

In the tf.idf scoring model, rare words are automatically weighted more than 
common words.

I have an index with 11.6 million documents. “the” occurs in 9.9 million of 
those documents. “cat” occurs in 16,000 of those documents. (I just did 
searches to get the counts).

This is the idf (inverse document frequency) formula for Solr:

  public float idf(int docFreq, int numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

“the” has an idf of 1.07. “cat” has an idf of 3.86.

The term “the” still counts for relevance, but it is dominated by the weight 
for “cat”.
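
Those weights can be reproduced with a small self-contained sketch. One nuance: the Java snippet above calls Math.log (the natural log), while the 1.07 and 3.86 figures come out when the base-10 log is used with numDocs = 11.6M, docFreq("the") = 9.9M and docFreq("cat") = 16,000; either way, the ordering — rare terms weigh more — is the same:

```java
public class IdfDemo {
    // idf using the base-10 log, which reproduces the 1.07 / 3.86 figures
    // quoted above. Rare terms (small docFreq) get a larger weight.
    static double idf(int docFreq, int numDocs) {
        return Math.log10((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        System.out.printf("idf(\"the\") = %.2f%n", idf(9_900_000, 11_600_000)); // 1.07
        System.out.printf("idf(\"cat\") = %.2f%n", idf(16_000, 11_600_000));    // 3.86
    }
}
```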

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 8, 2016, at 7:09 AM, Steven White  wrote:
> 
> Hi Walter and all.  Sorry for the late reply, I was out of town.
> 
> Are you saying the list of stop words from the stop word file be remove?  I
> understand the issues I will run into because of the stop word list, but
> all alone, my understanding of stop word list being in the stop word file
> is -- to eliminate them from being indexed -- is so that relevancy ranking
> is improved.  For example, if I index the word "the" instead of removing it
> than when I send the search term "the cat" (without quotes) than records
> with "the" will rank far higher vs. records with "cat" in my result set.
> In fact records with "cat" may not even be on the first page.  Wasn't this
> was stop word list created?
> 
> If my understanding is correct, is there a way for me to rank lower records
> that have a hit due to a list of common words, such as stop words?  This
> way: (1) I can than get rid of all the stop word list in the stop word
> file, (2) solve the issue of searching on "be with me", et. al., and (3)
> prevent the ranking issue.
> 
> Steve
> 
> On Mon, Aug 29, 2016 at 9:18 PM, Walter Underwood 
> wrote:
> 
>> Do not remove stop words. Want to search for “vitamin a”? That won’t work.
>> 
>> Stop word removal is a hack left over from when we were running search
>> engines in 64 kbytes of memory.
>> 
>> Yes, common words are less important for search, but removing them is a
>> brute force approach with severe side effects. Instead, we use a
>> proportional approach with the tf.idf model. That puts a higher weight on
>> rare words and a lower weight on common words.
>> 
>> For some real-life examples of problems with stop words, you can read the
>> list of movie titles that disappear with stemming and stop words. I
>> discovered these when I was running search at Netflix.
>> 
>>• Being There (this is the first one I noticed)
>>• To Be and To Have (Être et Avoir)
>>• To Have and To Have Not
>>• Once and Again
>>• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
>>• To Be or Not To Be (1983)
>>• Now and Then, Here and There
>>• Be with Me
>>• I’ll Be There
>>• It Had to Be You
>>• You Should Not Be Here
>>• You Are Here
>> 
>> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 29, 2016, at 5:39 PM, Steven White  wrote:
>>> 
>>> Thanks Shawn.  This is the best answer I have seen, much appreciated.
>>> 
>>> A follow up question, I want to remove stop words from the list, but if I
>>> do, then search quality will degrade (and the index size will grow, less of
>>> an issue).  For example, if I remove "a", then if someone searches for "For
>>> a Few Dollars More" (without quotes), chances are good records with "a" will
>>> land higher up that are not relevant to the user's search.  How can I address
>>> this?  Can I setup my schema so that records that get hits against a list
>>> of words, let's say off the stop word list, are ranked lower?
>>> 
>>> Steve
>>> 
>>> On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey 
>> wrote:
>>> 
 On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> I personally think that stopword removal is more of a problem than a
> solution.
 
 There actually is one thing that a stopword filter can do that has little
 to do with the purpose it was designed for.  You can make it impossible
 to search for certain words.
 
 Imagine that your original data contains the word "frisbee" but for some
 reason you do not want anybody to be able to locate results using that
 word.  You can create a stopword list containing just "frisbee" and any
 other variations that you want to limit like "frisbees", then place it
 as a filter on the index side of your analysis.  With this in place,
 searching for those terms will retrieve zero results.
 
 Thanks,
 Shawn
 
 
>> 
>> 
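
The index-time blocklist trick Shawn describes in the quoted reply can be sketched as a schema fragment. The field type name and the "blockwords.txt" file are assumptions; the point is that StopFilterFactory appears only on the index-side analyzer, so the listed terms never make it into the index and queries for them return zero results:

```xml
<!-- Hypothetical field type: blockwords.txt lists terms (e.g. frisbee,
     frisbees) that should be impossible to find. The filter is applied
     only at index time, so searches for those terms match nothing. -->
<fieldType name="text_blocked" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="blockwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```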



MapReduceIndexerTool erroring with max_array_length

2016-09-08 Thread Darshan Pandya
Hello,

While this may be a question for cloudera, I wanted to tap the brains of
this very active community as well.

I am trying to use the MapReduceIndexerTool to index data in a hive table
to Solr Cloud / Cloudera Search.

The tool is failing the job with the following error



1799 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1
files using 1 real mappers into 10 reducers

Error: MAX_ARRAY_LENGTH

Error: MAX_ARRAY_LENGTH

Error: MAX_ARRAY_LENGTH

36962 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool  - Job
failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper,
jobId: job_1473161870114_0339



The error stack trace is

2016-09-08 10:39:20,128 ERROR [main]
org.apache.hadoop.mapred.YarnChild: Error running child :
java.lang.NoSuchFieldError: MAX_ARRAY_LENGTH
at 
org.apache.lucene.codecs.memory.DirectDocValuesFormat.(DirectDocValuesFormat.java:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at java.lang.Class.newInstance(Class.java:374)
at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
at org.apache.lucene.util.NamedSPILoader.(NamedSPILoader.java:47)
at org.apache.lucene.util.NamedSPILoader.(NamedSPILoader.java:37)
at 
org.apache.lucene.codecs.DocValuesFormat.(DocValuesFormat.java:43)
at 
org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:205)





My Schema.xml looks like





   

   

   









<uniqueKey>dataset_id</uniqueKey>





I am otherwise able to post documents using Solr APIs / upload methods.
Only the MapReduceIndexer tool is failing.



The command I am using is

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar
org.apache.solr.hadoop.MapReduceIndexerTool -D
'mapred.child.java.opts=-Xmx500m' --log4j
/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/s
hare/doc/search-1.0.0+cdh5.7.0+0/examples/solr-nrt/log4j.properties
--morphline-file /home/$USER/morphline2.conf --output-dir
hdfs://NNHOST:8020/user/$USER/outdir --verbose --zk-host ZKHOST:2181/solr1
--collection dataCatalog_search_index
hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/;



My morphline config looks like



SOLR_LOCATOR : {

  # Name of solr collection

  collection : search_index



  # ZooKeeper ensemble

  zkHost : "ZKHOST:2181/solr1"

}



# Specify an array of one or more morphlines, each of which defines an ETL

# transformation chain. A morphline consists of one or more (potentially

# nested) commands. A morphline is a way to consume records (e.g. Flume
events,

# HDFS files or blocks), turn them into a stream of records, and pipe the
stream

# of records through a set of easily configurable transformations on the
way to

# a target application such as Solr.

morphlines : [

  {

id : search_index

importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

commands : [

  {

readCSV {

  separator : ","

  columns : [dataset_id,search_string]

  ignoreFirstLine : true

  charset : UTF-8

}

  }





  # Consume the output record of the previous command and pipe another

  # record downstream.

  #

  # Command that deletes record fields that are unknown to Solr

  # schema.xml.

  #

  # Recall that Solr throws an exception on any attempt to load a
document

  # that contains a field that isn't specified in schema.xml.

  {

sanitizeUnknownSolrFields {

  # Location from which to fetch Solr schema

  solrLocator : ${SOLR_LOCATOR}

}

  }



  # log the record at DEBUG level to SLF4J

  { logDebug { format : "output record: {}", args : ["@{}"] } }



  # load the record into a Solr server or MapReduce Reducer

  {

loadSolr {

  solrLocator : ${SOLR_LOCATOR}

}

  }

]

  }

]





Please let me know if I am doing anything wrong.

-- 
Sincerely,
Darshan


Re: Solr [Streaming Expressions/Parallel SQL Interface] Not supporting Multi Value using mapReduce option

2016-09-08 Thread Erick Erickson
The basic problem is "what does sorting on a multi-valued field mean"? If
you have a numeric field with values 1, 5, 7 how should sorting rank that
doc? Use 1? 7? the average? Median? Sum?

There is some limited ability in the rest of Solr to sort by min/max but that's
it.

Best,
Erick

On Thu, Sep 8, 2016 at 4:57 AM, Joel Bernstein  wrote:
> Yes, sorting on multi-valued value fields isn't supported with Streaming
> Expressions.
>
> Multi-value fields can be exported but not used for sorting.
>
> There currently isn't a plan to add sorting on multi-value fields, but if
> other areas in Solr are supporting this perhaps we could use the same
> technique.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Sep 8, 2016 at 2:24 AM, Praveen Babu 
> wrote:
>
>> Hi ,
>>
>>
>> Does parallel sql support Multi valued field?
>>
>> I am unable to group by on Multi valued field  when I choose
>>
>> /sql?aggregationMode=map_reduce
>>
>> "can not sort on multivalued field"
>>
>>
>>
>> input:
>>
>> {
>>
>> id: 1
>>
>> field1:[1,2,3],
>>
>> app.name:[watsapp,facebook,... ]
>>
>> }
>>
>> {
>>
>> id: 2
>>
>> field1:[1,2,3],
>>
>> app.name:[watsapp,facebook,... ]
>>
>> }
>>
>>
>>
>> Expected result :
>>
>> watsapp: 2
>>
>> facebook : 2
>>
>>
>> I have 2 TB data . I wanted to execute in aggmode=map_reduce. Any
>> suggestion?
>>
>>
>>
>> Need your valuable suggestion.
>>
>>
>>
>> Regards,
>> S.Praveen
>> Technical Architech
>> LinkedIn:
>> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
>> responsive_tab_profile
>>


Re: Solr Grouping, Aggregations and Custom Functions

2016-09-08 Thread Praveen Babu
Hi Joel Bernstein,

Thanks for the update. If you guys get a chance to provide that feature soon,
it will be of great benefit to Solr users.


Regards,
S.Praveen
Technical Architect
LinkedIn:
https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_responsive_tab_profile




On Thu, Sep 8, 2016 at 5:30 PM, Joel Bernstein  wrote:

> Parallel SQL only supports the following functions currently: (SUM, AVG,
> MIN, MAX, COUNT).
>
> More functions and compound functions are on the roadmap.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Sep 8, 2016 at 12:11 AM, Praveen Babu 
> wrote:
>
> > Hi All,
> >
> > I am also new to Solr, and I have gone through the Solr documentation and
> > tested aggregations using Solr Parallel SQL (Presto) and Stream.
> >
> > I am getting very good response times using these 2 technologies. But my
> > worry is being unable to GROUP BY a multivalue field, which the Solr
> > standard API supports but not the latest version of Parallel SQL/Stream.
> >
> > I want to aggregate/group by the "app.name" field using Stream / Parallel
> > SQL. Please suggest.
> >
> > input:
> >
> > {
> >
> > id: 1
> >
> > field1:[1,2,3],
> >
> > app.name:[watsapp,facebook,... ]
> >
> > }
> >
> > {
> >
> > id: 2
> >
> > field1:[1,2,3],
> >
> > app.name:[watsapp,facebook,... ]
> >
> > }
> >
> >
> >
> > Expected result :
> >
> > watsapp: 2
> >
> > facebook : 2
> >
> >
> > I have 2 TB data . I wanted to execute in aggmode=map_reduce. Any
> > suggestion?
> >
> >
> >
> > Regards,
> > S.Praveen
> > Technical Architech
> > LinkedIn:
> > https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
> > responsive_tab_profile
> >
> >
> >
> >
> > On Thu, Sep 8, 2016 at 6:01 AM, Roshni Rajagopal 
> > wrote:
> >
> > > Hi Solr Gurus,
> > >
> > >I have these requirements
> > >
> > > 1. Need to group data in Solr on multiple fields and compute
> > > aggregations like SUM(field)
> > >
> > > 2. Need to compute some custom calculations - sum(field1)/sum(field2) -
> > > on the grouped data.
> > >
> > > Options I've tried:
> > >
> > > 1. Group - this does not allow grouping by more than one field, and
> > > aggregations are not supported.
> > >
> > > 2. Stats - this along with facet.pivot gets results for basic group
> > > aggregations like SUM. Custom calculation is not supported. Also the
> > > format is messy, with stats getting calculated at every level. Cannot
> > > paginate.
> > >
> > > 3. Facet JSON API - gets results for basic group aggregations like SUM.
> > > The format is less messy and we can paginate. A custom calculation like
> > > DIV(sum(field1), sum(field2)) is still not supported.
> > >
> > > So the last resort is the /sql handler for parallel queries. Is it
> > > tested and stable, and will it meet my requirements? I'm on solr 6.10.
> > >
> > > Or would you recommend adding Spark… I would prefer to handle all
> > > requirements in Solr, as I don't want to maintain another moving part in
> > > the form of Spark.
> > >
> > > Do advise!
> > >
> > > Regards
> > >
> > > Roshni
> > >
> >
>


Re: Default stop word list

2016-09-08 Thread Steven White
Hi Walter and all.  Sorry for the late reply, I was out of town.

Are you saying the list of stop words in the stop word file should be removed?
I understand the issues I will run into because of the stop word list, but
all along, my understanding of the stop word list being in the stop word file
-- to eliminate those words from being indexed -- is that relevancy ranking
is improved.  For example, if I index the word "the" instead of removing it,
then when I send the search term "the cat" (without quotes), records
with "the" will rank far higher vs. records with "cat" in my result set.
In fact records with "cat" may not even be on the first page.  Wasn't this
why the stop word list was created?

If my understanding is correct, is there a way for me to rank records lower
when they hit only on a list of common words, such as stop words?  This
way: (1) I can then get rid of all the stop words in the stop word
file, (2) solve the issue of searching on "be with me", et al., and (3)
prevent the ranking issue.

Steve

On Mon, Aug 29, 2016 at 9:18 PM, Walter Underwood 
wrote:

> Do not remove stop words. Want to search for “vitamin a”? That won’t work.
>
> Stop word removal is a hack left over from when we were running search
> engines in 64 kbytes of memory.
>
> Yes, common words are less important for search, but removing them is a
> brute force approach with severe side effects. Instead, we use a
> proportional approach with the tf.idf model. That puts a higher weight on
> rare words and a lower weight on common words.
>
> For some real-life examples of problems with stop words, you can read the
> list of movie titles that disappear with stemming and stop words. I
> discovered these when I was running search at Netflix.
>
> • Being There (this is the first one I noticed)
> • To Be and To Have (Être et Avoir)
> • To Have and To Have Not
> • Once and Again
> • To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
> • To Be or Not To Be (1983)
> • Now and Then, Here and There
> • Be with Me
> • I’ll Be There
> • It Had to Be You
> • You Should Not Be Here
> • You Are Here
>
> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 29, 2016, at 5:39 PM, Steven White  wrote:
> >
> > Thanks Shawn.  This is the best answer I have seen, much appreciated.
> >
> > A follow up question, I want to remove stop words from the list, but if I
> > do, then search quality will degrade (and the index size will grow, less of
> > an issue).  For example, if I remove "a", then if someone searches for "For
> > a Few Dollars More" (without quotes), chances are good records with "a" will
> > land higher up that are not relevant to the user's search.  How can I address
> > this?  Can I setup my schema so that records that get hits against a list
> > of words, let's say off the stop word list, are ranked lower?
> >
> > Steve
> >
> > On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey 
> wrote:
> >
> >> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> >>> I personally think that stopword removal is more of a problem than a
> >>> solution.
> >>
> >> There actually is one thing that a stopword filter can do that has little
> >> to do with the purpose it was designed for.  You can make it impossible
> >> to search for certain words.
> >>
> >> Imagine that your original data contains the word "frisbee" but for some
> >> reason you do not want anybody to be able to locate results using that
> >> word.  You can create a stopword list containing just "frisbee" and any
> >> other variations that you want to limit like "frisbees", then place it
> >> as a filter on the index side of your analysis.  With this in place,
> >> searching for those terms will retrieve zero results.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
>


Re: solr 5.5.2 dump threads - threads blocked in org.eclipse.jetty.util.BlockingArrayQueue

2016-09-08 Thread elisabeth benoit
Well, we re-kicked the machine with puppet, restarted Solr, and now it seems
OK. Don't know what happened.

2016-09-08 11:38 GMT+02:00 elisabeth benoit :

>
> Hello,
>
>
> We are perf testing solr 5.5.2 (with a limit test, i.e. sending as many
> queries/sec as possible) and we see the CPU never goes over 20%, and
> threads are blocked in org.eclipse.jetty.util.BlockingArrayQueue, as we
> can see in the solr admin interface thread dumps:
>
> qtp706277948-757 (757)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject@2c4a56cb
>
>- sun.misc.Unsafe.park​(Native Method)
>- java.util.concurrent.locks.LockSupport.parkNanos​(
>LockSupport.java:215)
>- java.util.concurrent.locks.AbstractQueuedSynchronizer$
>ConditionObject.awaitNanos​(AbstractQueuedSynchronizer.java:2078)
>- org.eclipse.jetty.util.BlockingArrayQueue.poll​(
>BlockingArrayQueue.java:389)
>- org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll​(
>QueuedThreadPool.java:531)
>- org.eclipse.jetty.util.thread.QueuedThreadPool.access$700​(
>QueuedThreadPool.java:47)
>- org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(
>QueuedThreadPool.java:590)
>- java.lang.Thread.run​(Thread.java:745)
>
>
> We changed two things in jetty configuration,
>
> maxThreads value in /opt/solr/server/solr/jetty.xml
>
> <Set name="maxThreads"><Property name="solr.jetty.threads.max" default="400"/></Set>
>
>
> and we activated the request log, i.e. uncommented the lines
>
>
>
> <Ref id="Handlers">
>   <Call name="addHandler">
>     <Arg>
>       <New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
>         <Set name="requestLog">
>           <New id="RequestLogImpl" class="org.eclipse.jetty.server.AsyncNCSARequestLog">
>             <Set name="filename">/var/solr/logs/requests.log</Set>
>             <Set name="filenameDateFormat">yyyy_MM_dd</Set>
>             <Set name="retainDays">90</Set>
>             <Set name="append">true</Set>
>             <Set name="extended">false</Set>
>             <Set name="logCookies">false</Set>
>             <Set name="LogTimeZone">UTC</Set>
>             <Set name="logLatency">true</Set>
>           </New>
>         </Set>
>       </New>
>     </Arg>
>   </Call>
> </Ref>
>
>
> in jetty.xml
>
>
> We had the same result with maxThreads=10000 (the default value in the Solr
> install).
>
>
> Has anyone experienced the same issue with Solr 5?
>
>
> Best regards,
>
> Elisabeth
>
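For reference, a sketch of the thread-pool section that maxThreads lives in (Solr 5.x server/etc/jetty.xml; exact property names may vary between point releases). Note that threads parked in QueuedThreadPool.idleJobPoll / BlockingArrayQueue.poll are idle pool workers waiting for a job, not blocked requests, so a dump full of them just means the pool has spare capacity:

```xml
<!-- Sketch of the Jetty thread pool configuration shipped with Solr 5.x -->
<New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <Set name="minThreads"><Property name="solr.jetty.threads.min" default="10"/></Set>
  <Set name="maxThreads"><Property name="solr.jetty.threads.max" default="10000"/></Set>
  <Set name="idleTimeout"><Property name="solr.jetty.threads.idle.timeout" default="5000"/></Set>
  <Set name="detailedDump">false</Set>
</New>
```

With a 20% CPU ceiling and no thread starvation, the bottleneck is more likely elsewhere (client concurrency, network, or I/O) than in this pool.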


Re: extract metadata

2016-09-08 Thread Alexandre Rafalovitch
That's what the extract handler does. But look at the examples that ship with
Solr, including the examples/files one.

Or you can use Tika directly and send only the extracted fields to Solr.

Regards,
Alex

On 8 Sep 2016 8:39 PM, "KRIS MUSSHORN"  wrote:

> How would one get all metadata/properties from a .doc/.pdf/.xls etc. into
> fields in Solr?
>
>


extract metadata

2016-09-08 Thread KRIS MUSSHORN
How would one get all metadata/properties from a .doc/.pdf/.xls etc. into
fields in Solr?
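In stock Solr configs, the Tika-backed extract handler that does this is registered in solrconfig.xml roughly like the sketch below (based on the shipped example; the exact defaults are assumptions):

```xml
<!-- Sketch: /update/extract parses rich documents with Tika.
     "uprefix" routes any Tika metadata field that is not in the schema
     into the attr_* dynamic field, so all document properties are kept. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>
```

A matching dynamic field such as `<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>` would then capture every extracted property.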



AW: Wrong highlighting in stripped HTML field

2016-09-08 Thread Neumann, Dennis
Hello,
thank you very much for your answers. As described in the SOLR-4686 issue, the 
problem only occurs when you use inline HTML tags (like  or ). So in 
my case the solution is actually to use a block element and force it to be 
inline:

bla

highlighting:

bla

Cheers and thanks again,
Dennis
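The markup in Dennis's example did not survive above; hedged, the workaround amounts to something like this (the exact element and styling are assumptions):

```html
<!-- Stored value (sketch): a block-level element forced inline -->
<h1 style="display:inline">bla</h1>

<!-- The highlighted snippet then comes back well-formed: -->
<h1 style="display:inline"><em>bla</em></h1>
```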



From: Alan Woodward [a...@flax.co.uk]
Sent: Thursday, 8 September 2016 12:48
To: solr-user@lucene.apache.org
Subject: Re: Wrong highlighting in stripped HTML field

Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing
point of contention!

Alan Woodward
www.flax.co.uk


> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH  
> wrote:
>
> As far as I can tell, that is how it's currently set-up (does the same on 
> mine at least). The HTML Stripper seems to exclude the pre tag, but include 
> the post tag when it generates the start and end offsets of each text token. 
> I couldn't say why though... (This may just have avoided needing to 
> backtrack).
>
> Play around in the analysis section of the admin ui to verify this.
>
> Geraint
>
>
> -Original Message-
> From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: AW: Wrong highlighting in stripped HTML field
>
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr 
> installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should 
> ask on the dev mailing list.
>
> Thanks and cheers,
> Dennis
>
>
> 
> From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
> Sent: Monday, 5 September 2016 18:00
> To: solr-user@lucene.apache.org
> Subject: Wrong highlighting in stripped HTML field
>
> Hi guys
>
> I am having a problem with the standard highlighter. I'm working with Solr 
> 5.4.1. The problem appears in my project, but it is easy to replicate:
>
> I create a new core with the conf directory from configsets/basic_configs, so 
> everything is set to defaults. I add the following in schema.xml:
>
>
> required="false" multiValued="false" />
>
>
>  
>
>
>  
>  
>
>  
>
>
>
> Now I add this document (in the admin interface):
>
> {"id":"1","testfield":"bla"}
>
> I search for: testfield:bla
> with hl=on&hl.fl=testfield
>
> What I get is a response with an incorrectly formatted HTML snippet:
>
>
>  "response": {
>"numFound": 1,
>"start": 0,
>"docs": [
>  {
>"id": "1",
>"testfield": "bla",
>"_version_": 1544645963570741200
>  }
>]
>  },
>  "highlighting": {
>"1": {
>  "testfield": [
>"bla"
>  ]
>}
>  }
>
> Is there a way to tell the highlighter to just enclose the "bla"? I. e. I 
> want to get
>
> bla
>
>
> Best regards
> Dennis
>
>
> 
>
>
> Syngenta Limited, Registered in England No 2710846; Registered Office : 
> Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire, 
> RG42 6EY, United Kingdom
> 
> This message may contain confidential information. If you are not the 
> designated recipient, please notify the sender immediately, and delete the 
> original and any copies. Any use of the message by you is prohibited.
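A minimal schema set-up reproducing the behaviour Dennis describes would look something like this (a sketch only: the type name text_html and the exact filter chain are assumptions; "testfield" follows the example):

```xml
<!-- Sketch: a stored field whose analyzer strips HTML before tokenizing.
     The char filter keeps token offsets pointing into the raw markup,
     which is what trips up the highlighter here. -->
<field name="testfield" type="text_html" indexed="true" stored="true"
       required="false" multiValued="false"/>

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```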



Re: StrField with Wildcard Search

2016-09-08 Thread Ahmet Arslan
Hi,


I think AutomatonQuery is used.
http://opensourceconnections.com/blog/2013/02/21/lucene-4-finite-state-automaton-in-10-minutes-intro-tutorial/
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/AutomatonQuery.html

Ahmet


On Thursday, September 8, 2016 3:54 PM, Sandeep Khanzode 
 wrote:



Hi,

Okay.

So it seems that wildcard searches perform a (sort-of) dictionary search,
inspecting every (full keyword) term in the index at search time and matching
against it, instead of matching pre-created index-time tokens as with
TextField. However, the wildcard/fuzzy functionality is provided either way...
 
SRK


On Thursday, September 8, 2016 5:05 PM, Ahmet Arslan 
 wrote:



Hi,

EdgeNGram and wildcard may be used to achieve the same goal: prefix search,
or "starts with" search.

A wildcard enumerates terms in the inverted index, so it may get slower for
very large indexes. On the other hand, no index-time manipulation is required.

EdgeNGram does its magic at index time, indexing a lot of tokens - all
possible prefixes. The index gets bigger, but at query time no wildcard
operator is required.

Ahmet




On Thursday, September 8, 2016 12:35 PM, Sandeep Khanzode 
 wrote:
Hello,
There are quite a few links that detail the difference between StrField and
TextField, and links that explain that, even though a StrField is indexed, it
is not tokenized and is stored as a single keyword, as can be verified via
the debug analysis in the Solr admin UI and the curl debugQuery options.
What I am unable to understand is how a wildcard works on StrFields. For
example, if the name is "John Doe" and I search for "John*", I get that
match. Which means that somewhere deep within, maybe a trie or dictionary
representation exists that allows this search with a partial string.
I would have assumed that wildcards would match on TextFields, which allow
(Edge)NGramFilters, etc.  -- SRK
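The index-time alternative Ahmet describes could be sketched as a field type like this (a sketch; the type name and gram sizes are assumptions):

```xml
<!-- Sketch: edge n-grams at index time only, so "john" indexes
     j, jo, joh, john and a plain term query for "joh" matches with
     no wildcard operator. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gramming at query time: the user's literal prefix is the term -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This trades index size for query speed, which is the exact trade-off Ahmet outlines.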


Custom Function-based Fields

2016-09-08 Thread Sandeep Khanzode
Hi,
Can someone please direct me to some documentation that shows how to do
this ... ?
I need to write a non-trivial function that returns a new custom field (not
in the schema) but is more complicated than a simple sum/avg/etc.
I want to create a function that looks at a few date ranges in the current
record and returns possibly an enum or an integer ...
Maybe something similar could also be helpful ...
Thanks. SRK
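One route for this (a sketch, not from the thread: the function name dateStatus and the class are made up) is a custom ValueSourceParser registered in solrconfig.xml:

```xml
<!-- Sketch: register a hypothetical custom function query.
     com.example.DateStatusParser would extend
     org.apache.solr.search.ValueSourceParser and return a Lucene
     ValueSource that computes the enum/integer from the date fields. -->
<valueSourceParser name="dateStatus" class="com.example.DateStatusParser"/>
```

It could then be requested as a pseudo-field, e.g. fl=id,status:dateStatus(startdate,enddate), or used inside sort and boost expressions.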

Re: StrField with Wildcard Search

2016-09-08 Thread Sandeep Khanzode
Hi,
Okay.
So it seems that the wildcard searches will perform a (sort-of) dictionary 
search where they will inspect every (full keyword) token at search time, and 
do a match instead of a match on pre-created index-time tokens with TextField. 
However, the wildcard/fuzzy functionality will still be provided no matter the 
approach... SRK 

On Thursday, September 8, 2016 5:05 PM, Ahmet Arslan 
 wrote:
 

 Hi,

EdgeNGram and Wildcard may be used to achieve the same goal: prefix search or 
starts with search.

Lets say, wildcard enumerates the whole inverted index, thus it may get slower 
for very large databases.
With this one no index time manipulation is required.

EdgeNGram does its magic at index time, indexes a lot of tokens, all possible 
prefixes.
Index size gets bigger, query time no wildcard operator required in this one.

Ahmet



On Thursday, September 8, 2016 12:35 PM, Sandeep Khanzode 
 wrote:
Hello,
There are quite a few links that detail the difference between StrField and 
TextField. Also links that explain that, even though the field is indexed, it 
is not tokenized and stored as a single keyword, as can be verified by the 
debug analysis on Solr admin and CURL debugQuery options.
What I am unable to understand is how a wildcard works on StrFields? For 
example, if the name is "John Doe" and I search for "John*", I get that match. 
Which means, that somewhere deep within, maybe a Trie or Dictionary 
representation exists that allows this search with a partial string.
I would have assumed that wildcard would match on TextFields which allow 
(Edge)NGramFilters, etc.  -- SRK 


   

Re: Solr Grouping, Aggregations and Custom Functions

2016-09-08 Thread Joel Bernstein
Parallel SQL only supports the following functions currently: (SUM, AVG,
MIN, MAX, COUNT).

More functions and compound functions are on the roadmap.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 8, 2016 at 12:11 AM, Praveen Babu 
wrote:

> Hi All,
>
> I am also new to Solr. I have gone through the Solr documentation and have
> tested aggregations using Solr's Parallel SQL (Presto-based) and streaming
> expressions.
>
> I am getting very good response times using these two technologies. But my
> worry is that I am unable to GROUP BY a multi-valued field, which the
> standard Solr API supports but the latest Parallel SQL / streaming
> interfaces do not.
>
> I want to aggregate/group by the "app.name" field using streaming /
> Parallel SQL. Please suggest.
>
> input:
>
> {
>
> id: 1
>
> field1:[1,2,3],
>
> app.name:[watsapp,facebook,... ]
>
> }
>
> {
>
> id: 2
>
> field1:[1,2,3],
>
> app.name:[watsapp,facebook,... ]
>
> }
>
>
>
> Expected result :
>
> watsapp: 2
>
> facebook : 2
>
>
> I have 2 TB of data. I want to execute in aggregationMode=map_reduce. Any
> suggestions?
>
>
>
> Regards,
> S.Praveen
> Technical Architect
> LinkedIn:
> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
> responsive_tab_profile
>
>
>
>
> On Thu, Sep 8, 2016 at 6:01 AM, Roshni Rajagopal 
> wrote:
>
> > Hi Solr Gurus,
> >
> >I have these requirements
> >
> > 1. Need to group data in solr on multiple fields and compute agregations
> > like SUM (field)
> >
> > 2. Need to compute some custom calculations - sum(field1)/sum(field2) on
> > the grouped data.
> >
> > Options Ive tried
> >
> > 1. Grouping - this does not allow grouping by more than one field, and
> > aggregations are not supported.
> >
> > 2. Stats - this, along with facet.pivot, gets results for basic group
> > aggregations like SUM. Custom calculations are not supported. Also the
> > format is messy, with stats getting calculated at every level. Cannot
> > paginate.
> >
> > 3. Facet JSON API - gets results for basic group aggregations like SUM.
> > The format is less messy and we can paginate. A custom calculation like
> > div(sum(field1), sum(field2)) is still not supported.
> >
> > So the last resort is the /sql handler for parallel queries. Is it tested
> > and stable, and will it meet my requirements? I'm on Solr 6.10.
> >
> > Or would you recommend adding Spark? I would prefer to handle all
> > requirements in Solr, as I don't want to maintain Spark as another moving
> > part.
> >
> > Do advise!
> >
> > Regards
> >
> > Roshni
> >
>


Re: Solr [Streaming Expressions/Parallel SQL Interface] Not supporting Multi Value using mapReduce option

2016-09-08 Thread Joel Bernstein
Yes, sorting on multi-valued fields isn't supported with Streaming
Expressions.

Multi-value fields can be exported but not used for sorting.

There currently isn't a plan to add sorting on multi-value fields, but if
other areas in Solr are supporting this perhaps we could use the same
technique.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 8, 2016 at 2:24 AM, Praveen Babu 
wrote:

> Hi ,
>
>
> Does Parallel SQL support multi-valued fields?
>
> I am unable to GROUP BY on a multi-valued field when I choose
>
> /sql?aggregationMode=map_reduce
>
> I get: "can not sort on multivalued field"
>
>
>
> input:
>
> {
>
> id: 1
>
> field1:[1,2,3],
>
> app.name:[watsapp,facebook,... ]
>
> }
>
> {
>
> id: 2
>
> field1:[1,2,3],
>
> app.name:[watsapp,facebook,... ]
>
> }
>
>
>
> Expected result :
>
> watsapp: 2
>
> facebook : 2
>
>
> I have 2 TB of data. I want to execute in aggregationMode=map_reduce. Any
> suggestions?
>
>
>
> Need your valuable suggestion.
>
>
>
> Regards,
> S.Praveen
> Technical Architect
> LinkedIn:
> https://www.linkedin.com/in/praveen-babu-73232889?trk=nav_
> responsive_tab_profile
>
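For the counts-per-term result in the example above, a plain JSON terms facet on the multi-valued field gives the expected output directly, with no map_reduce sorting involved (a sketch; the field name follows the example):

```json
{
  "query": "*:*",
  "facet": {
    "apps": {
      "type": "terms",
      "field": "app.name",
      "limit": 100
    }
  }
}
```

POSTed to /select (or passed as json.facet=...), this would return one bucket per app.name value, e.g. buckets for "watsapp" and "facebook" each with count 2, since each document is counted once per term it contains.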


Re: Streaming expression in solr doesnot support collection alias

2016-09-08 Thread Joel Bernstein
Getting aliases working is a high priority and fairly easy to do. We should
have this in for Solr 6.3.


Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 8, 2016 at 3:18 AM, Tali Finelt  wrote:

> Hi All,
>
> We saw there is an open issue regarding this subject:
> https://issues.apache.org/jira/browse/SOLR-9077
>
> We would very much like to use this feature in our new production version.
>
> This issue currently prevents us from using streaming.
>
> We were wondering if there is any plan to fix this soon?
> If not, can someone recommend a possible work-around?
>
> Thanks,
> Tali
>
>
>
>


Re: StrField with Wildcard Search

2016-09-08 Thread Ahmet Arslan
Hi,

EdgeNGram and wildcard may be used to achieve the same goal: prefix search,
or "starts with" search.

A wildcard enumerates terms in the inverted index, so it may get slower for
very large indexes. On the other hand, no index-time manipulation is required.

EdgeNGram does its magic at index time, indexing a lot of tokens - all
possible prefixes. The index gets bigger, but at query time no wildcard
operator is required.

Ahmet



On Thursday, September 8, 2016 12:35 PM, Sandeep Khanzode 
 wrote:
Hello,
There are quite a few links that detail the difference between StrField and
TextField, and links that explain that, even though a StrField is indexed, it
is not tokenized and is stored as a single keyword, as can be verified via
the debug analysis in the Solr admin UI and the curl debugQuery options.
What I am unable to understand is how a wildcard works on StrFields. For
example, if the name is "John Doe" and I search for "John*", I get that
match. Which means that somewhere deep within, maybe a trie or dictionary
representation exists that allows this search with a partial string.
I would have assumed that wildcards would match on TextFields, which allow
(Edge)NGramFilters, etc.  -- SRK


Re: Wrong highlighting in stripped HTML field

2016-09-08 Thread Alan Woodward
Hi, see https://issues.apache.org/jira/browse/SOLR-4686 - this is an ongoing
point of contention!

Alan Woodward
www.flax.co.uk


> On 8 Sep 2016, at 09:38, Duck Geraint (ext) GBJH  
> wrote:
> 
> As far as I can tell, that is how it's currently set-up (does the same on 
> mine at least). The HTML Stripper seems to exclude the pre tag, but include 
> the post tag when it generates the start and end offsets of each text token. 
> I couldn't say why though... (This may just have avoided needing to 
> backtrack).
> 
> Play around in the analysis section of the admin ui to verify this.
> 
> Geraint
> 
> 
> -Original Message-
> From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
> Sent: 07 September 2016 18:16
> To: solr-user@lucene.apache.org
> Subject: AW: Wrong highlighting in stripped HTML field
> 
> Hello,
> can anyone confirm this behavior of the highlighter? Otherwise my Solr 
> installation might be misconfigured or something.
> Or does anyone know if this is a known issue? In that case I probably should 
> ask on the dev mailing list.
> 
> Thanks and cheers,
> Dennis
> 
> 
> 
> From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
> Sent: Monday, 5 September 2016 18:00
> To: solr-user@lucene.apache.org
> Subject: Wrong highlighting in stripped HTML field
> 
> Hi guys
> 
> I am having a problem with the standard highlighter. I'm working with Solr 
> 5.4.1. The problem appears in my project, but it is easy to replicate:
> 
> I create a new core with the conf directory from configsets/basic_configs, so 
> everything is set to defaults. I add the following in schema.xml:
> 
> 
> required="false" multiValued="false" />
> 
>
>  
>
>
>  
>  
>
>  
>
> 
> 
> Now I add this document (in the admin interface):
> 
> {"id":"1","testfield":"bla"}
> 
> I search for: testfield:bla
> with hl=on&hl.fl=testfield
> 
> What I get is a response with an incorrectly formatted HTML snippet:
> 
> 
>  "response": {
>"numFound": 1,
>"start": 0,
>"docs": [
>  {
>"id": "1",
>"testfield": "bla",
>"_version_": 1544645963570741200
>  }
>]
>  },
>  "highlighting": {
>"1": {
>  "testfield": [
>"bla"
>  ]
>}
>  }
> 
> Is there a way to tell the highlighter to just enclose the "bla"? I. e. I 
> want to get
> 
> bla
> 
> 
> Best regards
> Dennis
> 
> 
> 
> 
> 



solr 5.5.2 dump threads - threads blocked in org.eclipse.jetty.util.BlockingArrayQueue

2016-09-08 Thread elisabeth benoit
Hello,


We are perf testing Solr 5.5.2 (with a limit test, i.e. sending as many
queries/sec as possible) and we see the CPU never goes over 20%, and
threads are blocked in org.eclipse.jetty.util.BlockingArrayQueue, as we can
see in the Solr admin interface thread dumps:

qtp706277948-757 (757)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2c4a56cb

   - sun.misc.Unsafe.park​(Native Method)
   - java.util.concurrent.locks.LockSupport.parkNanos​(LockSupport.java:215)
   -
   
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos​(AbstractQueuedSynchronizer.java:2078)
   -
   org.eclipse.jetty.util.BlockingArrayQueue.poll​(BlockingArrayQueue.java:389)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll​(QueuedThreadPool.java:531)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool.access$700​(QueuedThreadPool.java:47)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(QueuedThreadPool.java:590)
   - java.lang.Thread.run​(Thread.java:745)


We changed two things in jetty configuration,

maxThreads value in /opt/solr/server/solr/jetty.xml

<Set name="maxThreads"><Property name="solr.jetty.threads.max" default="400"/></Set>
and we activated the request log, i.e. uncommented the lines

<Ref id="Handlers">
  <Call name="addHandler">
    <Arg>
      <New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
        <Set name="requestLog">
          <New id="RequestLogImpl" class="org.eclipse.jetty.server.AsyncNCSARequestLog">
            <Set name="filename">/var/solr/logs/requests.log</Set>
            <Set name="filenameDateFormat">yyyy_MM_dd</Set>
            <Set name="retainDays">90</Set>
            <Set name="append">true</Set>
            <Set name="extended">false</Set>
            <Set name="logCookies">false</Set>
            <Set name="LogTimeZone">UTC</Set>
            <Set name="logLatency">true</Set>
          </New>
        </Set>
      </New>
    </Arg>
  </Call>
</Ref>

in jetty.xml


We had the same result with maxThreads=10000 (the default value in the Solr
install).


Has anyone experienced the same issue with Solr 5?


Best regards,

Elisabeth


StrField with Wildcard Search

2016-09-08 Thread Sandeep Khanzode
Hello,
There are quite a few links that detail the difference between StrField and
TextField, and links that explain that, even though a StrField is indexed, it
is not tokenized and is stored as a single keyword, as can be verified via
the debug analysis in the Solr admin UI and the curl debugQuery options.
What I am unable to understand is how a wildcard works on StrFields. For
example, if the name is "John Doe" and I search for "John*", I get that
match. Which means that somewhere deep within, maybe a trie or dictionary
representation exists that allows this search with a partial string.
I would have assumed that wildcards would match on TextFields, which allow
(Edge)NGramFilters, etc.  -- SRK

Re: [JSON Faceting] Domain filter query

2016-09-08 Thread Alessandro Benedetti
Another solution that jumped to my mind is to use stats:

Given the field product_id to be the collapsing field, for the facet where I
want the collapsed count I can do something like:

{
  brands:{
terms : {  // terms facet creates a bucket for each indexed term in the
field
  field : brand,
  sort : "uniqueProducts desc",
  facet : {
uniqueProducts : "unique(product_id)"
  }
}
  }
}
I will try it out of curiosity!

Cheers

On Thu, Sep 8, 2016 at 9:45 AM, Alessandro Benedetti 
wrote:

> Hi guys,
> was thinking to this problem :
>
> Given a set of flat documents I want to calculate facets on :
> 1) flat results set
> 2) collapsed result set
>
> Specifically, some of my field facets will need to be on the flat result
> set and some will need to be calculated over a collapsed result set
> (you can imagine them collapsed on one specific field).
>
> I know that this ideally can be solved by restructuring the index into a
> nested object model, and I agree that would be a good approach to address
> this problem (as the nesting structure can be useful to solve other
> problems as well).
>
> Let's assume I don't want to change my schema at all.
> Is there a way to specify a domain (the collapse filter query) for some
> facets, while not for others?
> I was taking a look at this: http://yonik.com/facet-domains/ which is
> conceptually similar to what I am describing, but at the moment it does not
> support a flexible use case (taking a random filter query as input).
>
> A naive solution would be to run two separate Solr queries (one collapsed
> and one not).
> But let's explore other ideas!
>
> Cheers
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


[JSON Faceting] Domain filter query

2016-09-08 Thread Alessandro Benedetti
Hi guys,
was thinking to this problem :

Given a set of flat documents I want to calculate facets on :
1) flat results set
2) collapsed result set

Specifically, some of my field facets will need to be on the flat result
set and some will need to be calculated over a collapsed result set
(you can imagine them collapsed on one specific field).

I know that this ideally can be solved by restructuring the index into a
nested object model, and I agree that would be a good approach to address
this problem (as the nesting structure can be useful to solve other
problems as well).

Let's assume I don't want to change my schema at all.
Is there a way to specify a domain (the collapse filter query) for some
facets, while not for others?
I was taking a look at this: http://yonik.com/facet-domains/ which is
conceptually similar to what I am describing, but at the moment it does not
support a flexible use case (taking a random filter query as input).

A naive solution would be to run two separate Solr queries (one collapsed
and one not).
But let's explore other ideas!

Cheers



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


RE: Wrong highlighting in stripped HTML field

2016-09-08 Thread Duck Geraint (ext) GBJH
As far as I can tell, that is how it's currently set-up (does the same on mine 
at least). The HTML Stripper seems to exclude the pre tag, but include the post 
tag when it generates the start and end offsets of each text token. I couldn't 
say why though... (This may just have avoided needing to backtrack).

Play around in the analysis section of the admin ui to verify this.

Geraint


-Original Message-
From: Neumann, Dennis [mailto:neum...@sub.uni-goettingen.de]
Sent: 07 September 2016 18:16
To: solr-user@lucene.apache.org
Subject: AW: Wrong highlighting in stripped HTML field

Hello,
can anyone confirm this behavior of the highlighter? Otherwise my Solr 
installation might be misconfigured or something.
Or does anyone know if this is a known issue? In that case I probably should 
ask on the dev mailing list.

Thanks and cheers,
Dennis



From: Neumann, Dennis [neum...@sub.uni-goettingen.de]
Sent: Monday, 5 September 2016 18:00
To: solr-user@lucene.apache.org
Subject: Wrong highlighting in stripped HTML field

Hi guys

I am having a problem with the standard highlighter. I'm working with Solr 
5.4.1. The problem appears in my project, but it is easy to replicate:

I create a new core with the conf directory from configsets/basic_configs, so 
everything is set to defaults. I add the following in schema.xml:





  


  
  

  



Now I add this document (in the admin interface):

{"id":"1","testfield":"bla"}

I search for: testfield:bla
with hl=on&hl.fl=testfield

What I get is a response with an incorrectly formatted HTML snippet:


  "response": {
"numFound": 1,
"start": 0,
"docs": [
  {
"id": "1",
"testfield": "bla",
"_version_": 1544645963570741200
  }
]
  },
  "highlighting": {
"1": {
  "testfield": [
"bla"
  ]
}
  }

Is there a way to tell the highlighter to just enclose the "bla"? I. e. I want 
to get

bla


Best regards
Dennis







Streaming expression in solr doesnot support collection alias

2016-09-08 Thread Tali Finelt
Hi All,

We saw there is an open issue regarding this subject:
https://issues.apache.org/jira/browse/SOLR-9077

We would very much like to use this feature in our new production version. 

This issue currently prevents us from using streaming. 

We were wondering if there is any plan to fix this soon?
If not, can someone recommend a possible work-around?

Thanks,
Tali