Re: solr-user-subscribe

2017-07-16 Thread Yangrui Guo
unsubscribe

On Friday, July 14, 2017, Naohiko Uramoto  wrote:

> solr-user-subscribe >
>
> --
> Naohiko Uramoto
>


Re: highlighting on child document

2016-11-17 Thread Yangrui Guo
Thanks. Does Solr plan to add highlighting on child documents in the future?

On Thursday, November 17, 2016, vstrugatsky  wrote:

> It appears that highlighting works for fields in the parent documents only.
> https://issues.apache.org/jira/browse/LUCENE-5929 only fixed a bug when
> trying to highlight fields in a parent document when using Block Join
> Parser.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/highlighting-on-child-document-tp4238236p4306375.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Can Solr find related terms in a document

2016-10-17 Thread Yangrui Guo
Looks more like the second case. I want to find patterns between certain
words.

On Monday, October 17, 2016, simon  wrote:

> Do you already have a set of terms for which you would want to find out
> their co-occurence, or are you trying to do data mining, looking in a
> collection for terms which occur together more often than by chance ?
>
>
> On Sun, Oct 16, 2016 at 3:45 AM, Yangrui Guo  > wrote:
>
> > Hello
> >
> > I'm curious to know if Solr can correlate the occurrences of two terms.
> > E.g. if "Bush administration" and "stupid mistake" often appear in the
> same
> > article, then Solr will think that the two terms are related. Is there a
> > way to achieve this?
> >
> > Yangrui
> >
>


Can Solr find related terms in a document

2016-10-16 Thread Yangrui Guo
Hello

I'm curious to know if Solr can correlate the occurrences of two terms.
E.g. if "Bush administration" and "stupid mistake" often appear in the same
article, then Solr will think that the two terms are related. Is there a
way to achieve this?

Yangrui


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-10 Thread Yangrui Guo
Hi, my solution uses multi-valued text fields for storing data objects. It
works best with relational search using natural language. For example,
"car", "automobile", and "vehicle" may denote the same class, but they are
not equivalent in certain contexts. Multi-valued attributes can help the
search engine better interpret the different expressions a user might use to
refer to the same concept.

On Sunday, July 10, 2016, Puneet Pawaia  wrote:

> Hi Yangrui
> We are testing the Rank and Retrieve as well as the NLP interface. However
> this is being done by another team and so I would not be able to comment
> further on it.
> I would like to know what kind of Solr field you are using for storing the
> output from your classes. And also what function output you are putting
> into that field.
> Thanks
> Puneet
>
> On 10 Jul 2016 00:17, "Yangrui Guo" >
> wrote:
>
> > Hi Puneet,
> >
> > I only use Watson's text to speech as user interface, because a lot of
> > people think NLP is the same as voice recognition. If you don't need
> voice
> > recognition you could remove Watson from it. Stanford has better
> dependency
> > parsing and can be used offline. However it seems you are using Watson's
> > retrieve and rank API, which is based on Solr, am I correct?
> >
> > Yangrui
> >
> > On Saturday, July 9, 2016, Puneet Pawaia  > wrote:
> >
> > > Hi Yangrui,
> > >
> > > I have been looking at your code for squery.
> > > Unfortunately, I am not very conversant with SolrJ.  I seem to be
> missing
> > > how and what data is added to the Solr index.
> > > Also, I see some references to IBM Watson in your code. Are you using
> IBM
> > > Watson? If yes, then why use the Stanford NLP if you can use the Watson
> > > NLP?
> > >
> > > Regards
> > > Puneet
> > >
> > >
> > > On Sat, Jul 9, 2016 at 11:37 AM, Puneet Pawaia <
> puneet.paw...@gmail.com 
> > > >
> > > wrote:
> > >
> > > > Hi Alessandro
> > > >
> > > > I am looking at being able to answer questions like "Can a
> non-compete
> > > > clause in an employment agreement be enforced after the expiry of the
> > > > agreement?"
> > > > We are doing some testing with IBM Watson and with a sample test
> data,
> > we
> > > > are able to get relevant replies to the above question. Since IBM
> > Watson
> > > > uses Solr at its backend, I was wondering if we can get the same
> > working
> > > at
> > > > the Solr level without having to use Watson.
> > > >
> > > > Regards
> > > > Puneet
> > > >
> > > > On Sat, Jul 9, 2016 at 11:34 AM, Puneet Pawaia <
> > puneet.paw...@gmail.com 
> > > >
> > > > wrote:
> > > >
> > > >> Hi Alessandro
> > > >>
> > > >> I am looking at being able to answer questions like "Can a
> non-compete
> > > >> clause in an employment agreement be enforced after the expiry of
> the
> > > >> agreement?"
> > > >>
> > > >> On Sat, Jul 9, 2016 at 4:34 AM, Alessandro Benedetti <
> > > >> abenede...@apache.org  > wrote:
> > > >>
> > > >>> Hi Puneet,
> > > >>> your requirement :
> > > >>> "I would like users to be able to write queries in natural language
> > > >>> rather
> > > >>> than keyword based search."
> > > >>>
> > > >>> Is really really vague :(
> > > >>> Can you try to help us with some specific example, starting of
> course
> > > >>> from
> > > >>> the simplest use cases you have initially in mind ?
> > > >>>
> > > >>> Moving from keyword based search to natural language is a really
> > > complex
> > > >>> task.
> > > >>> Proceeding step by step can help you.
> > > >>>
> > > >>> Do you want for example to set up a Q&A basic system ?
> > > >>> In that case you should take care of query rewriting.
> > > >>> You need basically to identify your base requirement and then
> build a
> > > >>> specific parser for that.
> > > >>> You can use triple stores and knowledge bases to enrich both your
> > > >>> query and your index, but let's start from the basis, what is your
> > > >>> simplest requirement ?

Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-09 Thread Yangrui Guo
Hi Puneet,

I only use Watson's text to speech as a user interface, because a lot of
people think NLP is the same as voice recognition. If you don't need voice
recognition you could remove Watson from it. Stanford has better dependency
parsing and can be used offline. However, it seems you are using Watson's
Retrieve and Rank API, which is based on Solr, am I correct?

Yangrui

On Saturday, July 9, 2016, Puneet Pawaia  wrote:

> Hi Yangrui,
>
> I have been looking at your code for squery.
> Unfortunately, I am not very conversant with SolrJ.  I seem to be missing
> how and what data is added to the Solr index.
> Also, I see some references to IBM Watson in your code. Are you using IBM
> Watson? If yes, then why use the Stanford NLP if you can use the Watson
> NLP?
>
> Regards
> Puneet
>
>
> On Sat, Jul 9, 2016 at 11:37 AM, Puneet Pawaia  >
> wrote:
>
> > Hi Alessandro
> >
> > I am looking at being able to answer questions like "Can a non-compete
> > clause in an employment agreement be enforced after the expiry of the
> > agreement?"
> > We are doing some testing with IBM Watson and with a sample test data, we
> > are able to get relevant replies to the above question. Since IBM Watson
> > uses Solr at its backend, I was wondering if we can get the same working
> at
> > the Solr level without having to use Watson.
> >
> > Regards
> > Puneet
> >
> > On Sat, Jul 9, 2016 at 11:34 AM, Puneet Pawaia  >
> > wrote:
> >
> >> Hi Alessandro
> >>
> >> I am looking at being able to answer questions like "Can a non-compete
> >> clause in an employment agreement be enforced after the expiry of the
> >> agreement?"
> >>
> >> On Sat, Jul 9, 2016 at 4:34 AM, Alessandro Benedetti <
> >> abenede...@apache.org > wrote:
> >>
> >>> Hi Puneet,
> >>> your requirement :
> >>> "I would like users to be able to write queries in natural language
> >>> rather
> >>> than keyword based search."
> >>>
> >>> Is really really vague :(
> >>> Can you try to help us with some specific example, starting of course
> >>> from
> >>> the simplest use cases you have initially in mind ?
> >>>
> >>> Moving from keyword based search to natural language is a really
> complex
> >>> task.
> >>> Proceeding step by step can help you.
> >>>
> >>> Do you want for example to set up a Q&A basic system ?
> >>> In that case you should take care of query rewriting.
> >>> You need basically to identify your base requirement and then build a
> >>> specific parser for that.
> >>> You can use triple stores and knowledge bases to enrich both your query
> >>> and
> >>> your index, but let's start from the basis, what is your simplest
> >>> requirement ?
> >>>
> >>> On Fri, Jul 8, 2016 at 1:56 PM, Jay Urbain  > wrote:
> >>>
> >>> > I've added multivalued fields within my SOLR schema for indexing
> >>> entities
> >>> > extracted using NLP methods applied to the text I'm indexing, along
> >>> with
> >>> > fields for other discrete data extracted from relational databases.
> >>> >
> >>> > A Java application reads data out of multiple relational databases,
> >>> uses
> >>> > NLP on the text and indexes each document (de-normalized) using
> SOLRJ.
> >>> >
> >>> > I initially tried doing this with content handlers, but found it much
> >>> > easier to just write a Java application.
> >>> >
> >>> > SOLRJ Java API reference:
> >>> > https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
> >>> >
> >>> > Stanford NLP:
> >>> > http://stanfordnlp.github.io/CoreNLP/
> >>> >
> >>> > Best,
> >>> > Jay
> >>> >
> >>> >
> >>> > On Thu, Jul 7, 2016 at 9:52 PM, Puneet Pawaia <
> puneet.paw...@gmail.com 
> >>> >
> >>> > wrote:
> >>> >
> >>> > > Hi Jay
> >>> > > Any place I can learn more on this method of integration?
> >>> > > Thanks
> >>> > > Puneet
> >>> > >
> >>> > > On 8 Jul 2016 02:58, "Jay Urbain"  > wrote:
> >>> > >
> >>> > > > I use Stanford NLP and cTakes (based on OpenNLP) while indexing
> >>> with a
> >>> > > > SOLRJ application.
> >>> > > >
> >>> > > > Best,
> >>> > > > Jay
> >>> > > >
> >>> > > > On Thu, Jul 7, 2016 at 12:09 PM, Puneet Pawaia <
> >>> > puneet.paw...@gmail.com >
> >>> > > > wrote:
> >>> > > >
> >>> > > > > Hi
> >>> > > > >
> >>> > > > > I am currently using Solr 5.5.x to test but can upgrade to Solr
> >>> 6.x
> >>> > if
> >>> > > > > required.
> >>> > > > > I am working on a POC for natural language query using Solr.
> >>> Should I
> >>> > > use
> >>> > > > > the Stanford libraries or are there any other libraries having
> >>> > > > integration
> >>> > > > > with Solr already available.
> >>> > > > > Any direction in how to do this would be most appreciated. How
> >>> > should I
> >>> > > > > process the query to give relevant results.
> >>> > > > >
> >>> > > > > Regards
> >>> > > > > Puneet
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> --
> >>>
> >>> Benedetti Alessandro
> >>> Visiting card : http://about.me/alessandro_benedetti
> >>>
> >>> "Tyger, tyger burning bright
> >>> In the forests of the night,

Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
My solution lets users retrieve data entities using queries like "find me a
job that only requires a high school degree" and "I want a car from America
with alloy wheels". It can also be expanded to perform other database
queries, like date-time or price-range searches. I use Stanford NLP to
identify the main entity and its related attributes in a user sentence.

Yangrui

On Thursday, July 7, 2016, Puneet Pawaia  wrote:

> Hi  Yangrui
> I would like users to be able to write queries in natural language rather
> than keyword based search.
> A link to your solution would be worth looking at.
> Regards
> Puneet
>
> On 8 Jul 2016 03:02, "Yangrui Guo" >
> wrote:
>
> What is your NLP search like? I have a NLP solution for Solr and just open
> sourced it. Not sure if it fits your need
>
> Yangrui
>
> On Thursday, July 7, 2016, Puneet Pawaia  > wrote:
>
> > Hi
> >
> > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > required.
> > I am working on a POC for natural language query using Solr. Should I use
> > the Stanford libraries or are there any other libraries having
> integration
> > with Solr already available.
> > Any direction in how to do this would be most appreciated. How should I
> > process the query to give relevant results.
> >
> > Regards
> > Puneet
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
https://github.com/guoyangrui/squery

It's not well documented yet, but the idea is simple. Users should first
format their database tables into triples by creating views; Solr and
Stanford NLP then handle the data retrieval part. I hope someone can
continue contributing to its development.

Yangrui

On Thursday, July 7, 2016, John Blythe  wrote:

> can you share a link, i'd be interested in checking it out.
>
> thanks-
>
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com 
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>
> On Thu, Jul 7, 2016 at 4:32 PM, Yangrui Guo  > wrote:
>
> > What is your NLP search like? I have a NLP solution for Solr and just
> open
> > sourced it. Not sure if it fits your need
> >
> > Yangrui
> >
> > On Thursday, July 7, 2016, Puneet Pawaia  > wrote:
> >
> > > Hi
> > >
> > > I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> > > required.
> > > I am working on a POC for natural language query using Solr. Should I
> use
> > > the Stanford libraries or are there any other libraries having
> > integration
> > > with Solr already available.
> > > Any direction in how to do this would be most appreciated. How should I
> > > process the query to give relevant results.
> > >
> > > Regards
> > > Puneet
> > >
> >
>


Re: Integrating Stanford NLP or any other NLP for Natural Language Query

2016-07-07 Thread Yangrui Guo
What is your NLP search like? I have an NLP solution for Solr and just
open-sourced it. Not sure if it fits your needs.

Yangrui

On Thursday, July 7, 2016, Puneet Pawaia  wrote:

> Hi
>
> I am currently using Solr 5.5.x to test but can upgrade to Solr 6.x if
> required.
> I am working on a POC for natural language query using Solr. Should I use
> the Stanford libraries or are there any other libraries having integration
> with Solr already available.
> Any direction in how to do this would be most appreciated. How should I
> process the query to give relevant results.
>
> Regards
> Puneet
>


Re: Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
I've finally solved this problem. It appears that I do not need to add the
line domain: blockChildren: content_type:c in the subfacet. Now I've got my
desired results.
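
For reference, a sketch (in SolrJ) of the corrected request this implies,
keeping the field names from the query quoted below; the main query,
collection name, and exact JSON syntax are assumptions, not the thread's
confirmed answer.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CorrectedChildFacet {
    public static void main(String[] args) throws Exception {
        // Collection URL is a placeholder.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");

        SolrQuery q = new SolrQuery("*:*");   // substitute the real main query
        q.setRows(0);
        q.setParam("json.facet",
            "{ apparels: {"
          + "    type: terms, field: brand,"
          + "    domain: { blockChildren: \"content_type:p\" },"  // domain only on the outer facet
          + "    facet: { values: {"
          + "      type: query, q: \"brand:Chanel\","             // no inner domain line here
          + "      facet: { madein: { type: terms, field: madein } } } } } }");

        // The terms for "madein" now come back under facets/apparels/buckets/.../madein/buckets.
        System.out.println(client.query(q).getResponse().get("facets"));
        client.close();
    }
}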

On Tue, Apr 26, 2016 at 3:14 PM, Yangrui Guo  wrote:

> The documents are organized in a key-value like structure
>
> {
>  id: 1
>  product_name: some apparel
>  category: apparel
>  {
>   attribute: brand
>   value: Chanel
>   }
>   {
>attribute: madein
>value: Europe
>}
> }
>
> Because there are indefinite numbers of attributes associated with the
> products, I used this structure to store the document. My intention is to
> show facets of the value when an attribute facet is chosen. For example, if
> you choose "brand" then it'll show "Chanel", "Dior", etc. Is this currently
> possible?
>
> Yangrui Guo
>
>
> On Tuesday, April 26, 2016, Yonik Seeley  wrote:
>
>> How are the documents indexed?  Can you show an example document (with
>> nested documents)?
>> -Yonik
>>
>>
>> On Tue, Apr 26, 2016 at 5:08 PM, Yangrui Guo 
>> wrote:
>> >  When I use subfaceting with Json API, the facet results only gave me
>> > counts, no terms. My query is like this:
>> >
>> > {
>> > apparels : {
>> > type: terms,
>> > field: brand,
>> > facet:{
>> >   values:{
>> >   type: query,
>> >   q:\"brand:Chanel\",
>> >   facet: {
>> > type: terms,
>> > field: madein
>> >   }
>> >   domain: { blockChildren : \"content_type:p\" }
>> >   }
>> > },
>> > domain: { blockChildren : \"content_type:p\" }
>> > }
>> > }
>> > }
>> >
>> > And this is the results that I got:
>> >
>> > facets={
>> > count=57477,
>> > apparels={
>> > buckets=
>> > {
>> > val=Chanel,
>> > count=6,
>> > madein={
>> > count=6
>> >
>> > buckets={}
>> > }
>> > }
>> > }
>> > }
>> >
>> > The second buckets got zero results but the count was correct. What was
>> I
>> > missing? Thanks so much!
>>
>


Re: Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
The documents are organized in a key-value like structure

{
 id: 1
 product_name: some apparel
 category: apparel
 {
  attribute: brand
  value: Chanel
  }
  {
   attribute: madein
   value: Europe
   }
}

Because there are indefinite numbers of attributes associated with the
products, I used this structure to store the document. My intention is to
show facets of the value when an attribute facet is chosen. For example, if
you choose "brand" then it'll show "Chanel", "Dior", etc. Is this currently
possible?

Yangrui Guo


On Tuesday, April 26, 2016, Yonik Seeley  wrote:

> How are the documents indexed?  Can you show an example document (with
> nested documents)?
> -Yonik
>
>
> On Tue, Apr 26, 2016 at 5:08 PM, Yangrui Guo  > wrote:
> >  When I use subfaceting with Json API, the facet results only gave me
> > counts, no terms. My query is like this:
> >
> > {
> > apparels : {
> > type: terms,
> > field: brand,
> > facet:{
> >   values:{
> >   type: query,
> >   q:\"brand:Chanel\",
> >   facet: {
> > type: terms,
> > field: madein
> >   }
> >   domain: { blockChildren : \"content_type:p\" }
> >   }
> > },
> > domain: { blockChildren : \"content_type:p\" }
> > }
> > }
> > }
> >
> > And this is the results that I got:
> >
> > facets={
> > count=57477,
> > apparels={
> > buckets=
> > {
> > val=Chanel,
> > count=6,
> > madein={
> > count=6
> >
> > buckets={}
> > }
> > }
> > }
> > }
> >
> > The second buckets got zero results but the count was correct. What was I
> > missing? Thanks so much!
>


Child doc facet not getting terms, only counts

2016-04-26 Thread Yangrui Guo
 When I use subfaceting with Json API, the facet results only gave me
counts, no terms. My query is like this:

{
apparels : {
type: terms,
field: brand,
facet:{
  values:{
  type: query,
  q:\"brand:Chanel\",
  facet: {
type: terms,
field: madein
  }
  domain: { blockChildren : \"content_type:p\" }
  }
},
domain: { blockChildren : \"content_type:p\" }
}
}
}

And this is the results that I got:

facets={
count=57477,
apparels={
buckets=
{
val=Chanel,
count=6,
madein={
count=6

buckets={}
}
}
}
}

The second buckets got zero results but the count was correct. What was I
missing? Thanks so much!


how to retrieve json facet using solrj

2016-04-24 Thread Yangrui Guo
Hello

I use the JSON Facet API to get facets. The response returned with facets and
counts. However, when I called the getFacetFields method in the SolrJ client,
I got null results. How can I get the facet results from SolrJ? I set the
json.facet parameter on my query as query.setParam("json.facet", "{entities :
{type: terms, field: class2} }"). Am I missing something? Thanks.

Yangrui
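
For reference, a minimal SolrJ sketch of one way to read json.facet output:
the JSON facets are not exposed through getFacetFields(), but they come back
under the top-level "facets" key of the raw response, which can be walked as
a NamedList. The client construction below is the SolrJ 5.x/6.x style and the
collection URL is a placeholder.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

import java.util.List;

public class JsonFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        // Same json.facet request as in the message above.
        query.setParam("json.facet", "{entities : {type: terms, field: class2}}");

        QueryResponse rsp = client.query(query);

        // json.facet results live under the top-level "facets" key,
        // not in getFacetFields(), so read them from the raw NamedList.
        NamedList<Object> facets = (NamedList<Object>) rsp.getResponse().get("facets");
        NamedList<Object> entities = (NamedList<Object>) facets.get("entities");
        List<NamedList<Object>> buckets = (List<NamedList<Object>>) entities.get("buckets");
        for (NamedList<Object> bucket : buckets) {
            System.out.println(bucket.get("val") + " -> " + bucket.get("count"));
        }
        client.close();
    }
}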


Re: pivoting with json facet api

2016-04-21 Thread Yangrui Guo
Thanks so much! Are you also contributing to Solr development?

On Thu, Apr 21, 2016 at 3:33 PM, Alisa Z.  wrote:

>  Hi Yangrui,
>
> I have summarized some experiments about Solr nesting capabilities
> (however, it does not include precisely pivoting yet more of faceting up to
> parents and down to children with some statictics) so maybe you could find
> an idea there:
>
>
> https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a#.dbxdv3zdp
>
>
> Please, let me know if it were useful in comments. You could also specify
> your problem a bit more if you don't find the answer.
>
> Cheers,
> Alisa
>
>
>
>Thursday, April 21, 2016, 1:01 -04:00 from Yangrui Guo  >:
> >
> >Hi
> >
> >I am trying to facet results on my nest documents. The solr document did
> >not say much on how to pivot with json api with nest documents. Could
> >someone show me some examples? Thanks very much.
> >
> >Yangrui
>
>


pivoting with json facet api

2016-04-20 Thread Yangrui Guo
Hi

I am trying to facet results on my nest documents. The solr document did
not say much on how to pivot with json api with nest documents. Could
someone show me some examples? Thanks very much.

Yangrui


Re: how to restrict phrase to appear in same child document

2016-04-20 Thread Yangrui Guo
Hi, thanks for answering. My problem is that users do not indicate which
field a color belongs to in the query. For example, in "which black driver
has a white mercedes" it is difficult to tell which color belongs to which
field, because there can be thousands of car brands and professions. Is there
any way to achieve the feature I stated before?

On Wednesday, April 20, 2016, Alisa Z.  wrote:

>  Yangrui,
>
> First, have you indexed your documents with proper nested document
> structure [
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]?
> From the peice of data you showed, it seems that you just put it right as
> it is and it all got flattened.
>
> Then, you'll probably want to introduce a distinguishing
> "type"/"category"/"path" fields into your data, so it would look like this:
>
> {
> type:top
> id:
> {
> type:car_color
> car:
> color:
> }
> {
>   type:driver_color
> driver:
> color:
> }
> }
>
>
> >Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo  >:
> >
> >hello
> >
> >I have a nested document type in my index. Here's the structure of my
> >document:
> >
> >{
> >id:
> >{
> >car:
> >color:
> >}
> >{
> >driver:
> >color:
> >}
> >}
> >
> >However, when I use the query q={!parent
> >which="content_type:parent"}+(black AND driver)&fq={!parent
> >which="content_type:parent"}+(white AND mercedes), the result also
> >contained white driver with black mercedes. I know I can put fields before
> >terms but it is not always easy to do this. Users might just enter one
> >string. How can I modify my query to require that the terms between two
> >parentheses must appear in the same child document, or boost those meet
> the
> >criteria? Thanks
>
>
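
Below is a minimal SolrJ sketch of Alisa's suggestion above: index each
car/color and driver/color pair as its own child document with a type field,
and then give every group of terms its own {!parent ...} clause so the terms
have to co-occur in one child. Field names, ids, and the collection URL are
assumptions, not values from the thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NestedCarDriverExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/cars");

        // Parent document with one child per car/color and driver/color pair.
        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "1");
        parent.addField("content_type", "parent");

        SolrInputDocument car = new SolrInputDocument();
        car.addField("id", "1-car");
        car.addField("type", "car_color");
        car.addField("car", "mercedes");
        car.addField("color", "white");
        parent.addChildDocument(car);

        SolrInputDocument driver = new SolrInputDocument();
        driver.addField("id", "1-driver");
        driver.addField("type", "driver_color");
        driver.addField("driver", "driver");
        driver.addField("color", "black");
        parent.addChildDocument(driver);

        client.add(parent);
        client.commit();

        // Each {!parent ...} clause is satisfied only if a single child document
        // matches all the terms inside it, so "black driver" and "white mercedes"
        // cannot be mixed across different children of the same parent.
        SolrQuery q = new SolrQuery(
            "{!parent which=\"content_type:parent\"}+color:black +type:driver_color");
        q.addFilterQuery(
            "{!parent which=\"content_type:parent\"}+color:white +car:mercedes +type:car_color");
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}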


how to restrict phrase to appear in same child document

2016-04-19 Thread Yangrui Guo
hello

I have a nested document type in my index. Here's the structure of my
document:

{
id:
{
car:
color:
}
{
driver:
color:
}
}

However, when I use the query q={!parent
which="content_type:parent"}+(black AND driver)&fq={!parent
which="content_type:parent"}+(white AND mercedes), the result also
contained white driver with black mercedes. I know I can put fields before
terms but it is not always easy to do this. Users might just enter one
string. How can I modify my query to require that the terms between two
parentheses must appear in the same child document, or boost those meet the
criteria? Thanks


Re: Multiple data-config.xml in one collection?

2016-04-06 Thread Yangrui Guo
Yes, URL length is also one of my concerns. If, say, I have a million
collections, must I specify all the collection names in the request to
perform a search across all of them? The reason I want to combine the data
configs into a single node is that I feel it is impractical to search a large
number of collections.

On Wednesday, April 6, 2016, Alexandre Rafalovitch 
wrote:

> I believe the config request for DIH is read on every import, so it is
> entirely possible to just have one handler and pass the parameter for
> which specific file to use as the configuration.
>
> It is also possible to actually pass the full configuration as a URL
> parameter dataConfig. Need to watch out for the URL length though if
> using GET request.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 6 April 2016 at 00:12, Yangrui Guo >
> wrote:
> > Hello
> >
> > I'm using Solr Cloud to index a number of databases. The problem is there
> > is unknown number of databases and each database has its own
> configuration.
> > If I create a single collection for every database the query would
> > eventually become insanely long. Is it possible to upload different
> config
> > to zookeeper for each node in a single collection?
> >
> > Best,
> >
> > Yangrui Guo
>


Re: Multiple data-config.xml in one collection?

2016-04-05 Thread Yangrui Guo
Thanks man. I'd love to learn more about the Talend OpenStudio project
you're working on. Is it based on Lucene/Solr or a different project?

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] 
wrote:

> Yangrui,
>
> Let me clarify - to have multiple data imports run concurrently, my
> impression is that you must have different requestHandlers declared in your
> solrconfig.xml
> By default, Data Import Handler is not multi-threaded; having multiple
> requestHandlers for it is a workaround to this, not a fix.
>
> I also have to say that I'm trying in newer projects to work with Talend
> OpenStudio to do the database queries and push data to Solr.  Talend
> OpenStudio allows the same sort of transformations as possible in Data
> Import Handler, and seems to me more independent of SolrCloud than Data
> Import Handler.  There are many different ways to do it.
>
> -Original Message-
> From: Davis, Daniel (NIH/NLM) [C]
> Sent: Tuesday, April 05, 2016 5:40 PM
> To: solr-user@lucene.apache.org 
> Subject: RE: Multiple data-config.xml in one collection?
>
> Yangrui,
>
> Solr will just do one data import.You can have a script invoke more
> than one, and they will run concurrently.   There are some risks with that,
> depending on what you are doing.   If it's just pulling from a database, I
> think you are all right.   I've even had 4 run concurrently to make Data
> Import Handler be "multi-threaded".   My query in one case looks like this:
>
> SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> medplus.public_topic_sites_us_v t) WHERE threadid = 0
>
> And then I have 3 other queries in other DIH configurations for threadid
> 1,2,3.
>
> You also have to be careful with the clean parameter - unless a specific
> delete query is specified using the "preImportDeleteQuery" or
> "postImportDeleteQuery", then the clean parameter will cause DIH will
> remove the index data from all data import handlers even though you are
> only refreshing one.   If you configure it carefully, it all works however.
>
> These are the use cases for the "source" field I use:
>
> - Filter only on documents from one source for the user, by specifying
> fq=source:health-topics in the query to Solr.
> - Filter only documents from one source in backend processing, for
> instance for the preImportDeleteQuery.
> - Do something different in the application that front-ends Solr depending
> on the "source" field value.
>
> There are some impacts on relevancy from combining them into one
> collection:
>
> When you combine multiple sources into one collection, whether using DIH
> or some other mechanism, you have to remember that the relevancy
> calculations of Solr include documents from both sources.   Even if
> documents having different "source" documents are queried independently
> (through filter queries, such as fq:source=health-topics, the frequency of
> a word in the entire collection is a factor.
>
> However, you can query them together, even if you have to carefully tune
> weighting of the documents so that a large corpus doesn't dwarf a small one
> (unless it is appropriate).   As always, relevancy gets pretty tricky.
>
> Hope this helps,
>
> Dan Davis
>
> -Original Message-
> From: Yangrui Guo [mailto:guoyang...@gmail.com ]
> Sent: Tuesday, April 05, 2016 3:16 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Multiple data-config.xml in one collection?
>
> Hi Daniel,
>
> So if I implement multiple dataimporthandler and do a full import, does
> Solr perform import of all handlers at once or can just specify which
> handler to import? Thank you
>
> Yangrui
>
> On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <
> daniel.da...@nih.gov >
> wrote:
>
> > If Shawn is correct, and you are using DIH, then I have done this by
> > implementing multiple requestHandlers each of them using Data Import
> > Handler, and have each specify a different XML file for the data config.
> > Instead of using data-config.xml, I've used a large number of files such
> as:
> > health-topics-conf.xml
> > encyclopedia-conf.xml
> > ...
> > I tend to index a single valued, required field named "source" that I
> > can use in the delete query, and I use the TemplateTranformer to make
> this easy:
> >
> >  > ...
> >transformer="TemplateTransformer">
> >
> >...
> >
> > Hope this helps,
> >
> > -Dan
> >
> > -Original Message-
> > From: Shawn Heisey [mailto:apa...@elyograg.org 
> 

Re: Multiple data-config.xml in one collection?

2016-04-05 Thread Yangrui Guo
Hi Daniel,

So if I implement multiple dataimporthandler and do a full import, does
Solr perform import of all handlers at once or can just specify which
handler to import? Thank you

Yangrui

On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] 
wrote:

> If Shawn is correct, and you are using DIH, then I have done this by
> implementing multiple requestHandlers each of them using Data Import
> Handler, and have each specify a different XML file for the data config.
> Instead of using data-config.xml, I've used a large number of files such as:
> health-topics-conf.xml
> encyclopedia-conf.xml
> ...
> I tend to index a single valued, required field named "source" that I can
> use in the delete query, and I use the TemplateTranformer to make this easy:
>
>  ...
>transformer="TemplateTransformer">
>
>...
>
> Hope this helps,
>
> -Dan
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org ]
> Sent: Tuesday, April 05, 2016 10:50 AM
> To: solr-user@lucene.apache.org 
> Subject: Re: Multiple data-config.xml in one collection?
>
> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is
> > there is unknown number of databases and each database has its own
> configuration.
> > If I create a single collection for every database the query would
> > eventually become insanely long. Is it possible to upload different
> > config to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same configuration,
> which it gets from zookeeper.  This is one of SolrCloud's guarantees, to
> prevent problems found with old-style sharding when the configuration is
> different on each machine.
>
> If you're using the dataimport handler, which you probably are since you
> mentioned databases, you can parameterize pretty much everything in the DIH
> config file so it comes from URL parameters on the full-import or
> delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  It
> should give you some idea of how to use variables in your config, set by
> parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>


Re: Multiple data-config.xml in one collection?

2016-04-05 Thread Yangrui Guo
Hi thanks for the answer. Yes I will be using DIH to import data from
different database connections. Do I have to create a collection for each
connection?

On Tuesday, April 5, 2016, Shawn Heisey  wrote:

> On 4/5/2016 8:12 AM, Yangrui Guo wrote:
> > I'm using Solr Cloud to index a number of databases. The problem is there
> > is unknown number of databases and each database has its own
> configuration.
> > If I create a single collection for every database the query would
> > eventually become insanely long. Is it possible to upload different
> config
> > to zookeeper for each node in a single collection?
>
> Every shard replica (core) in a collection shares the same
> configuration, which it gets from zookeeper.  This is one of SolrCloud's
> guarantees, to prevent problems found with old-style sharding when the
> configuration is different on each machine.
>
> If you're using the dataimport handler, which you probably are since you
> mentioned databases, you can parameterize pretty much everything in the
> DIH config file so it comes from URL parameters on the full-import or
> delta-import command.
>
> Below is a link to the DIH config that I'm using, redacted slightly.
> I'm not running SolrCloud, but the same thing should work in cloud.  It
> should give you some idea of how to use variables in your config, set by
> parameters on the URL.
>
> http://apaste.info/jtq
>
> Thanks,
> Shawn
>
>
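
A small SolrJ sketch of kicking off such a parameterized import, following
Shawn's suggestion above. The handler path, collection URL, and the jdbcUrl
parameter name are placeholders; whatever request parameters you pass can be
referenced in the DIH config as ${dataimporter.request.<name>}.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihImportExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/dataimport");   // the DIH request handler registered in solrconfig.xml
        params.set("command", "full-import");
        params.set("clean", "false");      // be careful with clean when several sources share one index
        // Anything passed here can be read in the DIH config as
        // ${dataimporter.request.jdbcUrl} (the parameter name is an example, not a built-in).
        params.set("jdbcUrl", "jdbc:mysql://localhost:3306/db1");

        client.request(new QueryRequest(params));
        client.close();
    }
}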


Multiple data-config.xml in one collection?

2016-04-05 Thread Yangrui Guo
Hello

I'm using SolrCloud to index a number of databases. The problem is that there
is an unknown number of databases and each database has its own
configuration. If I create a separate collection for every database, the
query would eventually become insanely long. Is it possible to upload a
different config to ZooKeeper for each node in a single collection?

Best,

Yangrui Guo


Re: Partial sentence match with block join

2015-12-16 Thread Yangrui Guo
For example:

If company A is { name:"Apple Inc", location:"Los Alamos"} and company B is
{ name:"Banana Inc", location:"Los Angeles"} then if you only want to
retrieve company A you must use "Apple AND Inc AND Los AND Alamos"},
otherwise it will also retrieve company B. However if you use AND for all
terms then partial match wouldn't be possible. This seems to be
contradictory.

On Tuesday, December 15, 2015, Upayavira  wrote:

>
> Cab you give an example? I cannot understand what you mean from your
> description below.
>
> Thx!
>
> On Wed, Dec 16, 2015, at 12:42 AM, Yangrui Guo wrote:
> > This will be a very common situation. Amazon and Google now display
> > keywords missing in the document. However it seems that Solr parent-child
> > structure requires to use "AND" to confine all terms appear inside a
> > single
> > child document, otherwise it will totally disregard the parent-child
> > structure. Is there a way to achieve this?
> >
> > On Tuesday, December 15, 2015, Jack Krupansky  >
> > wrote:
> >
> > > Set the default operator to OR and optionally set the mm parameter to
> 2 to
> > > require at least two of the query terms to match, and don't quote the
> terms
> > > as a phrase unless you want an exact (optionally sloppy) match.
> > >
> > > Interesting example since I'll bet there are a lot of us who still
> think of
> > > the company as being named "Apple Computer" even though they dropped
> > > "Computer" from the name back in 2007. Also, it is "Inc.", not
> "Company",
> > > so a proper search would be for "Apple Inc." or the old "Apple
> Computer,
> > > Inc."
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Dec 15, 2015 at 2:35 AM, Yangrui Guo  
> > > > wrote:
> > >
> > > > Hello
> > > >
> > > > I've been using 5.3.1. I would like to enable this feature: when user
> > > > enters a query, the results should include documents that also
> partially
> > > > match the query. For example, the document is Apple
> Company
> > > > and user query is "apple computer company". Though the document is
> > > missing
> > > > the term "computer". I've tried phrase slop but it doesn't seem to be
> > > > working with block join. How can I do this in solr?
> > > >
> > > > Thanks
> > > >
> > > > Yangrui
> > > >
> > >
>


Re: Partial sentence match with block join

2015-12-15 Thread Yangrui Guo
This will be a very common situation. Amazon and Google now show which
keywords are missing from a matched document. However, it seems that Solr's
parent-child structure requires using "AND" to confine all terms to a single
child document; otherwise it totally disregards the parent-child structure.
Is there a way to achieve this?

On Tuesday, December 15, 2015, Jack Krupansky 
wrote:

> Set the default operator to OR and optionally set the mm parameter to 2 to
> require at least two of the query terms to match, and don't quote the terms
> as a phrase unless you want an exact (optionally sloppy) match.
>
> Interesting example since I'll bet there are a lot of us who still think of
> the company as being named "Apple Computer" even though they dropped
> "Computer" from the name back in 2007. Also, it is "Inc.", not "Company",
> so a proper search would be for "Apple Inc." or the old "Apple Computer,
> Inc."
>
>
> -- Jack Krupansky
>
> On Tue, Dec 15, 2015 at 2:35 AM, Yangrui Guo  > wrote:
>
> > Hello
> >
> > I've been using 5.3.1. I would like to enable this feature: when user
> > enters a query, the results should include documents that also partially
> > match the query. For example, the document is Apple Company
> > and user query is "apple computer company". Though the document is
> missing
> > the term "computer". I've tried phrase slop but it doesn't seem to be
> > working with block join. How can I do this in solr?
> >
> > Thanks
> >
> > Yangrui
> >
>
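
One way to combine Jack's mm suggestion with a block join, sketched in SolrJ:
keep the {!parent ...} wrapper, but make the wrapped child query an edismax
query with mm, so a child only has to match, say, two of the three terms. The
field name, mm value, and collection URL below are assumptions.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PartialBlockJoinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/companies");

        SolrQuery q = new SolrQuery();
        // The parent parser takes its wrapped query from the childq parameter (v=$childq),
        // and the child query is edismax with a minimum-should-match of 2, so
        // "apple computer company" still matches a child containing only "apple" and "company".
        q.setQuery("{!parent which=\"content_type:parent\" v=$childq}");
        q.set("childq", "{!edismax qf=name mm=2}apple computer company");

        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}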


Partial sentence match with block join

2015-12-14 Thread Yangrui Guo
Hello

I've been using 5.3.1. I would like to enable this feature: when a user
enters a query, the results should include documents that only partially
match the query. For example, the document is Apple Company and the user
query is "apple computer company", even though the document is missing the
term "computer". I've tried phrase slop but it doesn't seem to work with
block join. How can I do this in Solr?

Thanks

Yangrui


Re: child document faceting returning empty buckets

2015-11-09 Thread Yangrui Guo
Just solved the problem by changing blockChildren:"content_type:children"
to blockParent:"content_type:children". Does Solrj support json faceting as
well?

Yangrui

On Mon, Nov 9, 2015 at 2:39 PM, Yangrui Guo  wrote:

> Hello
>
> I followed Yonik's blog regarding faceting on child document and my curl
> command is posted below:
>
> curl http://localhost:8983/solr/movie_shard1_replica1/query -d '
> q={!parent which="content_type:parent"}+movie&
> json.facet={
> movies:{
> type:terms,
> field:actor,
> domain:{blockChildren:"content_type:children"}
> }
> }'
>
> But I got an empty list of buckets from the response. The count number was
> equivalent to number of parent docs. Is there anything wrong with my query?
>
>  "facets":{
> "count":2412762,
> "movies":{
>   "buckets":[]}}}
>
> Yangrui Guo
>


child document faceting returning empty buckets

2015-11-09 Thread Yangrui Guo
Hello

I followed Yonik's blog regarding faceting on child document and my curl
command is posted below:

curl http://localhost:8983/solr/movie_shard1_replica1/query -d '
q={!parent which="content_type:parent"}+movie&
json.facet={
movies:{
type:terms,
field:actor,
domain:{blockChildren:"content_type:children"}
}
}'

But I got an empty list of buckets from the response. The count number was
equivalent to number of parent docs. Is there anything wrong with my query?

 "facets":{
"count":2412762,
"movies":{
  "buckets":[]}}}

Yangrui Guo


SqlEntityProcessor is too unstable

2015-11-09 Thread Yangrui Guo
Hello

I've been trying to index IMDB data from MySQL with no success yet. The
problem is with the data import handler. When I specify "SqlEntityProcessor",
DIH either skips rows entirely, doesn't start importing at all, or produces
results that are not searchable. I also tried setting batchSize to -1, but
the result count was less than the row count in MySQL. I checked the memory
in use, but it was far less than the entire heap. Has anyone been in this
situation before?

Yangrui


Re: highlighting on child document

2015-11-08 Thread Yangrui Guo
But how does highlighting work with a block join query? Do I need to supply
an additional parameter?

Yangrui

On Sun, Nov 8, 2015 at 12:45 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Thu, Nov 5, 2015 at 12:12 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> >
> > Highlighter for block join hasn't been implemented.
>
>
> Here I'm wrong:
>  https://issues.apache.org/jira/browse/LUCENE-5929
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


fetched but none was processed when set batchSize to -1

2015-11-08 Thread Yangrui Guo
*Hello*

*Indexing since 23m 45s*
Requests: 5 (0/s), Fetched: 352,993 (248/s), Skipped: 0, Processed: 0 (0/s)
Started: less than a minute ago

I tried to index a table with a nested structure. I set the parent entity as
director and put cacheImpl="SortedMapBackedCache" processor=
"SqlEntityProcessor" cachedKey="parent" cacheLookup="director.id" in the
children. However, quite a long time has passed and no documents have been
processed. What is wrong with DIH?

Yangrui


Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
Thanks for your kind reply. I tried both using SqlEntityProcessor and setting
batchSize to -1, but didn't get any improvement. It'd be helpful if I could
see the data import handler's log.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> LoL. Of course I meant SolrJ. I had to misspell the most important
> word of the hundreds I wrote in this thread :-)
>
> Thank you Erick for the correction.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:18, Erick Erickson  > wrote:
> > Alexandre, did you mean SolrJ?
> >
> > Here's a way to get started
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Best,
> > Erick
> >
> > On Sat, Nov 7, 2015 at 2:22 PM, Alexandre Rafalovitch
> > > wrote:
> >> Have you thought of just using Solr. Might be faster than
> troubleshooting
> >> DIH for complex scenarios.
> >> On 7 Nov 2015 3:39 pm, "Yangrui Guo"  > wrote:
> >>
> >>> I found multiple strange things besides the slowness. I performed
> count(*)
> >>> in MySQL but only one-fifth of the records were imported. Also
> sometimes
> >>> dataimporthandler  either doesn't import at all or only imports a
> portion
> >>> of the table. How can I debug the importer?
> >>>
> >>> On Saturday, November 7, 2015, Yangrui Guo  > wrote:
> >>>
> >>> > I just realized that not everything was ok. Three child entities
> were not
> >>> > imported. Had set batchSize to -1 but again solr was stuck :(
> >>> >
> >>> > On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  
> >>> > ');>>
> wrote:
> >>> >
> >>> >> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey
> and
> >>> >> used WHERE clause instead. Everything works fine now.
> >>> >>
> >>> >> Yangrui
> >>> >>
> >>> >>
> >>> >> On Friday, November 6, 2015, Shawn Heisey  
> >>> >> ');>>
> wrote:
> >>> >>
> >>> >>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> >>> >>> >  >>> >>>
> >>> >>> There's a good chance that JDBC is trying to read the entire
> result set
> >>> >>> (all three million rows) into memory before sending any of that
> info to
> >>> >>> Solr.
> >>> >>>
> >>> >>> Set the batchSize to -1 for MySQL so that it will stream results to
> >>> Solr
> >>> >>> as soon as they are available, and not wait for all of them.
> Here's
> >>> >>> more info on the situation, which frequently causes OutOfMemory
> >>> problems
> >>> >>> for users:
> >>> >>>
> >>> >>>
> >>> >>>
> >>>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >>> >>> <
> >>>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
> >>> >
> >>> >>>
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Shawn
> >>> >>>
> >>> >>>
> >>> >
> >>>
>


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Yes, the id is unique. If I only select distinct id, count(id) I get the same
results. However, I found this is more likely a MySQL issue. I created a new
table called director1 and ran the query "insert into director1 select * from
director"; only 287041 rows were inserted, which was the same as Solr. I
don't know why the same query is producing two different results.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> That's not quite the question I asked. Do a distinct on 'id' only in
> the database itself. If your ids are NOT unique, you need to create a
> composite or a virtual id for Solr. Because whatever your
> solrconfig.xml say is uniqueKey will be used to deduplicate the
> documents. If you have 10 documents with the same id value, only one
> will be in the final Solr.
>
> I am not saying that's where the problem is, DIH is fiddly. But just
> get that out of the way.
>
> If that's not the case, you may need to isolate which documents are
> failing. The easiest way to do so is probably to index a smaller
> subset of records, say 1000. Pick a condition in your SQL to do so
> (e.g. id value range). Then, see how many made it into Solr. If not
> all 1000, export the list of IDs from SQL, then a list of IDs from
> Solr (use CSV format and just fl=id). Sort both, compare, see what ids
> are missing. Look what is strange about those documents as opposed to
> the documents that did make it into Solr. Try to push one of those
> missing documents explicitly into Solr by either modifying SQL query
> in DIH or as CSV or whatever.
>
> Good luck,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:07, Yangrui Guo  > wrote:
> > Hi thanks for the continued support. I'm really worried as my project
> > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> > distinct in the beginning of the query because IMDB doesn't have a table
> > for cast & crew. It puts movie and person and their roles into one huge
> > table 'cast_info'. Hence there are multiple rows for a director, one row
> > per his movie.
> >
> > On Saturday, November 7, 2015, Alexandre Rafalovitch  >
> > wrote:
> >
> >> Just to get the paranoid option out of the way, is 'id' actually the
> >> column that has unique ids in your database? If you do "select
> >> distinct id from imdb.director" - how many items do you get?
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 7 November 2015 at 18:21, Yangrui Guo  
> >> > wrote:
> >> > Hello
> >> >
> >> > I'm being troubled by solr's data import handler. My solr version is
> >> 5.3.1
> >> > and mysql is 5.5. I tried to index imdb data but found solr only
> >> partially
> >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> >> query
> >> > result was 1636549. However DIH only fetched and indexed 287041 rows.
> I
> >> > didn't see any error in the log. Why was this happening?
> >> >
> >> > Here's my data-config.xml
> >> >
> >> > 
> >> >  >> > url="jdbc:mysql://localhost:3306/imdb" user="root"
> password="password" />
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> >
> >> > Yangrui Guo
> >>
>


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Hi, thanks for the continued support. I'm really worried as my project
deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put SELECT
DISTINCT at the beginning of the query because IMDB doesn't have a table for
cast & crew: it puts movies, people, and their roles into one huge table,
'cast_info'. Hence there are multiple rows for a director, one row per movie.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> Just to get the paranoid option out of the way, is 'id' actually the
> column that has unique ids in your database? If you do "select
> distinct id from imdb.director" - how many items do you get?
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 18:21, Yangrui Guo  > wrote:
> > Hello
> >
> > I'm being troubled by solr's data import handler. My solr version is
> 5.3.1
> > and mysql is 5.5. I tried to index imdb data but found solr only
> partially
> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> query
> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
> > didn't see any error in the log. Why was this happening?
> >
> > Here's my data-config.xml
> >
> > 
> >  > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >
> > Yangrui Guo
>


Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Hello

I'm being troubled by Solr's data import handler. My Solr version is 5.3.1
and MySQL is 5.5. I tried to index IMDB data but found Solr only partially
indexed it. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the query
result was 1636549. However, DIH only fetched and indexed 287041 rows. I
didn't see any errors in the log. Why is this happening?

Here's my data-config.xml











Yangrui Guo


Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
I found multiple strange things besides the slowness. I performed count(*)
in MySQL, but only one-fifth of the records were imported. Also, sometimes
the data import handler either doesn't import at all or only imports a
portion of the table. How can I debug the importer?

On Saturday, November 7, 2015, Yangrui Guo  wrote:

> I just realized that not everything was ok. Three child entities were not
> imported. Had set batchSize to -1 but again solr was stuck :(
>
> On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  > wrote:
>
>> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and
>> used WHERE clause instead. Everything works fine now.
>>
>> Yangrui
>>
>>
>> On Friday, November 6, 2015, Shawn Heisey > > wrote:
>>
>>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>>> > >>
>>> There's a good chance that JDBC is trying to read the entire result set
>>> (all three million rows) into memory before sending any of that info to
>>> Solr.
>>>
>>> Set the batchSize to -1 for MySQL so that it will stream results to Solr
>>> as soon as they are available, and not wait for all of them.  Here's
>>> more info on the situation, which frequently causes OutOfMemory problems
>>> for users:
>>>
>>>
>>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>>> <http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F>
>>>
>>>
>>> Thanks,
>>> Shawn
>>>
>>>
>


Re: data import extremely slow

2015-11-07 Thread Yangrui Guo
I just realized that not everything was OK. Three child entities were not
imported. I had set batchSize to -1, but again Solr was stuck :(

On Fri, Nov 6, 2015 at 3:11 PM, Yangrui Guo  wrote:

> Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and used
> WHERE clause instead. Everything works fine now.
>
> Yangrui
>
>
> On Friday, November 6, 2015, Shawn Heisey  wrote:
>
>> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
>> > >
>> There's a good chance that JDBC is trying to read the entire result set
>> (all three million rows) into memory before sending any of that info to
>> Solr.
>>
>> Set the batchSize to -1 for MySQL so that it will stream results to Solr
>> as soon as they are available, and not wait for all of them.  Here's
>> more info on the situation, which frequently causes OutOfMemory problems
>> for users:
>>
>>
>> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>> <http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29%7C%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F>
>>
>>
>> Thanks,
>> Shawn
>>
>>


Re: data import extremely slow

2015-11-06 Thread Yangrui Guo
Thanks for the reply. I just removed CacheKeyLookUp and CachedKey and used
WHERE clause instead. Everything works fine now.

Yangrui

On Friday, November 6, 2015, Shawn Heisey  wrote:

> On 11/6/2015 10:32 AM, Yangrui Guo wrote:
> > 
> There's a good chance that JDBC is trying to read the entire result set
> (all three million rows) into memory before sending any of that info to
> Solr.
>
> Set the batchSize to -1 for MySQL so that it will stream results to Solr
> as soon as they are available, and not wait for all of them.  Here's
> more info on the situation, which frequently causes OutOfMemory problems
> for users:
>
>
> http://wiki.apache.org/solr/DataImportHandlerFaq?highlight=%28mysql%29|%28batchsize%29#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
>
>
> Thanks,
> Shawn
>
>
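
For reference, what batchSize=-1 amounts to on the JDBC side for MySQL,
sketched as plain JDBC (connection details are placeholders): MySQL
Connector/J only streams rows, instead of buffering the whole result set in
memory, when the statement is forward-only and read-only and the fetch size
is Integer.MIN_VALUE.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingJdbcExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/imdb", "root", "password");

        // Forward-only, read-only statement with fetch size Integer.MIN_VALUE:
        // this is the MySQL-specific signal to stream rows one at a time
        // rather than loading the whole result set into memory first.
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);

        ResultSet rs = stmt.executeQuery("SELECT id FROM director");
        while (rs.next()) {
            // each row becomes available as soon as the driver reads it off the wire
            System.out.println(rs.getLong("id"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}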


data import extremely slow

2015-11-06 Thread Yangrui Guo
Hi

I'm using Solr's data import handler and MySQL 5.5 to index the IMDB
database. However, the data import takes a few minutes to process one
document, and there are over 3 million movies. This is going to take forever,
yet I can select the rows in MySQL in no time. What am I doing wrong? My
data-config.xml is like below:











I created views for the database:

movie:

SELECT
`title`.`id` AS `id`
FROM
`title`

movie_actor:

SELECT
CONCAT('movie.',
`title`.`id`,
'.actor.',
`cast_info`.`person_id`) AS `id`,
`title`.`id` AS `parent`,
`name`.`name` AS `name`,
FROM
((`title`
JOIN `cast_info` ON ((`cast_info`.`movie_id` = `title`.`id`)))
JOIN `name` ON ((`cast_info`.`person_id` = `name`.`id`)))
WHERE
(`cast_info`.`role_id` = 1)

movie_actress:

SELECT
CONCAT('movie.',
`title`.`id`,
'.actress.',
`cast_info`.`person_id`) AS `id`,
`title`.`id` AS `parent`,
`name`.`name` AS `name`,
FROM
((`title`
JOIN `cast_info` ON ((`cast_info`.`movie_id` = `title`.`id`)))
JOIN `name` ON ((`cast_info`.`person_id` = `name`.`id`)))
WHERE
(`cast_info`.`role_id` = 2)

Thanks,

Yangrui


Re: highlighting on child document

2015-11-05 Thread Yangrui Guo
So if child document highlighting doesn't work, how can I get Solr to tell me
which child document and which field matched?

On Wednesday, November 4, 2015, Mikhail Khludnev 
wrote:

> Hello,
>
> Highlighter for block join hasn't been implemented. So, far you can call
> highlighter with children query also passing fq={!child
> ..}parent-id:.
>
> On Wed, Nov 4, 2015 at 7:57 PM, Yangrui Guo  > wrote:
>
> > Hi
> >
> > I want to highlight matched terms on child documents because I need to
> > determine which field matched the search terms. However when I use block
> > join solr returned empty highlight fields. How can I use highlight with
> > nested document? Or is there anyway to tell which field matched the query
> > terms?
> >
> > Yangrui
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> >
>
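
A SolrJ sketch of the workaround Mikhail describes above: query the child
documents directly with highlighting turned on, and use fq={!child ...} to
restrict the children to one parent. The field names, parent id, and
collection URL are placeholders, not values from the thread.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ChildHighlightExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/movie");

        // Query the children themselves, not {!parent ...}, so the highlighter
        // sees the child documents and their fields.
        SolrQuery q = new SolrQuery("actor:winslet");
        // Limit the children to those belonging to one parent document.
        q.addFilterQuery("{!child of=\"content_type:parent\"}id:tt0120338");
        q.setHighlight(true);
        q.addHighlightField("actor");

        QueryResponse rsp = client.query(q);
        // Map of child-document id -> highlighted field -> snippets.
        System.out.println(rsp.getHighlighting());
        client.close();
    }
}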


highlighting on child document

2015-11-04 Thread Yangrui Guo
Hi

I want to highlight matched terms in child documents because I need to
determine which field matched the search terms. However, when I use block
join, Solr returns empty highlight fields. How can I use highlighting with
nested documents? Or is there any way to tell which field matched the query
terms?

Yangrui


Re: Kate Winslet vs Winslet Kate

2015-11-03 Thread Yangrui Guo
I tried, but still didn't get the correct result. I guess the reason is that
I use block join with the document. My current solution is to use a name
tagger to extract persons and then put a name-field restriction before them.
This will not work in all situations, though. Thanks for the reply.

On Tuesday, November 3, 2015, Imtiaz Shakil Siddique <
shakilsust...@gmail.com> wrote:

> I think edismax query parser perfectly fits for your needs.
> You can make edismax search for query words on multiple fields using the
> "qf" parameter and you can also set the priority of those searched fields.
>
> Edismax also auto generates phrase query for specified fields . ( look into
> the "pf" and "pf2" parameter )
>
> Here is an example -->
> qf: name^3.0 text^1.0
> pf: name
> ps:2
> pf2:name text^0.5
>
> Now when you search for "Kate Winslet" you'll get docs matching the query
> words from the name field.
> If you search for "Kate Winslet Movie" then "pf" will pick up "Kate
> Winslet" from the name field and "movie" will be picked up from the text
> field.
> You can experiment with query-time boosting to fine-tune your needs.
>
> Regards
> Imtiaz Shakil Siddique
> On Nov 3, 2015 9:36 PM, "scott chu" >
> wrote:
>
> > solr-user, hello
> >
> > With respect to querying, Dismax makes Solr query syntax quite like
> > Google's: you type simple keywords, you can boost them, and you can use +/-
> > just like in Google. That gives users a lot of convenience and requires
> > less boolean knowledge to build the intended query string. Normal Lucene
> > search syntax is treated as escaped characters, except AND & OR. You could
> > say Dismax gives some room for phrase querying. eDismax improves a few
> > things over Dismax, but it depends on whether you need those improvements
> > or not. You can see it here:
> >
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> >
> >
> >
> > - Original Message -
> > *From: *Yangrui Guo >
> > *To: *solr-user@lucene.apache.org 
> > *Date: *2015-11-01, 23:58:27
> > *Subject: *Re: Kate Winslet vs Winslet Kate
> >
> > Could you tell me more about the edismax approach? I'm new to it. Thanks
> a
> >
> > lot
> >
> > On Sunday, November 1, 2015, Erick Erickson  
> > <+erickerick...@gmail.com >> wrote:
> >
> > > If your goal is to have docs with "kate" and "winslet"
> > > in the _name_ field be scored higher, just make that
> > > explicit as
> > > name:(kate AND winslet)
> > > perhaps boosting as
> > > name:(kate AND winslet)^10
> > > or add it as a clause
> > > q=kate AND winslet OR name:(kate AND winslet)^10
> > > or even
> > > q=kate AND winslet OR name:(kate AND winslet)^10 OR name:"kate
> > winslet"^20
> > >
> > >
> > > Or use edismax to do this kind of thing for you, that's
> > > its purpose.
> > >
> > > Best,
> > > Erick
> > >
> > > On Sun, Nov 1, 2015 at 7:06 AM, Yangrui Guo  
> > <+guoyang...@gmail.com >
> > > > wrote:
> > > > I debugged the query and found that it has been translated into
> > > > _text_:Kate AND _text_:Winslet, where _text_ is the default search field.
> > > > Because my documents use a parent/child relation, it appears that if
> > > > there's no exact match of Kate Winslet, Solr will return all documents
> > > > containing "Kate" and "Winslet" anywhere. However, it would make more
> > > > sense if Solr could rank docs that have "Kate" and "Winslet" in the same
> > > > field higher. Of course I could use some NLP tricks with named entity
> > > > recognition, but that would be more expensive to develop.
> > > >
> > > > On Sunday, November 1, 2015, Paul Libbrecht  
> > <+p...@hoplahup.net >
> > > > wrote:
> > > >
> > > >> Alexandre,
> > > >>
> > > >> I guess you are talking about that post:
> > > >>
> > > >>
> > > >>
> > >
> >
> http://lucidworks.com/blog/2015/06/06/query-autofiltering-extended-language-logic-search/
> > > >>
> > > >> I think it is very often impossible to solve properly.
> > > >>
> > > >> Words such as "direction" have very many meanings and would come in
> > > >> different fields. [rest of the message truncated in the archive]

Re: Kate Winslet vs Winslet Kate

2015-11-01 Thread Yangrui Guo
I've just read the post and it addresses much of my issue. It is hard
to detect and disambiguate phrases, but some of the existing approaches
seem really promising.

On Sunday, November 1, 2015, Paul Libbrecht  wrote:

> Alexandre,
>
> I guess you are talking about that post:
>
>
> http://lucidworks.com/blog/2015/06/06/query-autofiltering-extended-language-logic-search/
>
> I think it is very often impossible to solve properly.
>
> Words such as "direction" have very many meanings and would come in
> different fields.
> In IMDB, words such as the names of persons would come in at least
> different roles; similarly, the actors' role's name is likely to match
> the family name of persons...
>
> Paul
>
>
>
> > As others indicated, having the intelligence to recognize the terms (e.g.
> > Kate should be in name), or some user indication to do so, can make things
> > more precise but is rarely done.
> > Alexandre Rafalovitch 
> > 1 novembre 2015 13:07
> > Which is what I believe Ted Sullivan is working on and presented at
> > the latest Lucene/Solr Revolution. His presentation does not seem to
> > be up, but he was writing about it on:
> > http://lucidworks.com/blog/author/tedsullivan/
>
> > Erick Erickson 
> > 1 novembre 2015 07:40
> > Yeah, that's actually a tough one. You have no control over what the
> > user types,
> > you have to try to guess what they meant.
>
>


Re: Kate Winslet vs Winslet Kate

2015-11-01 Thread Yangrui Guo
Could you tell me more about the edismax approach? I'm new to it. Thanks a
lot
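
(For anyone reading this later: edismax can be turned on per request with
defType=edismax, or set as defaults on a request handler in solrconfig.xml. A
minimal sketch of the latter, with assumed field names and boosts:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name^3.0 _text_</str>
    <str name="pf">name^10</str>
    <str name="ps">2</str>
  </lst>
</requestHandler>

Clients then send a plain q=kate winslet and the handler applies the
multi-field and phrase boosting automatically.)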

On Sunday, November 1, 2015, Erick Erickson  wrote:

> If your goal is to have docs with "kate" and "winslet"
> in the _name_ field be scored higher, just make that
> explicit as
> name:(kate AND winslet)
> perhaps boosting as
> name:(kate AND winslet)^10
> or add it as a clause
> q=kate AND winslet OR name:(kate AND winslet)^10
> or even
> q=kate AND winslet OR name:(kate AND winslet)^10 OR name:"kate winslet"^20
>
>
> Or use edismax to do this kind of thing for you, that's
> its purpose.
>
> Best,
> Erick
>
> On Sun, Nov 1, 2015 at 7:06 AM, Yangrui Guo  > wrote:
> > I debugged the query and found that it has been translated into
> > _text_:Kate AND _text_:Winslet, where _text_ is the default search field.
> > Because my documents use a parent/child relation, it appears that if
> > there's no exact match of Kate Winslet, Solr will return all documents
> > containing "Kate" and "Winslet" anywhere. However, it would make more sense
> > if Solr could rank docs that have "Kate" and "Winslet" in the same field
> > higher. Of course I could use some NLP tricks with named entity
> > recognition, but that would be more expensive to develop.
> >
> > On Sunday, November 1, 2015, Paul Libbrecht  > wrote:
> >
> >> Alexandre,
> >>
> >> I guess you are talking about that post:
> >>
> >>
> >>
> http://lucidworks.com/blog/2015/06/06/query-autofiltering-extended-language-logic-search/
> >>
> >> I think it is very often impossible to solve properly.
> >>
> >> Words such as "direction" have very many meanings and would come in
> >> different fields.
> >> In IMDB, words such as the names of persons would come in at least
> >> different roles; similarly, the actors' role's name is likely to match
> >> the family name of persons...
> >>
> >> Paul
> >>
> >>
> >>
> >> > As others indicated, having the intelligence to recognize the terms (e.g.
> >> > Kate should be in name), or some user indication to do so, can make
> >> > things more precise but is rarely done.
> >> > Alexandre Rafalovitch <mailto:arafa...@gmail.com 
> >
> >> > 1 novembre 2015 13:07
> >> > Which is what I believe Ted Sullivan is working on and presented at
> >> > the latest Lucene/Solr Revolution. His presentation does not seem to
> >> > be up, but he was writing about it on:
> >> > http://lucidworks.com/blog/author/tedsullivan/
> >>
> >> > Erick Erickson <mailto:erickerick...@gmail.com 
> >
> >> > 1 novembre 2015 07:40
> >> > Yeah, that's actually a tough one. You have no control over what the
> >> > user types,
> >> > you have to try to guess what they meant.
> >>
> >>
>


Re: Kate Winslet vs Winslet Kate

2015-11-01 Thread Yangrui Guo
I debugged the query and found that it has been translated into
_text_:Kate AND _text_:Winslet, where _text_ is the default search field.
Because my documents use a parent/child relation, it appears that if there's
no exact match of Kate Winslet, Solr will return all documents containing
"Kate" and "Winslet" anywhere. However, it would make more sense if Solr could
rank docs that have "Kate" and "Winslet" in the same field higher. Of course
I could use some NLP tricks with named entity recognition, but that would
be more expensive to develop.

On Sunday, November 1, 2015, Paul Libbrecht  wrote:

> Alexandre,
>
> I guess you are talking about that post:
>
>
> http://lucidworks.com/blog/2015/06/06/query-autofiltering-extended-language-logic-search/
>
> I think it is very often impossible to solve properly.
>
> Words such as "direction" have very many meanings and would come in
> different fields.
> In IMDB, words such as the names of persons would come in at least
> different roles; similarly, the actors' role's name is likely to match
> the family name of persons...
>
> Paul
>
>
>
> > As others indicated, having the intelligence to recognize the terms (e.g.
> > Kate should be in name), or some user indication to do so, can make things
> > more precise but is rarely done.
> > Alexandre Rafalovitch 
> > 1 novembre 2015 13:07
> > Which is what I believe Ted Sullivan is working on and presented at
> > the latest Lucene/Solr Revolution. His presentation does not seem to
> > be up, but he was writing about it on:
> > http://lucidworks.com/blog/author/tedsullivan/
>
> > Erick Erickson 
> > 1 novembre 2015 07:40
> > Yeah, that's actually a tough one. You have no control over what the
> > user types,
> > you have to try to guess what they meant.
>
>


Re: Kate Winslet vs Winslet Kate

2015-10-31 Thread Yangrui Guo
Thanks for the reply. Putting name: before the terms did the trick. I
just wanted to generalize the search query because users might be
interested in querying Kate Winslet herself or her movies. If the user enters
the query string "Kate Winslet movie", the query q=name:(Kate AND Winslet AND
movie) will return nothing.
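
For what it's worth, one way edismax handles that case is the mm (minimum
should match) parameter, so a term like "movie" that never occurs in the name
field does not zero out the whole query (field names assumed, sketch only):

q=Kate Winslet movie
defType=edismax
qf=name^3.0 _text_
mm=2

With mm=2 only two of the three terms must match, and because qf also covers
the catch-all field, "movie" can still contribute when it does appear there.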

Yangrui Guo

On Saturday, October 31, 2015, Erick Erickson 
wrote:

> There are a couple of anomalies here.
>
> 1> kate AND winslet
> What does the query look like if you add &debug=true to the statement
> and look at the "parsed_query" section of the return?  My guess is you
> typed "q=name:kate AND winslet" which parses as "q=name:kate AND
> default_search_field:winslet" and are getting matches you don't
> expect. You need something like "q=name:(kate AND winslet)" or
> "q=name:kate AND name:winslet". Note that if you're using eDIsmax it's
> more complicated, but that should still honor the intent.
>
> 2> I have no idea why searching for "Kate Winslet" in quotes returns
> anything, I wouldn't expect it to unless you mean you type in "q=kate
> winslet" which is searching against your default field, not the name
> field.
>
> Best,
> Erick
>
> On Sat, Oct 31, 2015 at 8:52 PM, Yangrui Guo  > wrote:
> > Hi, today I found an interesting aspect of Solr. I imported IMDB data into
> > Solr. IMDB puts the last name before the first name in its person name
> > field, e.g. "Winslet, Kate". When I search "Winslet Kate" with quotation
> > marks I get the exact result. However, if I search "Kate Winslet" or Kate
> > AND Winslet, Solr seems to return all results containing either Kate or
> > Winslet, which is similar to "Winslet Kate"~99. From a user's perspective
> > I certainly want Solr to treat Kate Winslet the same as Winslet Kate. Is
> > there any way to make Solr score documents higher when the terms are in the
> > same field?
> >
> > Yangrui
>


Kate Winslet vs Winslet Kate

2015-10-31 Thread Yangrui Guo
Hi, today I found an interesting aspect of Solr. I imported IMDB data into
Solr. IMDB puts the last name before the first name in its person name field,
e.g. "Winslet, Kate". When I search "Winslet Kate" with quotation marks I
get the exact result. However, if I search "Kate Winslet" or Kate AND
Winslet, Solr seems to return all results containing either Kate or Winslet,
which is similar to "Winslet Kate"~99. From a user's perspective I
certainly want Solr to treat Kate Winslet the same as Winslet Kate. Is
there any way to make Solr score documents higher when the terms are in the
same field?

Yangrui


Solr getting irrelevant results when using block join

2015-10-31 Thread Yangrui Guo
Hi, I'm using Solr to search the IMDB database. I set up the parent entity to
include the name of each actor/actress and a child entity for their movies.
Because the user might enter either a movie or a person, I did not specify
which entity Solr should return. When I just search q=Kate AND Winslet without
block join, Solr returns the correct result. However, when I search
{!parent which="type:parent"}+(Kate AND Winslet), Solr seems to return every
document containing just the term "Kate". I tried quoting the terms, but then
the order needs to be exactly "Kate Winslet". Is there any way to boost the
score of documents that include the terms in the same field?
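
As a side note — a sketch only, not tested against this data set — a common
way to rule out parsing and escaping surprises with {!parent} is to pass the
whole child query through a dereferenced parameter, so the entire boolean
expression is unambiguously parsed as the child query:

q={!parent which="type:parent" v=$childq}
childq=+Kate +Winslet

The + operators require both terms to match within the same child document
before its parent is returned.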

Yangrui


How to retrieve single child document with block join

2015-10-31 Thread Yangrui Guo
Hi

I want to know if I can get a child document only if it contains the
query term. Currently I can retrieve all child documents at once with
query expansion. Does Solr support retrieving individual children?
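
In case it helps, one sketch (assuming a type:parent discriminator field and a
Solr version with the [child] doc transformer) is to query the parents with a
block join and let the transformer attach only the children that match a
filter:

q={!parent which="type:parent"}name:winslet
fl=*,[child parentFilter="type:parent" childFilter="name:winslet" limit=1]

Alternatively, querying the child documents directly (with q on the child
fields) returns the individual matching children as ordinary top-level results.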

Thanks,

Yangrui