Re: search by person name

2015-04-20 Thread Yavar Husain
In this case q=name:(ana jose) will work, but if the search runs against a
full-text field it may have poor precision: it will also match a document
like "San Jose is better than Santa Ana", which was not the user's intent.
Erick's solution, "ana jose"~2, captures the intent as well.
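
To make the difference concrete, here is a minimal Python sketch that models
the two query styles on plain token lists. It is only an approximation of
what Lucene does: the whitespace tokenizer, the function names and the
two-term slop formula are my simplifications, and accent folding
(José vs. jose) is left out.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def matches_any_term(doc, query_terms):
    """Rough model of q=name:(ana jose) with the default OR operator."""
    doc_tokens = set(tokens(doc))
    return any(term in doc_tokens for term in query_terms)


def matches_sloppy_phrase(doc, first, second, slop):
    """Rough model of "first second"~slop for a two-term phrase.

    The cost of a match is how far the second term has to move to sit
    directly after the first one; a reversed adjacent pair costs 2.
    """
    doc_tokens = tokens(doc)
    first_pos = [i for i, t in enumerate(doc_tokens) if t == first]
    second_pos = [i for i, t in enumerate(doc_tokens) if t == second]
    return any(abs(p1 - (p2 - 1)) <= slop
               for p1 in first_pos for p2 in second_pos)


name = "Ana Maria Jose"
noise = "San Jose is better than Santa Ana"

print(matches_any_term(noise, ["ana", "jose"]))           # OR query also hits the noise
print(matches_sloppy_phrase(name, "ana", "jose", 2))      # slop bridges "maria"
print(matches_sloppy_phrase(noise, "ana", "jose", 2))     # terms too far apart
```

The sloppy phrase keeps the name match while rejecting the unrelated
sentence, which is the behaviour described above.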

On Mon, Apr 20, 2015 at 10:09 PM, Steven White  wrote:

> Why not just use q=name:(ana jose) ?  Then missing words or word order
> won't matter.  No?
>
> Steve
>
> On Mon, Apr 20, 2015 at 12:26 PM, Erick Erickson 
> wrote:
>
> > First, a little patience on your part please, we're all volunteers here.
> >
> > Second, what have you done to try to analyze the problem? Have you
> > tried adding &debug=query to your URL? Looked at the analysis page?
> > Anything else?
> >
> > You might review: http://wiki.apache.org/solr/UsingMailingLists
> >
> > My guess (and Rafal provided you a strong clue if my guess is right)
> > is that by enclosing "ana jose" in quotes you've created a phrase
> > query that requires the two words to be right next to each other and
> > they have "maria" between them. Using "slop", i.e. "ana jose"~2 should
> > find the doc if I'm correct.
> >
> > Best,
> > Erick
> >
> > On Mon, Apr 20, 2015 at 7:41 AM, Pedro Figueiredo
> >  wrote:
> > > Any help please?
> > >
> > > PF
> > >
> > > -Original Message-
> > > From: Pedro Figueiredo [mailto:pjlfigueir...@criticalsoftware.com]
> > > Sent: 20 April 2015 14:19
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: search by person name
> > >
> > > yes
> > >
> > > Pedro Figueiredo
> > > Senior Engineer
> > >
> > > pjlfigueir...@criticalsoftware.com
> > > M. 934058150
> > >
> > >
> > > Rua Engº Frederico Ulrich, nº 2650 4470-605 Moreira da Maia, Portugal
> > > T. +351 229 446 927 | F. +351 229 446 929 www.criticalsoftware.com
> > >
> > > PORTUGAL | UK | GERMANY | USA | BRAZIL | MOZAMBIQUE | ANGOLA A CMMI®
> > > LEVEL 5 RATED COMPANY CMMI® is registered in the USPTO by CMU"
> > >
> > >
> > >
> > > -Original Message-
> > > From: Rafal Kuc [mailto:ra...@alud.com.pl]
> > > Sent: 20 April 2015 14:10
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: search by person name
> > >
> > > Hello,
> > >
> > > What does your query look like? Do you use a phrase query, like
> > > q=name:"ana jose"?
> > >
> > > ---
> > > Regards,
> > > Rafał Kuć
> > >
> > >
> > >
> > >
> > >> Message written by Pedro Figueiredo <
> > pjlfigueir...@criticalsoftware.com> on 20 Apr 2015, at 15:06:
> > >>
> > >> Hi all,
> > >>
> > >> Can anyone advise which tokenizers and filters to use for the most
> > >> common way of searching by people’s names?
> > >> The basic requirements are:
> > >>
> > >> For field name – “Ana Maria José”
> > >> The following searches should return the example:
> > >> 1.   “Ana”
> > >> 2.   “Maria”
> > >> 3.   “Jose”
> > >> 4.   “ana maria”
> > >> 5.   “ana jose”
> > >>
> > >> With the following configuration I’m not able to satisfy all the
> > >> searches (namely the last one):
> > >> [field type configuration stripped by the list archive]
> > >>
> > >> Thanks in advance,
> > >>
> > >> Pedro Figueiredo
> > >> Senior Engineer
> > >>
> > >> pjlfigueir...@criticalsoftware.com
> > >> 
> > >> M. 934058150
> > >>
> > >>
> > >
> > >
> >
>


Re: Search in Solr Index

2015-04-20 Thread Yavar Husain
There might be an issue with your default search field. If the field you are
searching is named "MyTestField", give your query as
MyTestField:Birmingham
and see if you get any results. As Matt suggested, there might also be
issues with the way you have set up tokenization/analysis.
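
As a rough illustration of the string versus tokenized behaviour Matt
describes in the quoted message below, here is a minimal Python sketch. The
field value "Birmingham City Council" and both function names are
hypothetical, and the whitespace-and-lowercase analysis is only a stand-in
for what a type like text_general does.

```python
def string_field_matches(indexed_value, query):
    """A string field is indexed verbatim, so only the full value matches."""
    return indexed_value == query


def text_field_matches(indexed_value, query):
    """A tokenized field matches on individual lowercased tokens."""
    return query.lower() in indexed_value.lower().split()


value = "Birmingham City Council"  # hypothetical field content

print(string_field_matches(value, "Birmingham"))  # no hit on a partial value
print(text_field_matches(value, "Birmingham"))    # hit on one token
```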



On Mon, Apr 20, 2015 at 9:21 PM, Matt Kuiper  wrote:

> What type of field are you using? String?  If so try another type, like
> text_general.
>
> I believe with type String the contents are stored in the index exactly as
> they are inputted into the index.  So a search hit will have to match
> exactly the full value of the field, I assume in your case "Birmingham" is
> only part of the value.  With text_general and other types, the value will
> be tokenized and allow for hits on parts or variants of the value.
>
> Matt
>
>
> -Original Message-
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
> vijaya.bhoomire...@whishworks.com]
> Sent: Monday, April 20, 2015 9:31 AM
> To: solr-user@lucene.apache.org
> Subject: Search in Solr Index
>
> Hi,
>
> I am indexing some data from a database. The data is getting indexed
> properly, and when I query in the stock Solr UI with the query parameter
> *:*, I can see the documents with all the fields listed, and numFound is
> reported correctly. However, if I perform a query with a simple string,
> for example "Birmingham", numFound returns 0 with no records
> displayed. There are indexed records that contain fields with
> the text "Birmingham". In the schema.xml, all the fields have been defined
> as indexed="true" and stored="true".
>
> This is happening for any search query string. What could be the reason
> for this behavior?
>
>
> Thanks & Regards
> Vijay
>
> --
> The contents of this e-mail are confidential and for the exclusive use of
> the intended recipient. If you receive this e-mail in error please delete
> it from your system immediately and notify us either by e-mail or
> telephone. You should not copy, forward or otherwise disclose the content
> of the e-mail. The views expressed in this communication may not
> necessarily be the view held by WHISHWORKS.
>


Re: What is the best way of Indexing different formats of documents?

2015-04-07 Thread Yavar Husain
Well, I have indexed heterogeneous sources, including a variety of NoSQL
stores, RDBMSs and rich documents (PDF, Word etc.), using SolrJ. The only
prerequisite for using SolrJ is an API to fetch data from your data source
(say JDBC for an RDBMS, Tika for extracting text content from rich
documents etc.); then SolrJ is so damn great and simple. It's as simple as
downloading the jar and writing a few lines of code to send data to your
Solr server after pre-processing your data. More details here:

http://lucidworks.com/blog/indexing-with-solrj/

https://wiki.apache.org/solr/Solrj

http://www.solrtutorial.com/solrj-tutorial.html
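
The fetch, pre-process, send flow can be sketched without SolrJ as well. The
Python snippet below is not SolrJ; it assumes Solr's JSON update handler is
reachable at <core URL>/update, and the core URL, field names and helper
name are made up for illustration. It only builds the request, so nothing is
sent until you pass it to urllib.request.urlopen.

```python
import json
import urllib.request


def build_update_request(core_url, docs):
    """Build a POST request carrying a JSON array of documents.

    core_url is an assumption, e.g. http://localhost:8983/solr/collection1
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        core_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Documents as they might look after pre-processing (hypothetical fields).
docs = [{"id": "doc-1", "content": "text extracted from a PDF"}]
request = build_update_request("http://localhost:8983/solr/collection1", docs)
print(request.get_method(), request.full_url)
```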

Cheers,
Yavar



On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <
sangeetha.subraman...@gtnexus.com> wrote:

> Hi,
>
> I am a newbie to SOLR and basically from database background. We have a
> requirement of indexing files of different formats (x12,edifact, csv,xml).
> The files which are inputted can be of any format and we need to do a
> content based search on it.
>
> From the web I understand we can use TIKA processor to extract the content
> and store it in SOLR. What I want to know is, is there any better approach
> for indexing files in SOLR ? Can we index the document through streaming
> directly from the Application ? If so what is the disadvantage of using it
> (against DIH which fetches from the database)? Could someone share some
> insight on this? Are there any web links I can refer to for some idea
> on it? Please do help.
>
> Thanks
> Sangeetha
>
>


Information Retrieval/Text Mining opportunity @ GE Research Data Mining Labs, Bangalore

2015-03-25 Thread Yavar Husain
I have loved working on Solr, so thought of posting an Information
Retrieval/Text Mining requirement that we have for our GE Data Mining
Research Labs @ Bangalore. Apologies if it is considered inappropriate here.



Here goes the Job Description for those interested:



If Information Retrieval, Text Mining, Natural Language Processing &
Machine Learning fascinate you; if you are excited to research & build
state-of-the-art algorithms working on massive data-sets for an array of
Text Mining problems (Search, Named Entity Recognition, Semantic Graphs,
Sentiments, Spell Correctors, Text Categorization, Clustering, Topic
Modelling and so on…) then GE Global Research Data Mining Labs in Bangalore
is looking out for you. The real scope of applied research in our lab goes
way beyond the term “Natural” in Natural Language Processing.



Do connect if you need more information. Anyone who has limited or no
experience with the areas mentioned above but is passionate about
Information Retrieval/Text Mining and has a rock-solid background in
algorithms is encouraged to apply/connect.



Check out more on GE Research: http://www.geglobalresearch.com/



Cheers,

Yavar Husain

Lead Data Scientist - Text Mining Laboratory

GE Research, Bangalore

LinkedIn: http://www.linkedin.com/pub/yavar-husain/5/805/151

Text@ yavarhus...@gmail.com


Pattern for extracting text from a rich document and an associated metadata file

2015-03-04 Thread Yavar Husain
What is the best pattern to index the following kind of data:

HarryPotter.PDF
HarryPotter.txt

Avengers.Docx
Avengers.txt

For each of the above files, the metadata lies in the text file having the
same name as the rich document (as can be seen above).

(1) The brute-force method that I can think of is to extract the text from
the rich document, extract the metadata from the associated txt file, club
them together to form an XML document and send it to Solr for indexing.

(2) Another thing that I can think of is to use SolrJ and just
programmatically read the PDF and the txt file and send them to Solr. If
this is the case, is it possible to send the PDF directly to Solr without
having to extract the text first in my SolrJ program?

Is there something better that I can do quickly? I know that if I just had
rich documents I would have used the Tika-Solr integration/requestHandlers
to do the job.
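
Whichever option is used, the first step is pairing each rich document with
its metadata file. A minimal Python sketch of that pairing follows; the
extension list and the function name are assumptions, and the actual text
extraction and indexing are left out.

```python
from pathlib import PurePath

RICH_EXTENSIONS = {".pdf", ".docx"}  # assumed set of rich document types
META_EXTENSION = ".txt"


def pair_documents(filenames):
    """Map each base name to (rich document, metadata file).

    Base names that lack either half of the pair are dropped.
    """
    rich, meta = {}, {}
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix in RICH_EXTENSIONS:
            rich[path.stem] = name
        elif suffix == META_EXTENSION:
            meta[path.stem] = name
    return {stem: (rich[stem], meta[stem]) for stem in rich if stem in meta}


files = ["HarryPotter.PDF", "HarryPotter.txt", "Avengers.Docx", "Avengers.txt"]
for stem, (document, metadata) in pair_documents(files).items():
    print(stem, document, metadata)
```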

Any help would be appreciated.

Thanks,
Yavar


Re: Is Solr best for "did you mean" functionality just like Google?

2015-02-24 Thread Yavar Husain
Solr is an IR system in which spell correction is a topping, whereas Google
has a team dedicated just to spell correction. "Did you mean" (a more
general term, and much broader than basic spell correctors) and spell
correctors both require a plethora of skills. I will discuss only spell
correctors here and not go into "Did you mean":

To start with: 1) Edit distances (Example: in the misspelt 'cax', if 'x' is
replaced by 'r' or 't' it becomes 'car' or 'cat' respectively, both of which
are probable candidates for your misspelt word. Since both are at an edit
distance of 1, you can select the one which occurs more often in your Solr
index. However, you will have to handle the case where the misspelt word is
itself present in your index, say the misspelt token 'cax' occurring 100
times.)

A good spell corrector requires a lot of features on top of the above:

2) Phonetics (sounds of words/metaphone etc.).
3) If you have natural language queries like "The cax ran out of the
house", here 'cat' would be a much more suitable spelling correction for
'cax' than 'car'.
4) Language models play an important role. Think: what is the probability
of getting an 'm' after 'e', and how does it compare with getting a 'z'
after 'e'?
5) Your search/HTTP logs etc. will be a good source for improving the spell
corrector.
6) And you can list several others.

You can build a physics based model by taking into account the above
features for recommending the best candidate.

However, rather than working hard doing the above there is always a smarter
way out :). One example is looking at the terms in your Solr index: the
ones occurring least often can be analyzed for spelling errors.
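
Point 1 can be sketched in a few lines of Python. This is only an
illustration, not how Solr's spellchecker works internally: the term counts
are invented, and min_trusted_count is a hypothetical threshold for deciding
that a word already in the index is frequent enough to be left alone.

```python
import string


def edits1(word):
    """All strings at edit distance 1 from word (lowercase ASCII only)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, term_counts, min_trusted_count=1000):
    """Return the most frequent index term within one edit of word."""
    if term_counts.get(word, 0) >= min_trusted_count:
        return word  # frequent enough to be treated as correctly spelt
    candidates = [w for w in edits1(word) if w in term_counts and w != word]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (term_counts[w], w))


# Invented counts: 'cax' is in the index, but far rarer than its neighbours.
term_counts = {"car": 5000, "cat": 3000, "cax": 100}
print(suggest("cax", term_counts))
```

Note that frequency alone picks 'car' here; choosing 'cat' for "The cax ran
out of the house" needs the context features from points 3 and 4.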

Cheers,
Yavar

On Mon, Feb 23, 2015 at 9:53 PM, Nitin Solanki  wrote:

> Hello,
>   I am in a difficult situation. I want to add spell/query
> correction functionality. I have 49 GB of indexed data to which I have
> applied the spellchecker. I want to do the same as Google - "*did you mean*".
> *Example* - If a user types a question/query which might be misspelt or
> mistyped, I need to give them a suggestion like "Did you mean".
> Is Solr best for it?
>
>
> Warm Regards,
> Nitin Solanki
>


Re: Solr Date Range not returning results for last 1 month

2014-12-23 Thread Yavar Husain
Thanks Erick. That works!

Will check some other time as to why NOW/DAY does not work.
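
For reference, a small Python sketch of the parameters with Erick's end time,
showing the escaping of the plus sign. The field name and start value come
from the original query; facet=true and the function name are assumptions on
my part.

```python
from urllib.parse import urlencode


def range_facet_params(field, start, end, gap):
    """URL-encode the range facet parameters ('+' becomes %2B, '/' %2F)."""
    return urlencode({
        "facet": "true",  # assumed to be set elsewhere in the real request
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    })


print(range_facet_params("date", "NOW/DAY-36MONTH",
                         "NOW/MONTH+1MONTH", "+1MONTH"))
```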

Regards,
Yavar

On Wed, Dec 24, 2014 at 11:39 AM, Erick Erickson 
wrote:

> Hmmm, not quite sure what's going on here, but try an end
> time of NOW/MONTH+1MONTH with the usual escaping of the
> plus sign...
>
> Best,
> Erick
>
> On Tue, Dec 23, 2014 at 9:55 PM, Yavar Husain 
> wrote:
> > So my Solr date range query is as follows:
> >
> >
> &facet.range=date&facet.range.start=NOW/DAY-36MONTH&facet.range.end=NOW/DAY&facet.range.gap=%2B1MONTH
> >
> > I need facets for the past 36 months (3 years), and everything is fine
> > except that data is not being returned for the last month.
> >
> > The facets I am getting for the date run only until last month: say
> > today is 24th December, and I am getting them until 24th November. How
> > should I modify my query to obtain results until today? I tried a few
> > options using hit and trial :) but could not arrive at a solution.
> >
> > Appreciate the help in this regard.
>


Solr Date Range not returning results for last 1 month

2014-12-23 Thread Yavar Husain
So my Solr date range query is as follows:

&facet.range=date&facet.range.start=NOW/DAY-36MONTH&facet.range.end=NOW/DAY&facet.range.gap=%2B1MONTH

I need facets for the past 36 months (3 years), and everything is fine
except that data is not being returned for the last month.

The facets I am getting for the date run only until last month: say today
is 24th December, and I am getting them until 24th November. How should I
modify my query to obtain results until today? I tried a few options using
hit and trial :) but could not arrive at a solution.

Appreciate the help in this regard.
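
One possible reading of the symptom, sketched in Python: range facet buckets
are keyed by their lower bound, so with an end of NOW/DAY the last bucket is
labelled one gap before today even though it covers the weeks up to today's
midnight. The month arithmetic below is a simplification of Solr's date math
and the helper names are mine.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def bucket_starts(start, end, gap_months=1):
    """Lower bounds of the buckets that begin before end."""
    starts = []
    k = 0
    while add_months(start, k * gap_months) < end:
        starts.append(add_months(start, k * gap_months))
        k += 1
    return starts


today = date(2014, 12, 24)       # models NOW/DAY
start = add_months(today, -36)   # models NOW/DAY-36MONTH

print(bucket_starts(start, today)[-1])             # end = NOW/DAY
print(bucket_starts(start, date(2015, 1, 1))[-1])  # end = NOW/MONTH+1MONTH
```

With the later end, a bucket starting on 24th December appears, which would
cover documents dated today.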


Solr Clustering component different results than Carrot workbench

2014-08-18 Thread Yavar Husain
I am already interacting with Dawid (creator of Carrot2) on the Carrot2
mailing list, but I wanted to post my problem to a wider audience.

I am using Solr 4.7 (on both Windows and Linux) and saved my
lingo-attributes.xml file from the workbench, which I am now using in Solr.
Note that for testing I have just one Solr index, and all the queries are
fired against it.

The clusters that I am getting are good in the Carrot2 workbench but
pathetic in Solr. In the Jetty logs I can see:

Loaded Solr resource: clustering/carrot2/lingo-attributes.xml, which
indicates that my attribute file is being loaded.

I am really confused about what accounts for the difference between the two
outputs (workbench vs Solr). To reiterate, the data sources are the same
(just one Solr index and the same queries with 100 results). This is
happening on both Linux and Windows.

Given below is my search component and request handler configuration (the
XML markup was stripped by the list archive; only the element values
survive):

Search component:
  lingo
  org.carrot2.clustering.lingo.LingoClusteringAlgorithm
  30
  clustering/carrot2

Request handler:
  true
  true
  org.carrot2.clustering.lingo.LingoClusteringAlgorithm
  clustering/carrot2
  film_id
  description
  true
  false
  100
  clustering


Data Import Handler - resource not found - Jetty - Windows 7

2014-07-25 Thread Yavar Husain
Most of my experience with Solr has been on Tomcat. However, I recently
started with Jetty. I am using Solr 4.7.0 on Windows 7. I have configured
Solr properly and am able to see the admin UI as well as the Velocity
browse page. The DataImportHandler screen is also displayed. However, when
I do a full import it fails with the following error:

INFO  - 2014-07-25 12:28:35.177; org.apache.solr.core.SolrCore;
[collection1] webapp=/solr path=/dataimport
params={indent=true&command=status&_=1406271515176&wt=json} status=0
QTime=0
ERROR - 2014-07-25 12:28:35.179; org.apache.solr.common.SolrException;
java.io.IOException: Can't find resource
'C:/solr-4.7.0/example/solr/collection1/conf' in classpath or
'C:\solr-4.7.0\example\solr\collection1\conf'
at
org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:342)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:134)

A few notes:
My solrconfig.xml has dataimport configured and I have used:

  [configuration lines stripped by the list archive]

Also, my jars are present on those paths.

On my core admin UI I can see correct datadir which is
C:\solr-4.7.0\example\solr\collection1\data\

Any help would be appreciated.

Thanks,
Yavar



Re: Solr Cassandra MySQL Best Practice Indexing

2014-07-22 Thread Yavar Husain
Exactly. Thanks a lot Jack. +1 for "Your best bet is to get that RDBMS data
moved to Cassandra or DSE ASAP."


On Tue, Jul 22, 2014 at 5:15 PM, Jack Krupansky 
wrote:

> I don't think the Solr Data Import Handler has a Cassandra plugin (entity
> processor) yet, so the most straightforward approach is to write a Java
> app that reads from Cassandra, then reads the corresponding RDBMS data,
> combines the data, and then uses SolrJ to add documents to Solr.
>
> Your best bet is to get that RDBMS data moved to Cassandra or DSE ASAP.
> All you have until then is a stopgap measure rather than a robust
> architecture.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Yavar Husain
> Sent: Tuesday, July 22, 2014 2:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cassandra MySQL Best Practice Indexing
>
>
> Thanks Jack for your guidance on DSE. However, it would be great if somebody
> could help me solve my use case:
>
> So my full text data lies on Cassandra along with an ID. Now I have a lot
> of structured data linked to the ID which lies on an RDBMS (read MySQL). I
> need this structured data as it would help me with my faceting and other
> needs. What is the best practice in going about indexing in this scenario?
>
> I will think about incremental indexing for the new records later.
>
> Bit confused. Any help would be appreciated.
>
>
> On Mon, Jul 21, 2014 at 6:51 PM, Jack Krupansky 
> wrote:
>
>  Solandra is not a supported product. DataStax Enterprise (DSE) supersedes
>> it. With DSE, just load your data into a Solr-enabled Cassandra data
>> center
>> and it will be indexed automatically in the embedded Solr within DSE, as
>> per a Solr schema that you provide. Then use any of the nodes in that
>> Solr-enabled Cassandra data center just the same as with normal Solr.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Yavar Husain
>> Sent: Monday, July 21, 2014 8:37 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr Cassandra MySQL Best Practice Indexing
>>
>>
>> So my full text data lies on Cassandra along with an ID. Now I have a lot
>> of structured data linked to the ID which lies on an RDBMS (read MySQL). I
>> need this structured data as it would help me with my faceting and other
>> needs. What is the best practice in going about indexing in this scenario?
>> My thoughts (maybe weird):
>>
>> 1. Read the data from Cassandra, for each ID read, read the corresponding
>> row from MySQL for that ID, form an XML on the fly (for each ID) and send
>> it to Solr for Indexing without storing anything.
>> 2. I do not have much idea on Solandra. However even if I use it I will
>> have to go to MySQL for fetching the structured data.
>> 3. Duplicate the data and either get all of Cassandra to MySQL or vice
>> versa but then data duplication would happen.
>>
>> I will think about incremental indexing for the new records later.
>>
>> Bit confused. Any help would be appreciated.
>>
>>
>


Re: Solr Cassandra MySQL Best Practice Indexing

2014-07-21 Thread Yavar Husain
Thanks Jack for your guidance on DSE. However, it would be great if somebody
could help me solve my use case:

So my full text data lies on Cassandra along with an ID. Now I have a lot
of structured data linked to the ID which lies on an RDBMS (read MySQL). I
need this structured data as it would help me with my faceting and other
needs. What is the best practice in going about indexing in this scenario?

I will think about incremental indexing for the new records later.

Bit confused. Any help would be appreciated.


On Mon, Jul 21, 2014 at 6:51 PM, Jack Krupansky 
wrote:

> Solandra is not a supported product. DataStax Enterprise (DSE) supersedes
> it. With DSE, just load your data into a Solr-enabled Cassandra data center
> and it will be indexed automatically in the embedded Solr within DSE, as
> per a Solr schema that you provide. Then use any of the nodes in that
> Solr-enabled Cassandra data center just the same as with normal Solr.
>
> -- Jack Krupansky
>
> -Original Message- From: Yavar Husain
> Sent: Monday, July 21, 2014 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: Solr Cassandra MySQL Best Practice Indexing
>
>
> So my full text data lies on Cassandra along with an ID. Now I have a lot
> of structured data linked to the ID which lies on an RDBMS (read MySQL). I
> need this structured data as it would help me with my faceting and other
> needs. What is the best practice in going about indexing in this scenario?
> My thoughts (maybe weird):
>
> 1. Read the data from Cassandra, for each ID read, read the corresponding
> row from MySQL for that ID, form an XML on the fly (for each ID) and send
> it to Solr for Indexing without storing anything.
> 2. I do not have much idea on Solandra. However even if I use it I will
> have to go to MySQL for fetching the structured data.
> 3. Duplicate the data and either get all of Cassandra to MySQL or vice
> versa but then data duplication would happen.
>
> I will think about incremental indexing for the new records later.
>
> Bit confused. Any help would be appreciated.
>


Solr Cassandra MySQL Best Practice Indexing

2014-07-21 Thread Yavar Husain
My full-text data lies in Cassandra along with an ID. I also have a lot
of structured data linked to that ID which lies in an RDBMS (read: MySQL). I
need this structured data as it would help me with faceting and other
needs. What is the best practice for indexing in this scenario?
My thoughts (maybe weird):

1. Read the data from Cassandra; for each ID read, fetch the corresponding
row from MySQL, form an XML document on the fly and send it to Solr for
indexing without storing anything in between.
2. I do not know much about Solandra. However, even if I use it I will
have to go to MySQL to fetch the structured data.
3. Move all the Cassandra data to MySQL or vice versa, but then data
duplication would happen.
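
Option 1 amounts to a join on the ID at indexing time. A minimal Python
sketch of that join follows, with in-memory stand-ins for both stores; the
field names and the fetch_structured callback are hypothetical, and real
code would replace them with Cassandra and MySQL client calls (ideally
batching the MySQL lookups).

```python
def build_solr_docs(text_rows, fetch_structured):
    """Merge each full-text row with its structured row, keyed by id.

    text_rows: iterable of {"id": ..., "text": ...} read from Cassandra.
    fetch_structured: callable returning the MySQL row for an id, or None.
    """
    for row in text_rows:
        structured = fetch_structured(row["id"]) or {}
        doc = {"id": row["id"], "text": row["text"]}
        doc.update({k: v for k, v in structured.items() if k != "id"})
        yield doc


# In-memory stand-ins for the two data stores.
cassandra_rows = [{"id": 1, "text": "first document"},
                  {"id": 2, "text": "second document"}]
mysql_rows = {1: {"id": 1, "category": "news", "year": 2014}}

for doc in build_solr_docs(cassandra_rows, mysql_rows.get):
    print(doc)
```

Each merged dictionary can then be sent to Solr as one document, so nothing
has to be stored between the two reads.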

I will think about incremental indexing for the new records later.

Bit confused. Any help would be appreciated.


Research Scientist - Information Retrieval at GE Global Research (Data Mining Lab)

2014-04-24 Thread Yavar Husain
I am an avid Solr user so thought of posting an Information Retrieval/Text
Mining requirement that we have for our GE Data Mining Research Labs. I
hope it is not considered inappropriate here. Here goes the JD:

If Information Retrieval, Text Mining, Natural Language Processing &
Machine Learning fascinate you; if you are excited to research & build
state-of-the-art algorithms working on massive datasets for an array of
Text Mining problems (Search, Named Entity Recognition, Graphs, Sentiments,
Spell Correctors, Text Categorization, Clustering, Topic Modelling and so
on…) then GE Global Research Data Mining Labs in Bangalore is looking out
for you. The real scope of applied research in our lab goes way beyond the
term “Natural” in Natural Language Processing. Check out more on:

Research Scientist – Text Mining @ GE Global Research:
http://www.linkedin.com/jobs2/view/11451651
GE Research: http://www.geglobalresearch.com/
GE Software: http://gesoftware.com/

Do connect if you need more information. Anyone who has limited or no
experience with the areas mentioned above but is passionate about
Information Retrieval/Text Mining and has a rock-solid background in
algorithms is encouraged to apply/connect.

Cheers,
Yavar Husain
Lead Data Scientist - Text Mining Laboratory
GE Research, Bangalore
LinkedIn: http://www.linkedin.com/pub/yavar-husain/5/805/151
Text@ yavarhus...@gmail.com