LSH in Solr/Lucene

2014-01-20 Thread Shashi Kant
Hi folks, have any of you successfully implemented LSH (MinHash) in
Solr? If so, could you share some details of how you went about it?

I know LSH is available in Mahout, but I was hoping someone has a
Solr or Lucene implementation.
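For readers landing on this thread: the MinHash half of LSH is small enough to sketch. Below is a minimal Python illustration (signature length and the seeded-MD5 hashing scheme are arbitrary choices for demonstration, not a Solr or Mahout API):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over
    the token set; the fraction of matching positions between two
    signatures estimates the Jaccard similarity of the sets."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"solr", "lucene", "search", "index"})
b = minhash_signature({"solr", "lucene", "search", "ranking"})
print(estimate_jaccard(a, b))  # approximates the true Jaccard, 3/5
```

In a Lucene-backed setup one could index bands of the signature as terms and let ordinary term matching do the candidate lookup; that wiring is left out here.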

Thanks


Searching Numeric Data

2014-01-11 Thread Shashi Kant
Hi all, I have a use-case where I need to search a set of numeric
values using a query set. My business case is:

1. I have various rock samples from various locations {R1...Rn}, each with
multiple measurements such as Porosity [255] - an array of values - and
Conductivity [1028] - also an array of numbers - plus several similar
metrics.

They are arrays because measurements are taken under various ambient conditions.

2. For a new rock sample Rn+1, I would like to query Solr and get a
ranked list of samples ordered by their multidimensional similarity.

I was thinking of using Solr to perform this query by
representing the numeric arrays as text and creating a document for
each sample, with fields for each of the measurements.
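One common way to make numeric arrays searchable as text is to quantize each measurement into discrete bucket tokens, so that similar values share tokens and ordinary term-overlap scoring approximates numeric similarity. A hedged sketch (field name and bucket width are made up for illustration):

```python
def quantize(values, field, bucket_width):
    """Map raw measurements to discrete bucket tokens such as 'cond_10';
    documents whose measurements fall in the same buckets share terms,
    so text scoring ranks them as more similar."""
    return [f"{field}_{v // bucket_width}" for v in values]

# Conductivity readings bucketed into width-100 bins:
print(quantize([1021, 1028, 1995], "cond", 100))  # ['cond_10', 'cond_10', 'cond_19']
```

The bucket width controls the trade-off: narrower buckets demand closer matches, wider buckets recall more loosely similar samples.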

Has anyone approached a problem in this fashion? If so, could you share some
details about your approach?

Regards
Shashi


-- 
sk...@alum.mit.edu
(604) 446-2460


Re: Solr Patent

2013-09-14 Thread Shashi Kant
You can ask on this site http://patents.stackexchange.com/



On Sat, Sep 14, 2013 at 10:03 AM, Michael Sokolov
 wrote:
> On 9/13/2013 9:14 PM, Zaizen Ushio wrote:
>>
>> Hello
>> I have a question about patent.  I believe Apache license is protecting
>> Solr developers from patent issue in Solr community.  But is there any case
>> that Solr developer or Solr users are alleged by outside of Solr Community?
>> Is there any cases somebody experienced?  Any advice is appreciated.
>>
>> Thanks,  Zaizen
>>
>>
>>
> Zaizen - I doubt you will get legal advice from this community.  If you do
> get any other advice than to consult a lawyer, you should ignore it and
> consult a lawyer.  Or move to New Zealand - I hear they outlawed software
> patents there.  See that's just the sort of unhelpful legal advice you're
> likely to get here :)
>
> -Mike





Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf


On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI  wrote:
> Thanks for your comments.
>
> 2013/7/23 Tommaso Teofili 
>
>> if you need a specialized algorithm for detecting blogposts plagiarism /
>> quotations (which are different tasks IMHO) I think you have 2 options:
>> 1. implement a dedicated one based on your features / metrics / domain
>> 2. try to fine tune an existing algorithm that is flexible enough
>>
>> If I were to do it with Solr I'd probably do something like:
>> 1. index "original" blogposts in Solr (possibly using Jack's suggestion
>> about ngrams / shingles)
>> 2. do MLT queries with "candidate blogposts copies" text
>> 3. get the first, say, 2-3 hits
>> 4. mark it as quote / plagiarism
>> 5. eventually train a classifier to help you mark other texts as quote /
>> plagiarism
>>
>> HTH,
>> Tommaso
>>
>>
>>
>> 2013/7/23 Furkan KAMACI 
>>
>> > Actually I need a specialized algorithm. I want to use that algorithm to
>> > detect duplicate blog posts.
>> >
>> > 2013/7/23 Tommaso Teofili 
>> >
>> > > Hi,
>> > >
>> > > I think you may leverage and / or improve the MLT component [1].
>> > >
>> > > HTH,
>> > > Tommaso
>> > >
>> > > [1] : http://wiki.apache.org/solr/MoreLikeThis
>> > >
>> > >
>> > > 2013/7/23 Furkan KAMACI 
>> > >
>> > > > Hi;
>> > > >
>> > > > Sometimes a huge part of a document may exist in another document. As
>> > > like
>> > > > in student plagiarism or quotation of a blog post at another blog
>> post.
>> > > > Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) has any class
>> > to
>> > > > detect it?
>> > > >
>> > >
>> >
>>


Re: Search for misspelled words in corpus

2013-06-08 Thread Shashi Kant
n-grams might help, followed by an edit distance metric such as Jaro-Winkler
or Smith-Waterman-Gotoh to filter the candidates further.
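A sketch of that two-stage idea in Python: a cheap character n-gram pass for recall, then a tighter similarity cutoff for precision. Stdlib difflib.SequenceMatcher plus a Jaro-Winkler-style prefix bonus stands in here for a real Jaro-Winkler implementation (e.g. from the jellyfish library):

```python
from difflib import SequenceMatcher

def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def candidates(query, vocabulary, min_overlap=0.5):
    """First pass: character n-gram overlap recalls likely misspellings."""
    q = ngrams(query)
    return [w for w in vocabulary if len(q & ngrams(w)) / len(q) >= min_overlap]

def prefix_boosted(a, b, p=0.1, max_prefix=4):
    """Second pass: SequenceMatcher ratio with a Jaro-Winkler-style
    shared-prefix bonus, so 'figth' outranks 'sight' against 'fight'."""
    base = SequenceMatcher(None, a, b).ratio()
    shared = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        shared += 1
    return base + shared * p * (1 - base)

vocab = ["figth", "feight", "sight", "light", "fight"]
hits = [w for w in candidates("fight", vocab) if prefix_boosted("fight", w) >= 0.85]
print(hits)  # ['figth', 'feight', 'fight'] -- 'sight'/'light' fall below the cutoff
```

This matches the behaviour asked for in the thread: transpositions and insertions of the query word survive the filter, while merely rhyming words do not.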


On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic  wrote:

> Interesting problem.  The first thing that comes to mind is to do
> "word expansion" during indexing.  Kind of like synonym expansion, but
> maybe a bit more dynamic. If you can have a dictionary of correctly
> spelled words, then for each token emitted by the tokenizer you could
> look up the dictionary and expand the token to all other words that
> are similar/close enough.  This would not be super fast, and you'd
> likely have to add some custom heuristic for figuring out what
> "similar/close enough" means, but it might work.
>
> I'd love to hear other ideas...
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
>  wrote:
> > Hi,
> >
> > I have a problem where our text corpus on which we need to do search
> > contains many misspelled words. Same word could also be misspelled in
> > several different ways. It could also have documents that have correct
> > spellings However, the search term that we give in query would always be
> > correct spelling. Now when we search on a term, we would like to get all
> > the documents that contain both correct and misspelled forms of the
> search
> > term.
> > We tried fuzzy search, but it doesn't work as per our expectations. It
> > returns any close match, not specifically misspelled words. For example,
> if
> > I'm searching for a word like "fight", I would like to return the
> documents
> > that have words like "figth" and "feight", not documents with words like
> > "sight" and "light".
> > Is there any suggested approach for doing this?
> >
> > regards,
> > Kamesh
>


Re: How apache solr stores indexes

2013-05-28 Thread Shashi Kant
Better still, start here: http://en.wikipedia.org/wiki/Inverted_index

http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

And there are several books on search engines and related algorithms.
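In miniature, the structure those references describe (a term mapped to its posting list of document ids) is just:

```python
from collections import defaultdict

docs = {
    1: "solr builds an inverted index",
    2: "an inverted index maps terms to documents",
}

# Build the inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["inverted"]))  # [1, 2]
print(sorted(index["solr"]))      # [1]
```

Lucene's on-disk format adds compression, positions, and skip lists on top, but the lookup direction (term to documents, not document to terms) is the same; and yes, terms are stored once in a dictionary rather than repeated per document.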



On Tue, May 28, 2013 at 10:41 PM, Alexandre Rafalovitch
wrote:

> And you need to know this why?
>
> If you are really trying to understand how this all works under the
> covers, you need to look at Lucene's inverted index as a start. Start
> here:
> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description
>
> Might take you a couple of weeks to put it all together.
>
> Or you could try asking the actual business-level question that you
> need an answer to. :-)
>
> Regards,
>Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Tue, May 28, 2013 at 10:13 PM, Kamal Palei 
> wrote:
> > Dear All
> > I have a basic doubt how the data is stored in apache solr indexes.
> >
> > Say I have thousand registered users in my site. Lets say I want to store
> > skills of each users as a multivalued string index.
> >
> > Say
> > user 1 has skill set - Java, MySql, PHP
> > user 2 has skill set - C++, MySql, PHP
> > user 3 has skill set - Java, Android, iOS
> > ... so on
> >
> > You can see user 1 and 2 has two common skills that is MySql and PHP
> > In an actual case there might be millions of repetition of words.
> >
> > Now question is, does apache solr stores them as just words, OR converts
> > each words to an unique number and stores the number only.
> >
> > Best Regards
> > Kamal
> > Net Cloud Systems
> > Bangalore, India
>


Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
My suggestion would be to look into multi-tenancy in ElasticSearch: http://www.elasticsearch.org/.
It is easy to set up and use for multiple indexes.


On Tue, Jul 17, 2012 at 9:26 PM, Zhang, Lisheng
 wrote:
> Thanks very much for quick help! Multicore sounds interesting,
> I roughly read the doc, so we need to put each core name into
> Solr config XML, if we add another core and change XML, do we
> need to restart Solr?
>
> Best regards, Lisheng
>
> -Original Message-
> From: shashi@gmail.com [mailto:shashi....@gmail.com]On Behalf Of
> Shashi Kant
> Sent: Tuesday, July 17, 2012 5:46 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Could I use Solr to index multiple applications?
>
>
> Look up multicore solr. Another choice could be ElasticSearch - which
> is more straightforward in managing multiple indexes IMO.
>
>
>
> On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng
>  wrote:
>> Hi,
>>
>> We have an application where we index data into many different directories 
>> (each directory
>> is corresponding to a different lucene IndexSearcher).
>>
>> Looking at Solr config it seems that Solr expects there is only one indexed 
>> data directory,
>> can we use Solr for our application?
>>
>> Thanks very much for helps, Lisheng
>>


Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
Look up multicore solr. Another choice could be ElasticSearch - which
is more straightforward in managing multiple indexes IMO.



On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng
 wrote:
> Hi,
>
> We have an application where we index data into many different directories 
> (each directory
> is corresponding to a different lucene IndexSearcher).
>
> Looking at Solr config it seems that Solr expects there is only one indexed 
> data directory,
> can we use Solr for our application?
>
> Thanks very much for helps, Lisheng
>


Re: Does Solr fit my needs?

2012-04-27 Thread Shashi Kant
We have used both Solr and graph databases for our XML file indexing. Both
are equivalent in terms of performance, but a graph DB (such as Neo4j)
offers a lot more flexibility in joining across nodes and traversing.
If your data is strictly hierarchical, Solr might do it; otherwise I suggest
looking at a graph database such as Neo4j.



On Fri, Apr 27, 2012 at 10:36 AM, Bob Sandiford <
bob.sandif...@sirsidynix.com> wrote:

> ndexing and searching of the specific fields, it is certainly possible to
> retrieve the xml file.  While Solr isn't a DB, it does allow a binary field
> to be associated with an index document.  We store a GZipped XML file in a
> binary field and retrieve that under certain conditions to get at original
> document information.  We've found that Solr can handle these much faster
> than our DB can do.  (We regularly index a large portion of our documents,
> and the XML files are prone to frequent changes).  If you DO keep such a
> blob in your Solr index, make sure you retrieve that field ONLY when you
> really want it...


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-23 Thread Shashi Kant
You can update documents in the index quite frequently. I don't know what
your requirement is; another option would be to boost at query time.

On Sun, Jan 22, 2012 at 5:51 AM, Bing Li  wrote:
> Dear Shashi,
>
> Thanks so much for your reply!
>
> However, I think the value of PageRank is not a static one. It must update
> on the fly. As I know, Lucene index is not suitable to be updated too
> frequently. If so, how to deal with that?
>
> Best regards,
> Bing
>
>
> On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant  wrote:
>>
>> Lucene has a mechanism to "boost" up/down documents using your custom
>> ranking algorithm. So if you come up with something like Pagerank
>> you might do something like doc.SetBoost(myboost), before writing to
>> index.
>>
>>
>>
>> On Sat, Jan 21, 2012 at 5:07 PM, Bing Li  wrote:
>> > Hi, Kai,
>> >
>> > Thanks so much for your reply!
>> >
>> > If the retrieving is done on a string field, not a text field, a
>> > complete
>> > matching approach should be used according to my understanding, right?
>> > If
>> > so, how does Lucene rank the retrieved data?
>> >
>> > Best regards,
>> > Bing
>> >
>> > On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu  wrote:
>> >
>> >> Solr is kind of retrieval step, you can customize the score formula in
>> >> Lucene. But it supposes not to be too complicated, like it's better can
>> >> be
>> >> factorization. It also regards to the stored information, like
>> >> TF,DF,position, etc. You can do 2nd phase rerank to the top N data you
>> >> have
>> >> got.
>> >>
>> >> Sent from my iPad
>> >>
>> >> On Jan 21, 2012, at 1:33 PM, Bing Li  wrote:
>> >>
>> >> > Dear all,
>> >> >
>> >> > I am using SolrJ to implement a system that needs to provide users
>> >> > with
>> >> > searching services. I have some questions about Solr searching as
>> >> follows.
>> >> >
>> >> > As I know, Lucene retrieves data according to the degree of keyword
>> >> > matching on text field (partial matching).
>> >> >
>> >> > But, if I search data by string field (complete matching), how does
>> >> Lucene
>> >> > sort the retrieved data?
>> >> >
>> >> > If I want to add new sorting ways, Solr's function query seems to
>> >> > support
>> >> > this feature.
>> >> >
>> >> > However, for a complicated ranking strategy, such PageRank, can Solr
>> >> > provide an interface for me to do that?
>> >> >
>> >> > My ranking ways are more complicated than PageRank. Now I have to
>> >> > load
>> >> all
>> >> > of matched data from Solr first by keyword and rank them again in my
>> >> > ways
>> >> > before showing to users. It is correct?
>> >> >
>> >> > Thanks so much!
>> >> > Bing
>> >>
>
>


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-21 Thread Shashi Kant
Lucene has a mechanism to "boost" documents up/down using your custom
ranking algorithm. So if you come up with something like PageRank,
you might do something like doc.setBoost(myboost) before writing to the index.



On Sat, Jan 21, 2012 at 5:07 PM, Bing Li  wrote:
> Hi, Kai,
>
> Thanks so much for your reply!
>
> If the retrieving is done on a string field, not a text field, a complete
> matching approach should be used according to my understanding, right? If
> so, how does Lucene rank the retrieved data?
>
> Best regards,
> Bing
>
> On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu  wrote:
>
>> Solr is kind of retrieval step, you can customize the score formula in
>> Lucene. But it supposes not to be too complicated, like it's better can be
>> factorization. It also regards to the stored information, like
>> TF,DF,position, etc. You can do 2nd phase rerank to the top N data you have
>> got.
>>
>> Sent from my iPad
>>
>> On Jan 21, 2012, at 1:33 PM, Bing Li  wrote:
>>
>> > Dear all,
>> >
>> > I am using SolrJ to implement a system that needs to provide users with
>> > searching services. I have some questions about Solr searching as
>> follows.
>> >
>> > As I know, Lucene retrieves data according to the degree of keyword
>> > matching on text field (partial matching).
>> >
>> > But, if I search data by string field (complete matching), how does
>> Lucene
>> > sort the retrieved data?
>> >
>> > If I want to add new sorting ways, Solr's function query seems to support
>> > this feature.
>> >
>> > However, for a complicated ranking strategy, such PageRank, can Solr
>> > provide an interface for me to do that?
>> >
>> > My ranking ways are more complicated than PageRank. Now I have to load
>> all
>> > of matched data from Solr first by keyword and rank them again in my ways
>> > before showing to users. It is correct?
>> >
>> > Thanks so much!
>> > Bing
>>


Re: Solr, SQL Server's LIKE

2011-12-29 Thread Shashi Kant
For a simple, hackish (albeit inefficient) approach, look up wildcard searches,

e.g. foo*, *bar



On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten
 wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it 
> could be very helpful in many of my upcoming projects. I am trying to decide 
> whether Solr is appropriate for this one, and I haven't had luck looking for 
> answers on Google.
>
> I need to search a list of names of companies and individuals pretty exactly. 
> T-SQL's LIKE operator does this with decent performance, but I have a feeling 
> there is a way to configure Solr to do this better. I've tried using an edge 
> N-gram tokenizer, but it feels like it might be more complicated than 
> necessary. What would you suggest?
>
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
> more complicated (magic) searches that I don't think SQL Server can handle, 
> since its tokens (as far as I know) can't be smaller than one word.
>
> Thanks,
>
> Devon Baumgarten
>


Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread Shashi Kant
You can also look at cosine similarity (or related metrics) to measure
document similarity.
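A minimal term-frequency cosine sketch (a real dedup pipeline would use tf-idf vectors or shingles; this only illustrates the metric itself):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two documents' term-frequency vectors:
    dot product divided by the product of vector lengths."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

d1 = "the quick brown fox"
d2 = "the quick brown fox jumps"
print(round(cosine(d1, d2), 3))  # 0.894
```

For an "80% match" dedup rule as in the subject line, one would flag pairs whose similarity exceeds a chosen threshold (0.8, say) as near-duplicates; the threshold is application-specific.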

On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04  wrote:
> Hi iorixxx,
>
> Thanks for the quick update.I hope I can take it from here !
>
>
> Regards,
>
> Vibhor
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3614253.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Score

2011-08-15 Thread Shashi Kant
https://wiki.apache.org/lucene-java/ScoresAsPercentages



On Mon, Aug 15, 2011 at 8:13 PM, Bill Bell  wrote:

> How do I change the score to scale it between 0 and 100 irregardless of the
> score?
>
> q.alt=*:*&bq=lang:Spanish&defType=dismax
>
> Bill Bell
> Sent from mobile
>
>


Re: Multiple Cores on different machines?

2011-08-09 Thread Shashi Kant
"Betamax VCR"? really ? :-)



On Tue, Aug 9, 2011 at 3:38 PM, Chris Hostetter wrote:

>
> : A quick question - is it possible to have 2 cores in Solr on two
> different
> : machines?
>
> your question is a little vague ... like asking "is it possible to have to
> have two betamax VCRs in two different rooms of my house" ... sure, if you
> want ... but why are you asking the question?  are you expecting those
> VCRs to be doing something special that makes you wonder if that special
> thing will work when there are two of them?
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
> -Hoss
>


Re: Solr can not index "F**K"!

2011-07-31 Thread Shashi Kant
Check your stopwords list.
On Jul 31, 2011 6:25 PM, "François Schiettecatte" 
wrote:
> That seems a little far fetched, have you checked your analysis?
>
> François
>
> On Jul 31, 2011, at 4:58 PM, randohi wrote:
>
>> One of our clients (a hot girl!) brought this to our attention:
>> In this document there are many f* words:
>>
>> http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm
>>
>> and we have indexed it with latest version of Solr (ver 3.3). But, we if
we
>> search F**K, it does not return the document back!
>>
>> We have tried to index it with different text types, but still not
working.
>>
>> Any idea why F* can not be indexed - being censored by the government? :D
>>
>>
>> --
>> View this message in context:
http://lucene.472066.n3.nabble.com/Solr-can-not-index-F-K-tp3214246p3214246.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: searching a subset of SOLR index

2011-07-05 Thread Shashi Kant
A range query on the document id field should do it.
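The id restriction can ride along as a filter query so only that subset is scored. A hedged sketch of building the request parameters (the numeric `id` field name matches the question; any Solr client would send these):

```python
def subset_query(text_query, id_min, id_max):
    """Build Solr request params that restrict a text search to a
    document-id range via a filter query (cached, non-scoring)."""
    return {
        "q": text_query,
        "fq": f"id:[{id_min} TO {id_max}]",
    }

print(subset_query("some query string", 100, 1000))
```

Because `fq` results are cached independently of `q`, repeated searches over the same id window reuse the filter.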


On Tue, Jul 5, 2011 at 4:37 AM, Jame Vaalet  wrote:
> Hi,
> Let say, I have got 10^10 documents in an index with unique id being document 
> id which is assigned to each of those from 1 to 10^10 .
> Now I want to search a particular query string in a subset of these documents 
> say ( document id 100 to 1000).
>
> The question here is.. will SOLR able to search just in this set of documents 
> rather than the entire index ? if yes what should be query to limit search 
> into this subset ?
>
> Regards,
> JAME VAALET
> Software Developer
> EXT :8108
> Capital IQ
>
>


Re: Solr vs ElasticSearch

2011-05-31 Thread Shashi Kant
Here is a very interesting comparison

http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/


> -Original Message-
> From: Mark
> Sent: May-31-11 10:33 PM
> To: solr-user@lucene.apache.org
> Subject: Solr vs ElasticSearch
>
> I've been hearing more and more about ElasticSearch. Can anyone give me a
> rough overview on how these two technologies differ. What are the
> strengths/weaknesses of each. Why would one choose one of the other?
>
> Thanks
>
>


Re: I need an available solr lucene consultant

2011-05-17 Thread Shashi Kant
You might be better off looking for freelancers on sites such as
odesk.com, guru.com, rentacoder.com, elance.com & many more


On Tue, May 17, 2011 at 4:09 PM, Markus Jelsma
 wrote:
> Check this out:
> http://wiki.apache.org/solr/Support
>
>> Hi,
>>
>> I am looking for an experienced and skilled Solr & Lucene
>> developer/consultant to work on a software project incorporating natural
>> language processing and machine learning algorithms. As part of a larger
>> NLP/AI project that is under way, we need someone to install, refine and
>> optimize Solr and Lucene for our website. The data being analyzed will be
>> from user-generated textual discussions around a multitude of topics that
>> will continuously be updated. You must be able to work in a LAMP
>> environment with other developers, be smart, reliable, and a self-starter
>> with excellent problem solving and analytical abilities. You must have a
>> solid grasp of English – written and verbal.
>>
>> Please note that I am a start-up and I am not going to be able to pay what
>> a large established company can pay.
>>
>> Thank you,
>>
>> Lance
>>
>> -
>> Lance
>


Re: Looking for help with Solr implementation

2010-11-12 Thread Shashi Kant
Have you tried posting on odesk.com? I have had decent success finding
Solr/Lucene resources there.


On Thu, Nov 11, 2010 at 7:52 PM, AC  wrote:

> Hi,
>
>
> Not sure if this is the correct place to post but I'm looking for someone
> to
> help finish a Solr install on our LAMP based website.  This would be a paid
> project.
>
>
> The programmer that started the project got too busy with his full-time job
> to
> finish the project.  Solr has been installed and a basic search is working
> but
> we need to configure it to work across the site and also set-up faceted
> search.I tried posting on some popular freelance sites but haven't been
> able
> to find anyone with real Solr expertise / experience.
>
>
> If you think you can help me with this project please let me know and I can
> supply more details.
>
>
> Regards
>
>
>


Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Shashi Kant
On Fri, Oct 29, 2010 at 6:00 PM, Ron Mayer  wrote:

> I have some documents with a bunch of attachments (images, thumbnails
> for them, audio clips, word docs, etc); and am currently dealing with
> them by just putting a path on a filesystem to them in solr; and then
> jumping through hoops of keeping them in sync with solr.
>
>

Not sure why that is an issue. Keeping them in sync would be the same
effort as storing them on a file system; why would storing within Solr
be any different?


> Would it be nuts to stick the image data itself in solr?
>
> More specifically - if I have a bunch of large stored fields,
> would it significantly impact search performance in the
> cases when those fields aren't fetched.
>
>
Hard to say. I assume you mean storing the data base64-encoded. If
you do not retrieve the field when fetching, AFAIK it should not affect
search performance significantly, if at all.
So if you manage your retrieval carefully, you should be fine.


> Searches are very common in this system, and it's very rare
> that someone actually opens up one of these attachments
> so I'm not really worried about the time it takes to fetch
> them when someone does actually want one.
>
>


Re: Color search for images

2010-09-17 Thread Shashi Kant
>
> What I am envisioning (at least to start) is have all this add two fields in
> the index.  One would be for color information for the color similarity
> search.  The other would be a simple multivalued text field that we put
> keywords into based on what OpenCV can detect about the image.  If it
> detects faces, we would put "face" into this field.  Other things that it
> can detect would result in other keywords.
>
> For the color search, I have a few inter-related hurdles.  I've got to
> figure out what form the color data actually takes and how to represent it
> in Solr.  I need Java code for Solr that can take an input color value and
> find similar values in the index.  Then I need some code that can go in our
> feed processing scripts for new content.  That code would also go into a
> crawler script to handle existing images.
>

You are on the right track. You can create a set of representative
keywords from the image. OpenCV gets a color histogram from the image
- you can set the bin values to be as granular as you need - and from a
look-up list of color names you can generate a multivalued field (MVF)
representative of the image.
If you want to get more sophisticated, weight the color terms with
payloads in proportion to the distribution of each color in the
image.

Another approach would be to segment the image and extract colors from
each. So if you have a red rose with all white background, the textual
representation would be something like:

white, white...red...white, white

Play around and see which works best.
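A sketch of the bin-to-keyword idea: repeat each color name in proportion to its coverage so term frequency reflects how dominant the color is. The bin-to-name mapping and weighting scheme below are made-up illustrations, not OpenCV output:

```python
def histogram_to_tokens(histogram, color_names, min_fraction=0.05):
    """Turn a color histogram (bin id -> pixel count) into color-name
    tokens for a multivalued text field, repeating names roughly in
    proportion to coverage and dropping negligible colors."""
    tokens = []
    total = sum(histogram.values())
    for bin_id, count in histogram.items():
        fraction = count / total
        if fraction >= min_fraction:
            tokens += [color_names[bin_id]] * max(1, round(fraction * 10))
    return tokens

names = {0: "white", 1: "red"}
# A mostly-white image with a red region, as in the rose example:
print(histogram_to_tokens({0: 900, 1: 100}, names))
```

Indexing these tokens lets ordinary tf-based scoring rank images by color dominance; the payload approach mentioned above is the finer-grained version of the same idea.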

HTH


Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
Start with q=*:*, then the “numFound” attribute of the result element
in the response should give you the rows to fetch in a 2nd request.
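A sketch of that two-request pattern; the `search` callable and the response shape mimic Solr's parsed JSON output, and the stub below exists only so the example is self-contained:

```python
def fetch_all(search):
    """Two requests: probe with rows=0 to read numFound, then refetch
    with rows=numFound to get every document."""
    probe = search(q="*:*", rows=0)
    total = probe["response"]["numFound"]
    return search(q="*:*", rows=total)["response"]["docs"]

# Stand-in for a real Solr client call (illustrative only):
_docs = [{"id": i} for i in range(25)]
def fake_search(q, rows):
    return {"response": {"numFound": len(_docs), "docs": _docs[:rows]}}

print(len(fetch_all(fake_search)))  # 25
```

On large indexes fetching everything in one response is memory-heavy; paging through with fixed-size rows/start windows is the gentler variant of the same idea.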


On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  wrote:
> That will stil just return 10 rows for me.  Is there something else in
> the configuration of solr to have it return all the rows in the
> results?
>
> -- Chris
>
>
>
> On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>> q=*:*
>>
>> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
>>> I have some queries that I'm running against a solr instance (older,
>>> 1.2 I believe), and I would like to get *all* the results back (and
>>> not have to put an absurdly large number as a part of the rows
>>> parameter).
>>>
>>> Is there a way that I can do that?  Any help would be appreciated.
>>>
>>> -- Chris
>>>
>>
>


Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
q=*:*

On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
> I have some queries that I'm running against a solr instance (older,
> 1.2 I believe), and I would like to get *all* the results back (and
> not have to put an absurdly large number as a part of the rows
> parameter).
>
> Is there a way that I can do that?  Any help would be appreciated.
>
> -- Chris
>


Re: Color search for images

2010-09-16 Thread Shashi Kant
> Lire looks promising, but how hard is it to integrate the content-based
> search into Solr as opposed to Lucene?  I myself am not a Java developer.  I
> have access to people who are, but their time is scarce.
>


Lire is a nascent effort and, based on a cursory overview a while back,
IMHO an over-simplified version of what a CBIR engine should be.
It uses CEDD (color & edge descriptors).
It wouldn't work for the kind of applications I am working on, which
need, among other things, color, shape, orientation, pose, and
edge/corner features.

OpenCV has a steep learning curve but, having been through it, it is a very
powerful toolkit - the best there is by far! BTW the code is in C++,
but it has both Java & .NET bindings.
This is a fabulous book to get hold of:
http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
if you are seriously into OpenCV.

Please feel free to reach out if you need any help with OpenCV +
Solr/Lucene. I have spent quite a bit of time on this.


Re: Color search for images

2010-09-16 Thread Shashi Kant
On Thu, Sep 16, 2010 at 3:21 AM, Lance Norskog  wrote:
> Yes, notice the flowers are all a medium-dark crimson red. There are a bunch
> of these image-indexing & search technologies, but there is no (to my
> knowledge) "finished technology"- it's very much an area of research. If you
> want to search the word 'flower' and index data that can find blobs of red,
> that might be easy with public tools. But there are many hard problems.
>

Lance, is there *ever* a "finished technology"? >-)


Re: Color search for images

2010-09-15 Thread Shashi Kant
> I'm sure there's some post doctoral types who could get a graphic shape 
> analyzer, color analyzer, to at least say it's a flower.
>
> However, even Google would have to build new datacenters to have the 
> horsepower to do that kind of graphic processing.
>

Not necessarily true. Like.com - which incidentally got acquired by
Google recently - built a true visual search technology and applied it
on a large scale.


Re: Color search for images

2010-09-15 Thread Shashi Kant
>
> On a related note, I'm curious if anyone has run across a good set of
> algorithms (or hopefully a library) for doing naive image
> classification. I'm looking for something that can classify images
> into something similar to the broad categories that Google image
> search has (Face, Photo, Clip Art, Line Drawing, etc.).
>
>
> --Paul
>

OpenCV is the way to go. Very comprehensive set of algorithms.


Re: Color search for images

2010-09-15 Thread Shashi Kant
Shawn, I have done some research into this; machine vision, especially
on a large scale, is a hard problem, not to be entered into lightly. I
would recommend starting with OpenCV - a comprehensive toolkit for
extracting various features such as color, edges, etc. from images. There
is also a project, LIRE (http://www.semanticmetadata.net/lire/), which
attempts to do something along the lines of what you are thinking of. I'm
not sure how well it works.

HTH,
Shashi


On Wed, Sep 15, 2010 at 10:59 AM, Shawn Heisey  wrote:
>  My index consists of metadata for a collection of 45 million objects, most
> of which are digital images.  The executives have fallen in love with
> Google's color image search.  Here's a search for "flower" with a red color
> filter:
>
> http://www.google.com/images?q=flower&tbs=isch:1,ic:specific,isc:red
>
> I am interested in duplicating this.  Can this group of fine people point me
> in the right direction?  I don't want anyone to do it for me, just help me
> find software and/or algorithms that can extract the color information, then
> find a way to get Solr to index and search it.
>
> Thanks,
> Shawn
>
>


Re: Indexing all versions of Microsoft Office Documents

2010-04-27 Thread Shashi Kant
If you are on Windows try the Microsoft IFilter API - it supports
current Office versions.
http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en



On Tue, Apr 27, 2010 at 6:08 AM, Roland Villemoes  
wrote:
> Hi All,
>
> Does anyone have a running solution indexing Microsoft Office Documents e.g. 
> .docx .xlsx etc. ?
>
> I can see a lot of examples using Tika for rich content extraction, but still 
> nothing when it comes to newer versions of Microsoft Office?
> What libraries to use of not Tika?
>
> med venlig hilsen/best regards
>
> Roland Villemoes
> Tel: (+45) 22 69 59 62
> E-Mail: mailto:r...@alpha-solutions.dk
>
> Alpha Solutions A/S
> Borgergade 2, 3.sal, 1300 København K
> Tel: (+45) 70 20 65 38
> Web: http://www.alpha-solutions.dk
>
> ** This message including any attachments may contain confidential and/or 
> privileged information intended only for the person or entity to which it is 
> addressed. If you are not the intended recipient you should delete this 
> message. Any printing, copying, distribution or other use of this message is 
> strictly prohibited. If you have received this message in error, please 
> notify the sender immediately by telephone, or e-mail and delete all copies 
> of this message and any attachments from your system. Thank you.
>
>


Re: LucidWorks Solr

2010-04-21 Thread Shashi Kant
Why do these approaches have to be mutually exclusive?
Do a dictionary lookup; if no satisfactory match is found, fall back to an
algorithmic stemmer. That would probably save a few CPU cycles by running
the algorithmic stemmer only when necessary.
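
That fallback is a few lines to sketch. The dictionary below is a toy stand-in for a real exception list, and `algorithmic_stem` is a crude placeholder for a real stemmer such as Porter:

```python
# Toy exception dictionary; a production one would be far larger.
DICTIONARY = {"geese": "goose", "ran": "run", "mice": "mouse"}

def algorithmic_stem(word):
    """Crude stand-in for a real algorithmic stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem(word):
    # Dictionary lookup first; run the algorithm only on a miss.
    return DICTIONARY.get(word) or algorithmic_stem(word)

print(stem("geese"))     # goose  (dictionary hit)
print(stem("googling"))  # googl  (algorithmic fallback)
```

The second case shows the point Robert raised: a new coinage like "googling" misses the dictionary but still conflates via the algorithm.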


On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir  wrote:
> sy to look at the "faults" of some algorithmic stemmer, in truth its
> only purpose is to cause related forms of the word to conflate to the same
> form, and hopefully avoiding unrelated terms from conflating to this form.
>
> A dictionary-based stemmer is out-of-date the day you put it into
> production: languages aren't static. For example, you can't expect a
> dictionary-based stemmer to properly deal with forms like "googling" or
> "tweets" that have recently slipped into English vocabulary, but an
> algorithmic stemmer will likely deal with these just fine.


Re: Query time only Ranges

2010-03-31 Thread Shashi Kant
In that case, you could just calculate an offset in seconds from 00:00:00
(ignoring the date). Pretty simple.
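
That offset is one line of arithmetic; store it in a numeric Solr field and range-query it (the field name `time_of_day` below is just an example):

```python
from datetime import time

def seconds_since_midnight(t):
    """Offset of a time-of-day from 00:00:00, ignoring the date."""
    return t.hour * 3600 + t.minute * 60 + t.second

print(seconds_since_midnight(time(9, 30, 15)))  # 34215
# Everything between 09:00 and 17:00, regardless of date:
# fq=time_of_day:[32400 TO 61200]
```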


On Wed, Mar 31, 2010 at 4:57 PM, abhatna...@vantage.com
 wrote:
>
> Hi Sashi,
> Could you elaborate point no .1 in the light of case where in a field should
> have just time?
>
>
> Ankit
>
>
> --
> View this message in context: 
> http://n3.nabble.com/Query-time-only-Ranges-tp688831p689413.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query time only Ranges

2010-03-31 Thread Shashi Kant
I suggest approaching it thus:

1. Create a datetime offset from a baseline date (say Jan 1, 1900,
00:00:00) and store the difference from that date-time in seconds.
2. Use a numeric range query. I find this approach works faster and
also gives you the granularity you want.
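
A sketch of both steps — the baseline date is arbitrary, and `time_offset` is a made-up name for a numeric (e.g. tlong) Solr field:

```python
from datetime import datetime

BASELINE = datetime(1900, 1, 1, 0, 0, 0)

def offset_seconds(dt):
    """Seconds elapsed since the baseline; store this in the numeric field."""
    return int((dt - BASELINE).total_seconds())

start = offset_seconds(datetime(2010, 3, 31, 16, 40, 0))
end   = offset_seconds(datetime(2010, 3, 31, 17, 0, 0))
# Numeric range query against the stored offsets, second-level precision:
print("time_offset:[%d TO %d]" % (start, end))
```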




On Wed, Mar 31, 2010 at 4:40 PM, abhatna...@vantage.com
 wrote:
>
> One issue though – first I need precision upto seconds.
>
> Also does anybody knows that performance issue involved with this
> granularity.
>
> How about the approach of breaking date time field into fields like hours,
> mins, secs
>
>
>
> Ankit
> --
> View this message in context: 
> http://n3.nabble.com/Query-time-only-Ranges-tp688831p689373.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/


On Thu, Jan 28, 2010 at 6:54 AM, Shashi Kant  wrote:
> Look at Payload.
>
> On Thu, Jan 28, 2010 at 6:48 AM, murali k  wrote:
>>
>> Say I have a clothes store,  i have ladies clothes, mens clothes
>>
>> when someone searches for "clothes", i want to prioritize mens clothing
>> results,
>> how can I achieve this ?
>> this logic should only apply for this keyword, other keywords should work
>> as-is
>>
>> should I be trying with something on synonyms or during the process of
>> indexing ? or something in dismax request handler ?
>>
>>
>>
>>
>> --
>> View this message in context: 
>> http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>


Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
Look at Payload.

On Thu, Jan 28, 2010 at 6:48 AM, murali k  wrote:
>
> Say I have a clothes store,  i have ladies clothes, mens clothes
>
> when someone searches for "clothes", i want to prioritize mens clothing
> results,
> how can I achieve this ?
> this logic should only apply for this keyword, other keywords should work
> as-is
>
> should I be trying with something on synonyms or during the process of
> indexing ? or something in dismax request handler ?
>
>
>
>
> --
> View this message in context: 
> http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: HI

2009-12-13 Thread Shashi Kant
http://lmgtfy.com/?q=lucene+basics


On Sun, Dec 13, 2009 at 1:01 PM, Faire Mii  wrote:

> Hi,
>
> I am a beginner and i wonder what a document, entity and a field relates to
> in a database?
>
> And i wonder if there are some good tutorials that learn you how to design
> your schema. Because all other articles i have
>
> read aren't very helpful for beginners.
>
> Regards
>
> Fayer
>


Re: Migrating to Solr

2009-11-24 Thread Shashi Kant
Here is a link that might be helpful:

http://sesat.no/moving-from-fast-to-solr-review.html

The site is chock-a-block with great information on their migration
experience.


On Tue, Nov 24, 2009 at 8:55 AM, Tommy Molto  wrote:

> Hi,
>
> I'm new at Solr and i need to make a "test pilot" of a migration from Fast
> ESP to Apache Solr, anyone had this experience before?
>
>
> Att,
>


Re: Solr - Load Increasing.

2009-11-16 Thread Shashi Kant
I think it would be useful for members of this list to realize that not
everyone uses the same units and terminology.

It is very easy for Americans to use the imperial system and presume everyone
does the same, for Europeans to do likewise with the metric system, etc.
Hopefully members of this list can be persuaded to use, or at least clarify,
their terminology.

While the apocryphal saying goes "the great thing about standards is that
there are so many to choose from", we should all make an effort to communicate
across cultures and nations.



On Mon, Nov 16, 2009 at 5:33 PM, Israel Ekpo  wrote:

> On Mon, Nov 16, 2009 at 5:22 PM, Walter Underwood  >wrote:
>
> > Probably "lakh": 100,000.
> >
> > So, 900k qpd and 3M docs.
> >
> > http://en.wikipedia.org/wiki/Lakh
> >
> > wunder
> >
> > On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote:
> >
> > > Hi,
> > >
> > > Your autoCommit settings are very aggressive.  I'm guessing that's
> what's
> > causing the CPU load.
> > >
> > > btw. what is "laks"?
> > >
> > > Otis
> > > --
> > > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> > >
> > >
> > >
> > > - Original Message 
> > >> From: kalidoss 
> > >> To: solr-user@lucene.apache.org
> > >> Sent: Mon, November 16, 2009 9:11:21 AM
> > >> Subject: Solr - Load Increasing.
> > >>
> > >> Hi All.
> > >>
> > >>   My server solr box cpu utilization  increasing b/w 60 to 90% and
> some
> > time
> > >> solr is getting down and we are restarting it manually.
> > >>
> > >>   No of documents in solr 30 laks.
> > >>   No of add/update requrest solr 30 thousand / day. Avg of every 30
> > minutes
> > >> around 500 writes.
> > >>   No of search request 9laks / day.
> > >>   Size of the data directory: 4gb.
> > >>
> > >>
> > >>   My system ram is 8gb.
> > >>   System available space 12gb.
> > >>   processor Family: Pentium Pro
> > >>
> > >>   Our solr data size can be increase in number like 90 laks. and
> writes
> > per day
> > >> will be around 1laks.   - Hope its possible by solr.
> > >>
> > >>   For write commit i have configured like
> > >>
> > >>  1
> > >>  10
> > >>
> > >>
> > >>   Is all above can be possible? 90laks datas and 1laks per day writes
> > and
> > >> 30laks per day read??  - if yes what type of system configuration
> would
> > require.
> > >>
> > >>   Please suggest us.
> > >>
> > >> thanks,
> > >> Kalidoss.m,
> > >>
> > >>
> > >> Get your world in your inbox!
> > >>
> > >> Mail, widgets, documents, spreadsheets, organizer and much more with
> > your
> > >> Sifymail WIYI id!
> > >> Log on to http://www.sify.com
> > >>
> > >> ** DISCLAIMER **
> > >> Information contained and transmitted by this E-MAIL is proprietary to
> > Sify
> > >> Limited and is intended for use only by the individual or entity to
> > which it is
> > >> addressed, and may contain information that is privileged,
> confidential
> > or
> > >> exempt from disclosure under applicable law. If this is a forwarded
> > message, the
> > >> content of this E-MAIL may not have been sent with the authority of
> the
> > Company.
> > >> If you are not the intended recipient, an agent of the intended
> > recipient or a
> > >> person responsible for delivering the information to the named
> > recipient,  you
> > >> are notified that any use, distribution, transmission, printing,
> copying
> > or
> > >> dissemination of this information in any way or in any manner is
> > strictly
> > >> prohibited. If you have received this communication in error, please
> > delete this
> > >> mail & notify us immediately at ad...@sifycorp.com
> > >
> >
> >
>
>
> Thanks Walter for clarifying that.
>
> I too was wondering what "laks" meant.
>
> It was a bit distracting when I read the original post.
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
>


Re: Search Within

2009-04-04 Thread Shashi Kant
This post describes a search-within-search implementation:

http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html


Shashi


On Sat, Apr 4, 2009 at 1:21 PM, Vernon Chapman wrote:

> Bess,
>
> I think that might work I'll try it out and see how it works for my case.
>
> thanks
>
>
> Bess Sadler wrote:
>
>> Hi, Vernon.
>>
>> In Blacklight, the way we've been doing this is just to stack queries on
>> top of each other. It's a conceptual shift from the way one might think
>> about "search within", but it accomplishes the same thing. For example:
>>
>> search1 ==> q=horse
>>
>> search2 ==> q=horse AND dog
>>
>> The second search, from the user's point of view, takes the search results
>> from the horse search and further narrows them to those items that also
>> contain dog. But you're really just doing a new search, one that contains
>> both search values.
>>
>> Does that help? Or am I misunderstanding your question?
>>
>> Bess
>>
>> On 4-Apr-09, at 12:10 PM, Vernon Chapman wrote:
>>
>>  I am not sure if this is a really easy or newbee-ish type question.
>>> I would like to implement a search within these results type feature.
>>> Has anyone done this and could you please share some tips, pointers and
>>> or documentation on how to implement this.
>>>
>>> Thanks
>>>
>>> Vern
>>>
>>>
>>
>>
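
The stacking Bess describes is just query composition on the client side. A minimal sketch (field qualifiers and special-character escaping omitted):

```python
def narrow(previous_q, new_term):
    """'Search within results': AND the new term onto the previous query."""
    return "(%s) AND %s" % (previous_q, new_term) if previous_q else new_term

q = narrow("", "horse")   # first search
q = narrow(q, "dog")      # search within those results
print(q)                  # (horse) AND dog
```

Each narrowing step sends the composed string as the `q` parameter of a fresh Solr request; Solr's caches make the repeated portion cheap.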


Re: Hardware Questions...

2009-03-24 Thread Shashi Kant
Have you looked at http://wiki.apache.org/solr/SolrPerformanceData ?

On Tue, Mar 24, 2009 at 4:51 PM, solr  wrote:

> We have three Solr servers (several two processor Dell PowerEdge
> servers). I'd like to get three newer servers and I wanted to see what
> we should be getting. I'm thinking the following...
>
>
>
> Dell PowerEdge 2950 III
>
> 2x2.33GHz/12M 1333MHz Quad Core
>
> 16GB RAM
> 6 x 146GB 15K RPM RAID-5 drives
>
>
>
> How do people spec out servers, especially CPU, memory and disk? Is this
> all based on the number of doc's, indexes, etc...
>
>
>
> Also, what are people using for benchmarking and monitoring Solr? Thanks
> - Mike
>
>


Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Can anyone back that up?

IMHO Tesseract is the state of the art in OCR, but I'm not sure that "Ocropus
builds on Tesseract".
Can anyone confirm that Vikram has a point?

Shashi




- Original Message 
From: Vikram Kumar 
To: solr-user@lucene.apache.org; Shashi Kant 
Sent: Thursday, February 26, 2009 9:21:07 PM
Subject: Re: Use of scanned documents for text extraction and indexing

Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant  wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
>
>
>
> - Original Message 
> From: Hannes Carl Meyer 
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> m...@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> sithu.sudar...@fda.hhs.gov> wrote:
>
> >
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (may be PDF), and then use some OCR or other technique for
> > extracting text, and the resultant index quality?
> >
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudar...@fda.hhs.gov
> > sdsudar...@ualr.edu
> >
> >
> >
>
>



Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Another project worth investigating is Tesseract.

http://code.google.com/p/tesseract-ocr/




- Original Message 
From: Hannes Carl Meyer 
To: solr-user@lucene.apache.org
Sent: Thursday, February 26, 2009 11:35:14 AM
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

m...@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
sithu.sudar...@fda.hhs.gov> wrote:

>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for
> extracting text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudar...@fda.hhs.gov
> sdsudar...@ualr.edu
>
>
>



Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
Steve - could you not just subscribe to the list from another (off-mobile-device)
email account (Gmail or Yahoo, for example)?
We discourage using corporate email for subscribing to mailing lists precisely for
such reasons: volume, spam, malware risks, etc.

Shashi




- Original Message 
From: Stephen Weiss 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 7:34:30 PM
Subject: Re: why don't we have a forum for discussion?

Like an earlier poster, my issue isn't on the laptop, it's with my mobile 
device.  The sheer volume of e-mail overwhelms the thing sometimes (right now, 
for instance).  There's really no option for moving the e-mail off to some 
other folder, it just all goes to one place.

Perhaps that means I need a better phone, it's just the obvious solutions 
aren't always practical.  Forums can conversely just as easily be set up to 
emulate mailing lists as well...  Our company's internal forum works this way.

--
Steve

On Feb 18, 2009, at 7:16 PM, Mike Klaas wrote:

> 
> 
> 2. Many people greatly prefer the mailing list format (obviously, it takes a 
> little bit of effort to use mailinglists effectively (e.g., directing the 
> traffic to a folder/tag/etc.)



Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
One man's "crap" is another man's treasure. :-P

So how would you decide what is worth posting?
If you feel the list is overwhelming your email, set up some filters.


Shashi


- Original Message 
From: Tony Wang 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 2:06:57 PM
Subject: why don't we have a forum for discussion?

I am just curious why we don't have a forum for discussion or you guys think
it's really necessary to receive lots of crap information about Solr and
nutch in email? I can offer you a forum for discussion anyway.

-- 
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信