[Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)

2017-05-24 Thread Robert Hume
Hi,

Following up on my last email question ... I've learned more and I
simplified my question ...

I have a Solr 3.6 deployment.  Currently I'm using
solr.StandardTokenizerFactory to parse tokens during indexing.

Here are two example streams that demonstrate my issue:

Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
... which is good.

Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
... which is not good because users can't search by "000123".

It seems StandardTokenizerFactory treats the "6,000" differently (like it's
currency or a product number, maybe?) so it doesn't tokenize at the comma.

QUESTION: How can I enhance StandardTokenizer to do everything it's doing
now plus produce a couple of additional tokens like this ...

`bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`

... so users can search by "000123"?
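
For illustration only, the desired output can be modeled in plain Java. This is a minimal sketch, not Solr or Lucene code: the class `ExtraTokens` and the method `expand` are hypothetical names, and the sketch assumes the tokenizer's output is already available as a list of strings. In a real deployment the same logic would have to live inside the analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: models the desired output only, not a Solr or Lucene class. */
final class ExtraTokens {

    /** Returns the input tokens followed by the comma-separated parts of any token that contains a comma. */
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);   // keep every original token
        for (String token : tokens) {
            if (token.indexOf(',') < 0) {
                continue;                             // nothing to split
            }
            for (String part : token.split(",")) {
                if (!part.isEmpty()) {
                    out.add(part);                    // additional searchable token
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("bob", "a-6,000123", "xyz")));
    }
}
```

Given the three tokens from Example 2, the sketch keeps them and appends `a-6` and `000123`, which is the token list asked for above.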

Thanks!
Rob


Why is Standard Tokenizer not separating at this comma?

2017-05-24 Thread Robert Hume
I have a Solr 3.6 deployment I inherited.

The schema.xml specifies the use of StandardTokenizerFactory like so ...


...
  
...


According to this reference guide (
https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html) ...
the StandardTokenizer will treat punctuation as delimiters.


However, here is my content that gets indexed:

"IOM-1:BA9ATS0FAB,\"Company Name

Module\",8.1.0.16.0.2,B-A,06KB09029932,PASS,,0,0,0,Y:0,0,0,0,0:BA9AUT0FAB,\"Company
CM Rear Module\",B-6,09XP12133407,"



This piece `B-A,06KB09029932` gets tokenized into two words ... `|B-A|`
and `|06KB09029932|`.


But this piece `B-6,09XP12133407` gets tokenized into one word ...
`|B-6,09XP12133407|`.

What I've observed is that the comma is not considered a delimiter when it is
preceded by a digit ... almost as if it considers "6,000" to be currency or
something?
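
The observation can be written down as a small model in plain Java. This is a sketch under an assumption, not StandardTokenizer's actual grammar: it assumes a comma is kept only when it has a digit on both sides, which is one rule that reproduces the two pieces quoted above. `CommaModel` and `split` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the observed behavior; this is not StandardTokenizer's real grammar. */
final class CommaModel {

    /** Splits at every comma except one that has a digit on both sides. */
    static List<String> split(String text) {
        // First alternative: a comma that is not preceded by a digit.
        // Second alternative: a comma that is not followed by a digit.
        return Arrays.asList(text.split("(?<![0-9]),|,(?![0-9])"));
    }

    public static void main(String[] args) {
        System.out.println(split("B-A,06KB09029932"));  // comma follows a letter, so it splits
        System.out.println(split("B-6,09XP12133407"));  // comma sits between two digits, so it stays
    }
}
```

The first call yields two pieces and the second yields one, matching what was seen in the index.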


QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how
commas are used as delimiters?

Rob


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Robert Hume
Thanks for that!  I was thinking (B) too, but wanted guidance that I'm
using the tool correctly.

Am still interested in hearing opinions from others, thanks!

rh

On Tue, Feb 21, 2017 at 8:17 PM, Dave  wrote:

> B is a better option long term. Solr is meant for retrieving flat data,
> fast, not hierarchical. That's what a database is for and trust me you
> would rather have a real database on the end point.  Each tool has a
> purpose, solr can never replace a relational database, and a relational
> database could not replace solr. Start with the slow model (database) for
> control/display and enhance with the fast model (solr) for retrieval/search
>
>
>
> > On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> >
> > To learn how to properly use Solr, I'm building a little experimental
> > project with it to search for used car listings.
> >
> > Car listings appear in a variety of different places ... central places
> > like Craigslist and also many, many individual Used Car dealership websites.
> >
> > I am wondering, should I:
> >
> > (a) deploy a Solr search engine and build individual indexers for every
> > type of web site I want to find listings on?
> >
> > or
> >
> > (b) build my own database to store car listings, and then build services
> > that scrape data from different sites and feed entries into the database;
> > then point my Solr search to my database, one simple source of listings?
> >
> > My concerns are:
> >
> > With (a) ... I have to be smart enough to understand all those different
> > data sources and remove/update listings when they change; will this be
> > harder to do with custom Solr indexers than writing something from
> > scratch?
> >
> > With (b) ... I'm maintaining a huge database of all my listings which
> > seems redundant; google doesn't make a *copy* of everything on the internet,
> > it just knows it's there.  Is maintaining my own database a bad design?
> >
> > Thanks for reading!
>


Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Robert Hume
To learn how to properly use Solr, I'm building a little experimental
project with it to search for used car listings.

Car listings appear in a variety of different places ... central places
like Craigslist and also many, many individual Used Car dealership websites.

I am wondering, should I:

(a) deploy a Solr search engine and build individual indexers for every
type of web site I want to find listings on?

or

(b) build my own database to store car listings, and then build services
that scrape data from different sites and feed entries into the database;
then point my Solr search to my database, one simple source of listings?

My concerns are:

With (a) ... I have to be smart enough to understand all those different
data sources and remove/update listings when they change; will this be
harder to do with custom Solr indexers than writing something from scratch?

With (b) ... I'm maintaining a huge database of all my listings which seems
redundant; google doesn't make a *copy* of everything on the internet, it
just knows it's there.  Is maintaining my own database a bad design?
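
To make option (b) concrete, here is a rough sketch in plain Java. It is only an illustration: `ListingStore` is a hypothetical in-memory stand-in for the database, the URLs are made up, and the "index"/"delete" strings stand for whatever update and delete requests would eventually be sent to Solr. The point is that scrapers only talk to the store, and the store records which changes the search index still needs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the listings database in option (b). */
final class ListingStore {

    private final Map<String, String> listings = new LinkedHashMap<>(); // listing URL -> description
    private final List<String> pendingIndexOps = new ArrayList<>();     // changes the search index has not seen yet

    /** Called by a scraper when a listing is new or has changed. */
    void upsert(String url, String description) {
        String previous = listings.put(url, description);
        if (!description.equals(previous)) {
            pendingIndexOps.add("index " + url);      // unchanged listings are not re-indexed
        }
    }

    /** Called by a scraper when a listing has disappeared from its source site. */
    void remove(String url) {
        if (listings.remove(url) != null) {
            pendingIndexOps.add("delete " + url);
        }
    }

    /** Hands the accumulated changes to whatever feeds the search index, then clears them. */
    List<String> drainPendingIndexOps() {
        List<String> ops = new ArrayList<>(pendingIndexOps);
        pendingIndexOps.clear();
        return ops;
    }

    public static void main(String[] args) {
        ListingStore store = new ListingStore();
        store.upsert("http://example.com/1", "2009 sedan");
        store.remove("http://example.com/1");
        System.out.println(store.drainPendingIndexOps());
    }
}
```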

Thanks for reading!


how to tell HttpSolrServer client to accept/ignore all certs?

2016-11-14 Thread Robert Hume
I'm using HttpSolrServer (in Solr 3.6) to connect to a Solr web service and
perform a query.

The certificate at the other end has expired and so connections now fail.

It will take the IT at the other end too many days to replace the cert
(this is out of my control).

How can I tell the HttpSolrServer to ignore bad certs when it does queries
to the server?

NOTE 1: I noticed that I can pass my own Apache HttpClient (we're currently
using 4.3) into the HttpSolrServer constructor, but internally
HttpSolrServer seems to do a lot of customizing/configuring of its own
default HttpClient, so I didn't want to mess with that.

NOTE 2: This is a 100% internal application, so there are no real security
problems with this temporary workaround.
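
For reference, the JDK-only part of such a workaround might look like the sketch below. It builds an `SSLContext` whose trust manager accepts every certificate, expired ones included. `TrustAllCerts` and `newContext` are hypothetical names. The sketch assumes the resulting context would then be wired into a custom HttpClient that is passed to the HttpSolrServer constructor; that wiring depends on the HttpClient version and is not shown here. It disables certificate validation entirely, so it only suits a temporary, internal setup like the one described.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper name; uses only JDK classes. Disables certificate validation. */
final class TrustAllCerts {

    /** Builds a TLS context whose trust manager never rejects a certificate chain. */
    static SSLContext newContext() {
        TrustManager acceptAll = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // accept every client certificate
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // accept every server certificate, even an expired one
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { acceptAll }, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build trust-all SSLContext", e);
        }
    }
}
```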

Thanks!!

rh


[Newbie question] what is a "core" and are they different from 3.x to 5.x ?

2015-11-05 Thread Robert Hume
Trying to learn about SOLR.

I can see there is something called a "core" ... it appears there can be
many cores for a single SOLR server.

Can someone "explain like I'm five" -- what is a core?

And how do "cores" differ from 3.x to 5.x?

Any pointers in the right direction are helpful!

Thanks!
Rob


[Newbie question] in SOLR 5, would I have a "master-to-slave" relationship for two servers?

2015-11-05 Thread Robert Hume
Hi,

In my SOLR 3 deployment (inherited it), I have (1) one SOLR server that is
used by my web application, and (2) a second SOLR server that is used to
index documents via a customer datasource.

The database of server 2 is considered the "master" and it is replicated
regularly to server 1, the "slave".

The advantage is the responsiveness of server 1 is not impacted when server
2 gets busy with lots of indexing.

QUESTION: When deploying a SOLR 5 setup, do I set things up the same way?
Or do I cluster both servers together into one "cloud"?   That is, in
SOLR 5, how do I ensure the indexing process will not impact the
performance of the web app?

Any help is greatly appreciated!!

Rob


Re: Should I install 4.x or 5.x? Book recommendations?

2015-10-23 Thread Robert Hume
Hi Alex,

What's the title of your book?  An amazon link would be useful too.

Thanks!
Rob

On Fri, Oct 23, 2015 at 2:50 PM, Alexandre Rafalovitch 
wrote:

> Definitely 5.x. Lots of new goodies. It is true that some of the
> startup scripts are different and the example schemas could be
> slightly confusing if following a book, but I think it is well worth
> starting on a good foot. Just remember, no "collection1" anymore, all
> cores/collections are explicit. And there are a tutorial and a reference
> guide available to help you along.
>
> And "Solr in Action" is a great book to purchase. Though, I'd
> recommend an electronic copy unless you want an exercise regime as
> well :-)
>
> I would say grab my book as well if you just want step by step
> introduction, but frankly it is definitely out of date (Solr 4.3!) and
> publisher pushed the price up into the ridiculous territory last time
> I checked. So, don't buy it. But if you have an O'Reilly Safari account
> or some other way to get to it, give it a glance too.
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 23 October 2015 at 14:22, Robert Hume  wrote:
> > Hi,
> >
> > I'm investigating installing a new Solr deployment to be able to search
> > about two million documents (mostly HTML and PDF).
> >
> > QUESTIONS:
> >
> > A. Should I use Solr 4.x or 5.x?  My concerns are mostly to do with
> > support.  Is 5.x too new to be able to get good answers and advice from
> > the community?  Or should I stick with the latest 4.x release?
> >
> > B. Anyone have a good book recommendation?  I was thinking of buying
> > "Solr In Action" but it looks like it was published in April 2014 so it
> > won't have any 5.x info in it?
> >
> > Thanks!
> > Rob
>


Should I install 4.x or 5.x? Book recommendations?

2015-10-23 Thread Robert Hume
Hi,

I'm investigating installing a new Solr deployment to be able to search
about two million documents (mostly HTML and PDF).

QUESTIONS:

A. Should I use Solr 4.x or 5.x?  My concerns are mostly to do with
support.  Is 5.x too new to be able to get good answers and advice from the
community?  Or should I stick with the latest 4.x release?

B. Anyone have a good book recommendation?  I was thinking of buying "Solr
In Action" but it looks like it was published in April 2014 so it won't
have any 5.x info in it?

Thanks!
Rob


[newbie] questions about 3.6.0 and 4.x or 5.x ?

2015-10-21 Thread Robert Hume
Hello, I'm hoping to get some quick advice from the Solr gurus out there ...



I’ve inherited a project that uses a Solr 3.6.0 deployment.   (Several
masters and several slaves – I think there are 6 Solr instances in total.)



I’ve been tasked with investigating if upgrading our 3.6.0 deployment will
improve performance – there’s a lot of data and things are getting slow,
apparently.



I’ve read Apache docs that from 3.6.x to 4.x there were improvements in
scalability and performance.



I see that from 4.x to 5.x that Solr is now a standalone server and no
longer just a WAR running on Tomcat.





QUESTIONS:


A. Is it worth upgrading to 4.x or 5.x?  Will I see a big improvement in
performance?



B. Should I go to 4.x or 5.x?  Will 4.x be an easier upgrade path since
it's just a new WAR file?



C. In a nutshell ... what will the upgrade path look like, what kind of
steps am I in for, and how can I avoid trouble?




Any help is GREATLY appreciated!!


Rob