Re: Cloud Deployment Strategy... In the Cloud

2015-09-24 Thread Dan Davis
Ant is very good at this sort of thing, and easier for Java devs to learn
than Make.  Python has a module called Fabric that is also very capable, but
for my dev ops it is one more thing to learn.
I tend to divide things into three categories:

 - Things that have to do with system setup, and need to be run as root.
For this I write a bash script (I should learn puppet, but...)
 - Things that have to do with one-time installation as a solr admin user
with /bin/bash, including upconfig.   For this I use an ant build (a rough
shell equivalent is sketched after this list).
 - Normal operational procedures.   For this, I typically use Solr admin or
scripts, but I wish I had time to create a good webapp (or money to
purchase Fusion).
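
Since the upconfig and create-collection steps come up again below, here is
roughly what my ant build wraps, expressed as plain shell commands.  This is
only a sketch - the Solr install path, the zkhost string, and the config and
collection names are assumptions, not a recommendation:

# one-time setup as the solr admin user (Solr 5.x layout assumed)
ZKHOST=zk1:2181,zk2:2181,zk3:2181
# upload the collection config to ZooKeeper ("upconfig")
/opt/solr/server/scripts/cloud-scripts/zkcli.sh \
    -zkhost $ZKHOST -cmd upconfig -confdir ./myconf -confname myconf
# create the collection against that config set
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"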


On Thu, Sep 24, 2015 at 12:39 AM, Erick Erickson 
wrote:

> bq: What tools do you use for the "auto setup"? How do you get your config
> automatically uploaded to zk?
>
> Both uploading the config to ZK and creating collections are one-time
> operations, usually done manually. Currently uploading the config set is
> accomplished with zkCli (yes, it's a little clumsy). There's a JIRA to put
> this into solr/bin as a command, though. They'd be easy enough to script in
> any given situation with a shell script or wizard.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:33 PM, Steve Davids  wrote:
>
> > What tools do you use for the "auto setup"? How do you get your config
> > automatically uploaded to zk?
> >
> > On Tue, Sep 22, 2015 at 2:35 PM, Gili Nachum 
> wrote:
> >
> > > Our auto setup sequence is:
> > > 1.deploy 3 zk nodes
> > > 2. Deploy solr nodes and start them connecting to zk.
> > > 3. Upload collection config to zk.
> > > 4. Call create collection rest api.
> > > 5. Done. SolrCloud ready to work.
> > >
> > > Don't yet have automation for replacing or adding a node.
> > > On Sep 22, 2015 18:27, "Steve Davids"  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to come up with a repeatable process for deploying a Solr
> > > Cloud
> > > > cluster from scratch along with the appropriate security groups, auto
> > > > scaling groups, and custom Solr plugin code. I saw that LucidWorks
> > > created
> > > > a Solr Scale Toolkit but that seems to be more of a one-shot deal
> than
> > > > really setting up your environment for the long-haul. Here is where we
> > are
> > > > at right now:
> > > >
> > > >1. ZooKeeper ensemble is easily brought up via a Cloud Formation
> > > Script
> > > >2. We have an RPM built to lay down the Solr distribution + Custom
> > > >plugins + Configuration
> > > >3. Solr machines come up and connect to ZK
> > > >
> > > > Now, we are using Puppet which could easily create the
> core.properties
> > > file
> > > > for the corresponding core and have ZK get bootstrapped but that
> seems
> > to
> > > > be a no-no these days... So, can anyone think of a way to get ZK
> > > > bootstrapped automatically with pre-configured Collection
> > configurations?
> > > > Also, is there a recommendation on how to deal with machines that are
> > > > coming/going? As I see it machines will be getting spun up and
> > terminated
> > > > from time to time and we need to have a process of dealing with that,
> > the
> > > > first idea was to just use a common node name so if a machine was
> > > > terminated a new one can come up and replace that particular node but
> > on
> > > > second thought it would seem to require an auto scaling group *per*
> > node
> > > > (so it knows what node name it is). For a large cluster this seems
> > crazy
> > > > from a maintenance perspective, especially if you want to be elastic
> > with
> > > > regard to the number of live replicas for peak times. So, then the
> next
> > > > idea was to have some outside observer listen to when new ec2
> instances
> > > are
> > > > created or terminated (via CloudWatch SQS) and make the appropriate
> API
> > > > calls to either add the replica or delete it, this seems doable but
> > > perhaps
> > > > not the simplest solution that could work.
> > > >
> > > > I was hoping others have already gone through this and have valuable
> > > advice
> > > > to give, we are trying to setup Solr Cloud the "right way" so we
> don't
> > > get
> > > > nickel-and-dimed to death from an O&M perspective.
> > > >
> > > > Thanks,
> > > >
> > > > -Steve
> > > >
> > >
> >
>


Re: Solr authentication - Error 401 Unauthorized

2015-09-12 Thread Dan Davis
It seems that you have secured Solr so thoroughly that you cannot now run
bin/solr status!

bin/solr has no arguments as yet for providing a username/password - as
someone who is mostly a user, like you, I'm not sure of the roadmap.

I think you should relax those restrictions a bit and try again.
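
In the meantime, one way to check that the credentials themselves are fine is
to call the endpoint bin/solr status uses directly with curl and basic auth.
A sketch only - the username/password below are placeholders:

curl -u solradmin:secret "http://localhost:8983/solr/admin/info/system?wt=json"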

On Fri, Sep 11, 2015 at 5:06 AM, Merlin Morgenstern <
merlin.morgenst...@gmail.com> wrote:

> I have secured solr cloud via basic authentication.
>
> Now I am having difficulties creating cores and getting status information.
> Solr keeps telling me that the request is unauthorized. However, I have
> access to the admin UI after login.
>
> How do I configure solr to use the basic authentication credentials?
>
> This is the error message:
>
> /opt/solr-5.3.0/bin/solr status
>
> Found 1 Solr nodes:
>
> Solr process 31114 running on port 8983
>
> ERROR: Failed to get system information from http://localhost:8983/solr
> due
> to: org.apache.http.client.ClientProtocolException: Expected JSON response
> from server but received: 
>
> 
>
> 
>
> Error 401 Unauthorized
>
> 
>
> HTTP ERROR 401
>
> Problem accessing /solr/admin/info/system. Reason:
>
> UnauthorizedPowered by
> Jetty://
>
>
> 
>
> 
>


Re: Solr authentication - Error 401 Unauthorized

2015-09-12 Thread Dan Davis
Noble,

You should also look at this if it is intended to be more than an internal
API.   Using the minor protections I added to test SOLR-8000, I was able to
reproduce a problem very like this:

bin/solr healthcheck -z localhost:2181 -c mycollection

Since Solr /select is protected...
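
You can see the same thing without bin/solr by hitting /select directly; a
sketch, with placeholder collection name and credentials:

curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=0"
# returns 401 Unauthorized
curl -u solradmin:secret "http://localhost:8983/solr/mycollection/select?q=*:*&rows=0"
# returns results once the user has read permission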

On Sat, Sep 12, 2015 at 9:40 AM, Dan Davis <dansm...@gmail.com> wrote:

> It seems that you have secured Solr so thoroughly that you cannot now run
> bin/solr status!
>
> bin/solr has no arguments as yet for providing a username/password - as
> someone who is mostly a user, like you, I'm not sure of the roadmap.
>
> I think you should relax those restrictions a bit and try again.
>
> On Fri, Sep 11, 2015 at 5:06 AM, Merlin Morgenstern <
> merlin.morgenst...@gmail.com> wrote:
>
>> I have secured solr cloud via basic authentication.
>>
>> Now I am having difficulties creating cores and getting status
>> information.
>> Solr keeps telling me that the request is unauthorized. However, I have
>> access to the admin UI after login.
>>
>> How do I configure solr to use the basic authentication credentials?
>>
>> This is the error message:
>>
>> /opt/solr-5.3.0/bin/solr status
>>
>> Found 1 Solr nodes:
>>
>> Solr process 31114 running on port 8983
>>
>> ERROR: Failed to get system information from http://localhost:8983/solr
>> due
>> to: org.apache.http.client.ClientProtocolException: Expected JSON response
>> from server but received: 
>>
>> 
>>
>> 
>>
>> Error 401 Unauthorized
>>
>> 
>>
>> HTTP ERROR 401
>>
>> Problem accessing /solr/admin/info/system. Reason:
>>
>> UnauthorizedPowered by
>> Jetty://
>>
>>
>> 
>>
>> 
>>
>
>


Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
Kevin & Noble,

I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.

I reproduced the initial problem with reloading security.json after
restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
ZooKeeper does retain the changes to the file after using
/solr/admin/authorization, and that therefore the problem was Solr.

After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
know how to give parameters to ant server), I expanded it, copied in the
core data, and then started it.   I was prompted for a password, and it let
me in once the password was given.

I'll probably get to SOLR-8004 shortly, since I have both environments
built and working.

It also occurs to me that it might be better to forbid all permissions and
grant specific permissions to specific roles.   Is there a comprehensive
list of the permissions available?


On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee <kgle...@yahoo.com.invalid> wrote:

> Thanks Dan!  Please let us know what you find.  I’m interested to know if
> this is an issue with anyone else’s setup or if I have an issue in my local
> configuration that is still preventing it to work on start/restart.
>
> - Kevin
>
> > On Sep 5, 2015, at 8:45 AM, Dan Davis <dansm...@gmail.com> wrote:
> >
> > Kevin & Noble,
> >
> > I'll take it on to test this.   I've built from source before, and I've
> > wanted this authorization capability for awhile.
> >
> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee <kgle...@yahoo.com.invalid>
> wrote:
> >
> >> Noble,
> >>
> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
> >> the restart fix?
> >>
> >> At startup, these are the log messages that say there is no security
> >> configuration and the plugins aren’t being used even though
> security.json
> >> is in Zookeeper:
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
> Security
> >> conf doesn't exist. Skipping setup for authorization module.
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
> >> authentication plugin used.
> >>
> >> Thanks,
> >> Kevin
> >>
> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul <noble.p...@gmail.com> wrote:
> >>>
> >>> There are no download links for 5.3.x branch  till we do a bug fix
> >> release
> >>>
> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
> >>> check here
> >>
> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >>>
> >>> If you wish to get the binaries for 5.3 branch you will have to make it
> >>> (you will need to install svn and ant)
> >>>
> >>> Here are the steps
> >>>
> >>> svn checkout
> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> >>> cd lucene_solr_5_3/solr
> >>> ant server
> >>>
> >>>
> >>>
> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
> >>> <davidphilipcher...@gmail.com> wrote:
> >>>> Hi Kevin/Noble,
> >>>>
> >>>> What is the download link to take the latest? What are the steps to
> >> compile
> >>>> it, test and use?
> >>>> We also have a use case to have this feature in solr too. Therefore,
> >> wanted
> >>>> to test and above info would help a lot to get started.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee <kgle...@yahoo.com.invalid>
> >> wrote:
> >>>>
> >>>>> Thanks, I downloaded the source and compiled it and replaced the jar
> >> file
> >>>>> in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem to
> >> be
> >>>>> protecting the Collections API reload command now as long as I upload
> >> the
> >>>>> security.json after startup of the Solr instances.  If I shutdown and
> >> bring
> >>>>> the instances back up, the security is no longer in place and I have
> to
> >>>>> upload the security.json again for it to take effect.
> >>>>>
> >>>>> - Kevin
> >>>>>
> >>>>>> On Sep 3, 2015, at 10:29 PM, Noble Paul <noble.p...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Both these are committed.

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
SOLR-8004 also appears to work for me.   I manually edited security.json and
did putfile.   I didn't bother with the browse permission, because it was
Kevin's workaround.   solr-5.3.1-SNAPSHOT did challenge me for credentials
when going to curl
http://localhost:8983/solr/admin/collections?action=CREATE and so on...
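
For the record, the check looked roughly like this (collection name and
credentials are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=1"
# 401 Unauthorized without credentials
curl -u solradmin:secret "http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=1"
# succeeds for a user with the collection-admin-edit permission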

On Thu, Sep 10, 2015 at 11:10 PM, Dan Davis <dansm...@gmail.com> wrote:

> Kevin & Noble,
>
> I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.
>
> I reproduced the initial problem with reloading security.json after
> restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
> ZooKeeper does retain the changes to the file after using
> /solr/admin/authorization, and that therefore the problem was Solr.
>
> After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
> know how to give parameters to ant server), I expanded it, copied in the
> core data, and then started it.   I was prompted for a password, and it let
> me in once the password was given.
>
> I'll probably get to SOLR-8004 shortly, since I have both environments
> built and working.
>
> It also occurs to me that it might be better to forbid all permissions and
> grant specific permissions to specific roles.   Is there a comprehensive
> list of the permissions available?
>
>
> On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee <kgle...@yahoo.com.invalid>
> wrote:
>
>> Thanks Dan!  Please let us know what you find.  I’m interested to know if
>> this is an issue with anyone else’s setup or if I have an issue in my local
>> configuration that is still preventing it to work on start/restart.
>>
>> - Kevin
>>
>> > On Sep 5, 2015, at 8:45 AM, Dan Davis <dansm...@gmail.com> wrote:
>> >
>> > Kevin & Noble,
>> >
>> > I'll take it on to test this.   I've built from source before, and I've
>> > wanted this authorization capability for awhile.
>> >
>> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee <kgle...@yahoo.com.invalid>
>> wrote:
>> >
>> >> Noble,
>> >>
>> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
>> >> the restart fix?
>> >>
>> >> At startup, these are the log messages that say there is no security
>> >> configuration and the plugins aren’t being used even though
>> security.json
>> >> is in Zookeeper:
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
>> Security
>> >> conf doesn't exist. Skipping setup for authorization module.
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
>> >> authentication plugin used.
>> >>
>> >> Thanks,
>> >> Kevin
>> >>
>> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul <noble.p...@gmail.com> wrote:
>> >>>
>> >>> There are no download links for 5.3.x branch  till we do a bug fix
>> >> release
>> >>>
>> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
>> >>> check here
>> >>
>> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
>> >>>
>> >>> If you wish to get the binaries for 5.3 branch you will have to make
>> it
>> >>> (you will need to install svn and ant)
>> >>>
>> >>> Here are the steps
>> >>>
>> >>> svn checkout
>> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
>> >>> cd lucene_solr_5_3/solr
>> >>> ant server
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
>> >>> <davidphilipcher...@gmail.com> wrote:
>> >>>> Hi Kevin/Noble,
>> >>>>
>> >>>> What is the download link to take the latest? What are the steps to
>> >> compile
>> >>>> it, test and use?
>> >>>> We also have a use case to have this feature in solr too. Therefore,
>> >> wanted
>> >>>> to test and above info would help a lot to get started.
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>>
>> >>>> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee <kgle...@yahoo.com.invalid
>> >
>> >> wrote:
>> >>>>
>> >>>>> Thanks, I downloaded the source and compiled it and replaced

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-05 Thread Dan Davis
Kevin & Noble,

I'll take it on to test this.   I've built from source before, and I've
wanted this authorization capability for awhile.

On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee  wrote:

> Noble,
>
> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
> the restart fix?
>
> At startup, these are the log messages that say there is no security
> configuration and the plugins aren’t being used even though security.json
> is in Zookeeper:
> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer Security
> conf doesn't exist. Skipping setup for authorization module.
> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
> authentication plugin used.
>
> Thanks,
> Kevin
>
> > On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
> >
> > There are no download links for 5.3.x branch  till we do a bug fix
> release
> >
> > If you wish to download the trunk nightly (which is not same as 5.3.0)
> > check here
> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >
> > If you wish to get the binaries for 5.3 branch you will have to make it
> > (you will need to install svn and ant)
> >
> > Here are the steps
> >
> > svn checkout
> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> > cd lucene_solr_5_3/solr
> > ant server
> >
> >
> >
> > On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
> >  wrote:
> >> Hi Kevin/Noble,
> >>
> >> What is the download link to take the latest? What are the steps to
> compile
> >> it, test and use?
> >> We also have a use case to have this feature in solr too. Therefore,
> wanted
> >> to test and above info would help a lot to get started.
> >>
> >> Thanks.
> >>
> >>
> >> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee 
> wrote:
> >>
> >>> Thanks, I downloaded the source and compiled it and replaced the jar
> file
> >>> in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem to
> be
> >>> protecting the Collections API reload command now as long as I upload
> the
> >>> security.json after startup of the Solr instances.  If I shutdown and
> bring
> >>> the instances back up, the security is no longer in place and I have to
> >>> upload the security.json again for it to take effect.
> >>>
> >>> - Kevin
> >>>
>  On Sep 3, 2015, at 10:29 PM, Noble Paul  wrote:
> 
>  Both these are committed. If you could test with the latest 5.3 branch
>  it would be helpful
> 
>  On Wed, Sep 2, 2015 at 5:11 PM, Noble Paul 
> wrote:
> > I opened a ticket for the same
> > https://issues.apache.org/jira/browse/SOLR-8004
> >
> > On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee  >
> >>> wrote:
> >> I’ve found that completely exiting Chrome or Firefox and opening it
> >>> back up re-prompts for credentials when they are required.  It was
> >>> re-prompting with the /browse path where authentication was working
> each
> >>> time I completely exited and started the browser again, however it
> won’t
> >>> re-prompt unless you exit completely and close all running instances
> so I
> >>> closed all instances each time to test.
> >>
> >> However, to make sure I ran it via the command line via curl as
> >>> suggested and it still does not give any authentication error when
> trying
> >>> to issue the command via curl.  I get a success response from all the
> Solr
> >>> instances that the reload was successful.
> >>
> >> Not sure why the pre-canned permissions aren’t working, but the one
> to
> >>> the request handler at the /browse path is.
> >>
> >>
> >>> On Sep 1, 2015, at 11:03 PM, Noble Paul 
> wrote:
> >>>
> >>> " However, after uploading the new security.json and restarting the
> >>> web browser,"
> >>>
> >>> The browser remembers your login , So it is unlikely to prompt for
> the
> >>> credentials again.
> >>>
> >>> Why don't you try the RELOAD operation using command line (curl) ?
> >>>
> >>> On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee
> 
> >>> wrote:
>  The restart issues aside, I’m trying to lockdown usage of the
> >>> Collections API, but that also does not seem to be working either.
> 
>  Here is my security.json.  I’m using the “collection-admin-edit”
> >>> permission and assigning it to the “adminRole”.  However, after
> uploading
> >>> the new security.json and restarting the web browser, it doesn’t seem
> to be
> >>> requiring credentials when calling the RELOAD action on the Collections
> >>> API.  The only thing that seems to work is the custom permission
> “browse”
> >>> which is requiring authentication before allowing me to pull up the
> page.
> >>> Am I using the permissions correctly for the
> RuleBasedAuthorizationPlugin?
> 
> 

Re: analyzer, indexAnalyzer and queryAnalyzer

2015-04-30 Thread Dan Davis
Hi Doug, nice write-up and 2 questions:

- You write your own QParser plugins - can one keep the features of edismax
for field boosting/phrase-match boosting by subclassing edismax?   Assuming
yes...

- What do pf2 and pf3 do in the edismax query parser?

hon-lucene-synonyms plugin links corrections:

http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://github.com/healthonnet/hon-lucene-synonyms
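
For context on the second question, the kind of edismax request I mean looks
something like this (a sketch - the field and collection names are made up,
and my current understanding is that pf/pf2/pf3 add phrase boosts over the
whole query, over word pairs, and over word triples respectively):

curl "http://localhost:8983/solr/mycollection/select?q=hot+dogs+chicago+style&defType=edismax&qf=title+body&pf=title&pf2=title^5&pf3=title^2"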


On Wed, Apr 29, 2015 at 9:24 PM, Doug Turnbull 
dturnb...@opensourceconnections.com wrote:

 So Solr has the idea of a query parser. The query parser is a convenient
 way of passing a search string to Solr and having Solr parse it into
 underlying Lucene queries: You can see a list of query parsers here
 http://wiki.apache.org/solr/QueryParser

 What this means is that the query parser does work to pull terms into
 individual clauses *before* analysis is run. It's a parsing layer that sits
 outside the analysis chain. This creates problems like the sea biscuit
 problem, whereby we declare sea biscuit as a query time synonym of
 seabiscuit. As you may know synonyms are checked during analysis.
 However, if the query parser splits up sea from biscuit before running
 analysis, the query time analyzer will fail. The string sea is brought by
 itself to the query time analyzer and of course won't match sea biscuit.
 Same with the string biscuit in isolation. If the full string sea
 biscuit was brought to the analyzer, it would see [sea] next to [biscuit]
 and declare it a synonym of seabiscuit. Thanks to the query parser, the
 analyzer has lost the association between the terms, and both terms aren't
 brought together to the analyzer.

 My colleague John Berryman wrote a pretty good blog post on this

 http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/

 There's several solutions out there that attempt to address this problem.
 One from Ted Sullivan at Lucidworks

 https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/

 Another popular one is the hon-lucene-synonyms plugin:

 http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html

 Yet another work-around is to use the field query parser:

 http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html

 I also tend to write my own query parsers, so on the one hand its annoying
 that query parsers have the problems above, on the flipside Solr makes it
 very easy to implement whatever parsing you think is appropriatte with a
 small bit of Java/Lucene knowledge.

 Hopefully that explanation wasn't too deep, but its an important thing to
 know about Solr. Are you asking out of curiosity, or do you have a specific
 problem?

 Thanks
 -Doug

 On Wed, Apr 29, 2015 at 6:32 PM, Steven White swhite4...@gmail.com
 wrote:

  Hi Doug,
 
  I don't understand what you mean by the following:
 
   For example, if a user searches for q=hot dogs&defType=edismax&qf=title
   body the *query parser* *not* the *analyzer* first turns the query
 into:
 
  If I have indexAnalyzer and queryAnalyzer in a fieldType that are 100%
  identical, the example you provided, does it stand?  If so, why?  Or do
 you
  mean something totally different by query parser?
 
  Thanks
 
  Steve
 
 
  On Wed, Apr 29, 2015 at 4:18 PM, Doug Turnbull 
  dturnb...@opensourceconnections.com wrote:
 
   * 1) If the content of indexAnalyzer and queryAnalyzer are exactly the
   same,that's the same as if I have an analyzer only, right?*
   1) Yes
  
   *  2) Under the hood, all three are the same thing when it comes to
 what
   kind*
   *of data and configuration attributes can take, right?*
   2) Yes. Both take in text and output a token stream.
  
   *What I'm trying to figure out is this: beside being able to configure
  a*
  
   *fieldType to have different analyzer setting at index and query time,
   thereis nothing else that's unique about each.*
  
   The only thing to look out for in Solr land is the query parser. Most
  Solr
   query parsers treat whitespace as meaningful.
  
    For example, if a user searches for q=hot dogs&defType=edismax&qf=title
   body the *query parser* *not* the *analyzer* first turns the query
 into:
  
   (title:hot title:dog) | (body:hot body:dog)
  
   each word which *then *gets analyzed. This is because the query parser
   tries to be smart and turn hot dog into hot OR dog, or more
  specifically
   making them two must clauses.
  
   This trips quite a few folks up, you can use the field query parser
 which
   uses the field as a phrase query. Hope that helps
  
  
   --
   *Doug Turnbull **| *Search Relevance Consultant | OpenSource
 Connections,
   LLC | 240.476.9983 | http://www.opensourceconnections.com
   Author: Taming Search http://manning.com/turnbull from Manning
   Publications

Re: Odp.: solr issue with pdf forms

2015-04-23 Thread Dan Davis
Steve,

You gave as an example:

Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�
vollständig�sind

This sentence is probably from the PDF form label content, rather than form
values.   Sometimes in PDF, the form's value fields are kept in a separate
file.   I'm 99% sure Tika won't be able to handle that, because it handles
one file at a time.   If the form's value fields are in the PDF, Tika
should be able to handle it, but may be making some small errors that could
be addressed.

When you look at the form in Acrobat Reader, can you see whether the
indexed words contain any words from the form fields' values?
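
One quick way to check what Tika itself extracts, outside of Solr, is the
standalone tika-app jar (a sketch; the jar version and file names are
assumptions):

java -jar tika-app-1.7.jar --text form.pdf > form.txt
grep -i bestätige form.txt   # do the form values show up at all?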

If you have a form where the data is not sensitive, I can investigate.   If
you are interested in this contact me offline - to dansm...@gmail.com or
d...@danizen.net.

Thanks,

Dan

On Thu, Apr 23, 2015 at 11:59 AM, Erick Erickson erickerick...@gmail.com
wrote:

 When you say they're not indexed correctly, what's your evidence?
 You cannot rely
 on the display in the browser, that's the raw input just as it was
 sent to Solr, _not_
 the actual tokens in the index. What do you see when you go to the admin
 schema browser page and load the actual tokens.

 Or use the TermsComponent
 (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
 to see the actual terms in the index as opposed to the stored data you
 see in the browser
 when you look at search results.

 If the actual terms don't seem right _in the index_ we need to see
 your analysis chain,
 i.e. your fieldType definition.

 I'm 90% sure you're seeing the stored data and your terms are indexed
 just fine, but
 I've certainly been wrong before, more times than I want to remember.

 Best,
 Erick

 On Thu, Apr 23, 2015 at 1:18 AM,  steve.sch...@t-systems.com wrote:
  Hey Erick,
 
  thanks for your answer. They are not indexed correctly. Also throught
 the solr admin interface I see these typical questionmarks within a rhombus
 where a blank space should be.
  I now figured out the following (not sure if it is relevant at all):
  - PDF documents created with Acrobat PDFMaker 10.0 for Word are
 indexed correctly, no issues
  - PDF documents (with editable form fields) created with Adobe InDesign
 CS5 (7.0.1)  are indexed with the blank space issue
 
  Best
  Steve
 
  -Ursprüngliche Nachricht-
  Von: Erick Erickson [mailto:erickerick...@gmail.com]
  Gesendet: Mittwoch, 22. April 2015 17:11
  An: solr-user@lucene.apache.org
  Betreff: Re: Odp.: solr issue with pdf forms
 
  Are they not _indexed_ correctly or not being displayed correctly?
  Take a look at admin UI>schema browser>your field and press the load
 terms button. That'll show you what is _in_ the index as opposed to what
 the raw data looked like.
 
  When you return the field in a Solr search, you get a verbatim,
 un-analyzed copy of your original input. My guess is that your browser
 isn't using the compatible character encoding for display.
 
  Best,
  Erick
 
  On Wed, Apr 22, 2015 at 7:08 AM,  steve.sch...@t-systems.com wrote:
  Thanks for your answer. Maybe my English is not good enough, what are
 you trying to say? Sorry I didn't get the point.
  :-(
 
 
  -Ursprüngliche Nachricht-
  Von: LAFK [mailto:tomasz.bo...@gmail.com]
  Gesendet: Mittwoch, 22. April 2015 14:01
  An: solr-user@lucene.apache.org; solr-user@lucene.apache.org
  Betreff: Odp.: solr issue with pdf forms
 
  Out of my head I'd follow how are writable PDFs created and encoded.
 
  @LAFK_PL
Oryginalna wiadomość
  Od: steve.sch...@t-systems.com
  Wysłano: środa, 22 kwietnia 2015 12:41
  Do: solr-user@lucene.apache.org
  Odpowiedz: solr-user@lucene.apache.org
  Temat: solr issue with pdf forms
 
  Hi guys,
 
  hopefully you can help me with my issue. We are using a solr setup and
 have the following issue:
  - usual pdf files are indexed just fine
  - pdf files with writable form-fields look like this:
  Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und v
  ollständig sind
 
  Somehow the blank space character is not indexed correctly.
 
  Is this a know issue? Does anybody have an idea?
 
  Thanks a lot
  Best
  Steve



Re: solr issue with pdf forms

2015-04-22 Thread Dan Davis
Steve,

Are you using ExtractingRequestHandler / DataImportHandler or extracting
the text content from the PDF outside of Solr?
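
If it is ExtractingRequestHandler, the request would look something like this
(a sketch; the core name and literal id are placeholders):

curl "http://localhost:8983/solr/mycore/update/extract?literal.id=form01&commit=true" \
     -F "myfile=@problem-form.pdf"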

On Wed, Apr 22, 2015 at 6:40 AM, steve.sch...@t-systems.com wrote:

 Hi guys,

 hopefully you can help me with my issue. We are using a solr setup and
 have the following issue:
 - usual pdf files are indexed just fine
 - pdf files with writable form-fields look like this:

 Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind

 Somehow the blank space character is not indexed correctly.

 Is this a know issue? Does anybody have an idea?

 Thanks a lot
 Best
 Steve



Re: Odp.: solr issue with pdf forms

2015-04-22 Thread Dan Davis
+1 - I like Erick's answer.  Let me know if that turns out to be the
problem - I'm interested in this problem and would be happy to help.

On Wed, Apr 22, 2015 at 11:11 AM, Erick Erickson erickerick...@gmail.com
wrote:

 Are they not _indexed_ correctly or not being displayed correctly?
 Take a look at admin UI>schema browser>your field and press the
 load terms button. That'll show you what is _in_ the index as
 opposed to what the raw data looked like.

 When you return the field in a Solr search, you get a verbatim,
 un-analyzed copy of your original input. My guess is that your browser
 isn't using the compatible character encoding for display.

 Best,
 Erick

 On Wed, Apr 22, 2015 at 7:08 AM,  steve.sch...@t-systems.com wrote:
  Thanks for your answer. Maybe my English is not good enough, what are
 you trying to say? Sorry I didn't get the point.
  :-(
 
 
  -Ursprüngliche Nachricht-
  Von: LAFK [mailto:tomasz.bo...@gmail.com]
  Gesendet: Mittwoch, 22. April 2015 14:01
  An: solr-user@lucene.apache.org; solr-user@lucene.apache.org
  Betreff: Odp.: solr issue with pdf forms
 
  Out of my head I'd follow how are writable PDFs created and encoded.
 
  @LAFK_PL
Oryginalna wiadomość
  Od: steve.sch...@t-systems.com
  Wysłano: środa, 22 kwietnia 2015 12:41
  Do: solr-user@lucene.apache.org
  Odpowiedz: solr-user@lucene.apache.org
  Temat: solr issue with pdf forms
 
  Hi guys,
 
  hopefully you can help me with my issue. We are using a solr setup and
 have the following issue:
  - usual pdf files are indexed just fine
  - pdf files with writable form-fields look like this:
 
 Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind
 
  Somehow the blank space character is not indexed correctly.
 
  Is this a know issue? Does anybody have an idea?
 
  Thanks a lot
  Best
  Steve



Re: Securing solr index

2015-04-13 Thread Dan Davis
Where you want true Role-Based Access Control (RBAC) on each index (core or
collection), one solution is to buy Solr Enterprise from LucidWorks.

My personal practice is mostly dictated by financial decisions:

   - Each core/index has its configuration directory in a Git
   repository/branch where the Git repository software provides RBAC.
   - This relies on developers to keep a separate Solr for development, and
   then to check-in their configuration directory changes when they are
   satisfied with the changes.   This is probably a best practice anyway :)
   - Continuous Integration pushes the Git configuration appropriately
   when a particular branch changes.
   - The main URL /solr has security provided by Apache httpd on port 80
   (a reverse proxy to http://localhost:8983/solr/)
   - That port is also open, secured by IP address, to other Solr nodes in
   the cluster.
   - The /select request Handler for each core/collection is reverse
   proxied to /search/corename.
   - The Solr Admin UI uses an authentication/authorization handler such that
   only the Search Administrators group has access to it.

The security here relies on search developers not enabling handleSelect
in their solrconfig.xml.   The security can also be extended by adding
security on reverse proxied URLs such as /search/corename and
/update/corename so that the client application needs to know some key,
or have access to an SSL private key file.

The downside is that only the Search Administrators group has access to the
QA or production Solr Admin UI.
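
A minimal sketch of the reverse-proxy piece, assuming Apache httpd with
mod_proxy enabled and a core named mycore (the names and paths are mine, not
a recommendation):

# drop the proxy rules into a conf.d snippet, then reload httpd
sudo tee /etc/httpd/conf.d/solr-proxy.conf >/dev/null <<'EOF'
ProxyPass        /search/mycore http://localhost:8983/solr/mycore/select
ProxyPassReverse /search/mycore http://localhost:8983/solr/mycore/select
ProxyPass        /solr          http://localhost:8983/solr
ProxyPassReverse /solr          http://localhost:8983/solr
EOF
sudo apachectl graceful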


On Mon, Apr 13, 2015 at 6:13 AM, Suresh Vanasekaran 
suresh_vanaseka...@infosys.com wrote:

 Hi,

 We are having the solr index maintained in a central server and multiple
 users might be able to access the index data.

 May I know what are best practice for securing the solr index folder where
 ideally only application user should be able to access. Even an admin user
 should not be able to copy the data and use it in another schema.

 Thanks






Re: What is the best way of Indexing different formats of documents?

2015-04-07 Thread Dan Davis
Sangeetha,

You can also run Tika directly from data import handler, and Data Import
Handler can be made to run several threads if you can partition the input
documents by directory or database id.   I've done 4 threads by having a
base configuration that does an Oracle query like this:

  SELECT * FROM (SELECT id, url, ..., Mod(RowNum, 4) AS threadid FROM ...
WHERE ...) WHERE threadid = %d

A bash/sed script writes several data import handler XML files.
I can then index several threads at a time.

Each of these threads can then use all the transformers, e.g.
templateTransformer, etc.
XML can be transformed via XSLT.

The Data Import Handler has other entities that go out to the web and then
index the document via Tika.

If you are indexing generic HTML, you may want to figure out an approach to
SOLR-3808 and SOLR-2250 - this can be resolved by recompiling Solr and Tika
locally, because Boilerpipe has a bug that has been fixed, but not pushed
to Maven Central.   Without that, the ASF cannot include the fix, but
distributions such as LucidWorks Solr Enterprise can.

I can drop some configs into github.com if I clean them up to obfuscate
host names, passwords, and such.


On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain yavarhus...@gmail.com wrote:

 Well have indexed heterogeneous sources including a variety of NoSQL's,
 RDBMs and Rich Documents (PDF Word etc.) using SolrJ. The only prerequisite
 of using SolrJ is that you should have an API to fetch data from your data
 source (Say JDBC for RDBMS, Tika for extracting text content from rich
 documents etc.) than SolrJ is so damn great and simple. Its as simple as
 downloading the jar and few lines of code to send data to your solr server
 after pre-processing your data. More details here:

 http://lucidworks.com/blog/indexing-with-solrj/

 https://wiki.apache.org/solr/Solrj

 http://www.solrtutorial.com/solrj-tutorial.html

 Cheers,
 Yavar



 On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com 
 sangeetha.subraman...@gtnexus.com wrote:

  Hi,
 
  I am a newbie to SOLR and basically from database background. We have a
  requirement of indexing files of different formats (x12,edifact,
 csv,xml).
  The files which are inputted can be of any format and we need to do a
  content based search on it.
 
  From the web I understand we can use TIKA processor to extract the
 content
  and store it in SOLR. What I want to know is, is there any better
 approach
  for indexing files in SOLR ? Can we index the document through streaming
  directly from the Application ? If so what is the disadvantage of using
 it
  (against DIH which fetches from the database)? Could someone share me
 some
  insight on this ? ls there any web links which I can refer to get some
 idea
  on it ? Please do help.
 
  Thanks
  Sangeetha
 
 



Re: Customzing Solr Dedupe

2015-04-01 Thread Dan Davis
But you can potentially still use Solr dedupe if you do the upfront work
(in RDBMS or NoSQL pre-index processing) to assign some sort of Group ID.
  See OCLC's FRBR Work-Set Algorithm,
http://www.oclc.org/content/dam/research/activities/frbralgorithm/2009-08.pdf?urlm=161376
, for some details on one such algorithm.

If the job is too big for RDBMS, and/or you don't want to use/have a
suitable NoSQL, you can have two Solr indexes (collection/core/whatever) -
one for classification with only id, field1, field2, field3, and another
for production query.   Then, you put stuff into the classification index,
use queries and your own algorithm to do classification, assigning a
groupId, and then put the document with groupId assigned into the
production index.

A key question is whether you want to preserve the groupId.   In some
cases, you do, and in some cases, it is just an internal signature.   In
both cases, a non-deterministic up-front algorithm can work, but if the
groupId needs to be preserved, you need to work harder to make sure it all
hangs together.

Hope this helps,

-Dan

On Wed, Apr 1, 2015 at 7:05 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Solr dedupe is based on the concept of a signature - some fields and rules
 that reduce a document into a discrete signature, and then checking if that
 signature exists as a document key that can be looked up quickly in the
 index. That's the conceptual basis. It is not based on any kind of field by
 field comparison to all existing documents.

 -- Jack Krupansky

 On Wed, Apr 1, 2015 at 6:35 AM, thakkar.aayush thakkar.aay...@gmail.com
 wrote:

  I'm facing a challenges using de-dupliation of Solr documents.
 
  De-duplicate is done using TextProfileSignature with following
 parameters:
   <str name="fields">field1, field2, field3</str>
   <str name="quantRate">0.5</str>
   <str name="minTokenLen">3</str>
 
  Here Field3 is normal text with few lines of data.
  Field1 and Field2 can contain upto 5 or 6 words of data.
 
  I want to de-duplicate when data in field1 and field2 are exactly the
 same
  and 90% of the lines in field3 is matched to that in another document.
 
  Is there anyway to achieve this?
 
 
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Customzing-Solr-Dedupe-tp4196879.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: Solr on Tomcat

2015-02-10 Thread Dan Davis
As an application developer, I have to agree with this direction.   I ran
ManifoldCF and Solr together in the same Tomcat, and the sl4j
configurations of the two conflicted with strange results.   From a systems
administrator/operations perspective, a separate install allows better
packaging, e.g. Debian and RPM packages are then possible, although may not
be preferred as many enterprises will want to use Oracle Java rather than
OpenJDK.

On Tue, Feb 10, 2015 at 1:12 PM, Matt Kuiper matt.kui...@issinc.com wrote:

 Thanks for all the responses.  I am planning a new project, and
 considering deployment options at this time.  It's helpful to see where
 Solr is headed.

 Thanks,

 Matt Kuiper

 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Tuesday, February 10, 2015 10:05 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr on Tomcat

 On 2/10/2015 9:48 AM, Matt Kuiper wrote:
  I am starting to look in to Solr 5.0.  I have been running Solr 4.* on
 Tomcat.   I was surprised to find the following notice on
 https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+Tomcat
  (Marked as Unreleased)
 
   Beginning with Solr 5.0, Support for deploying Solr as a WAR in
 servlet containers like Tomcat is no longer supported.
 
  I want to verify that it is true that Solr 5.0 will not be able to run
 on Tomcat, and confirm that the recommended way to deploy Solr 5.0 is as a
 Linux service.

 Solr will eventually (hopefully soon) be entirely its own application.
 The documentation you have seen in the reference guide is there to prepare
 users for this eventuality.

 Right now we are in a transition period.  We have built scripts for
 controlling the start and stop of the example server installation.
 Under the covers, Solr is still a web application contained in a war and
 the example server still runs an unmodified copy of jetty.  Down the road,
 when Solr will becomes a completely standalone application, we will merely
 have to modify the script wrapper to use it, and the user may not even
 notice the change.

 With 5.0, if you want to run in tomcat, you will be able to find the war
 in the download's server/webapps directory and use it just like you do now
 ... but we will be encouraging people to NOT do this, because eventually it
 will be completely unsupported.

 Thanks,
 Shawn




Re: clarification regarding shard splitting and composite IDs

2015-02-05 Thread Dan Davis
Thanks, Anshum - I should never have posted so late.   It is true that
different users will have different word frequencies, but an application
exploiting that for better relevancy would be going to great lengths for the
relevancy of individual users' results.

On Thu, Feb 5, 2015 at 12:41 AM, Anshum Gupta ans...@anshumgupta.net
wrote:

 Solr 5.0 has support for distributed IDF. Also, users having the same IDF
 is orthogonal to the original question.

 In general, the Doc Freq. is only per-shard. If for some reason, a single
 user has documents split across shards, the IDF used would be different for
 docs on different shards.

 On Wed, Feb 4, 2015 at 9:06 PM, Dan Davis dansm...@gmail.com wrote:

 Doesn't relevancy for that assume that the IDF and TF for user1 and user2
 are not too different?   SolrCloud still doesn't use a distributed IDF,
 correct?

 On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum gilinac...@gmail.com wrote:

  Alright. So shard splitting and composite routing plays nicely together.
  Thank you Anshum.
 
  On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta ans...@anshumgupta.net
  wrote:
 
   In one line, shard splitting doesn't cater to depend on the routing
   mechanism but just the hash range so you could have documents for the
  same
   prefix split up.
  
   Here's an overview of routing in SolrCloud:
   * Happens based on a hash value
   * The hash is calculated using the multiple parts of the routing key.
 In
   case of A!B, 16 bits are obtained from murmurhash(A) and the LSB 16
 bits
  of
   the routing key are obtained from murmurhash(B). This sends the docs
 to
  the
   right shard.
   * When querying using A!, all shards that contain hashes from the
 range
  16
   bits from murmurhash(A)- to murmurhash(A)- are used.
  
   When you split a shard, for say range  -  , it is
 split
   from the middle (by default) and over multiple split, docs for the
 same
  A!
   prefix might end up on different shards, but the request routing
 should
   take care of that.
  
   You can read more about routing here:
   https://lucidworks.com/blog/solr-cloud-document-routing/
  
 http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/
  
   and shard splitting here:
   http://lucidworks.com/blog/shard-splitting-in-solrcloud/
  
  
   On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum gilinac...@gmail.com
  wrote:
  
Hi, I'm also interested. When using composite the ID, the _route_
information is not kept on the document itself, so to me it looks
 like
   it's
not possible as the split API

   
  
 
 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3

doesn't have a relevant parameter to split correctly.
Could report back once I try it in practice.
   
On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose ianr...@fullstory.com
  wrote:
   
 Howdy -

 We are using composite IDs of the form user!event.  This
 ensures
   that
 all events for a user are stored in the same shard.

 I'm assuming from the description of how composite ID routing
 works,
   that
 if you split a shard the split point of the hash range for that
  shard
is
 chosen to maintain the invariant that all documents that share a
   routing
 prefix (before the !) will still map to the same (new) shard.
 Is
   that
 accurate?

 A naive shard-split implementation (e.g. that chose the hash range
   split
 point arbitrarily) could end up with child shards that split a
   routing
 prefix.

 Thanks,
 Ian

   
  
  
  
   --
   Anshum Gupta
   http://about.me/anshumgupta
  
 




 --
 Anshum Gupta
 http://about.me/anshumgupta



Re: Delta import query not working

2015-02-05 Thread Dan Davis
It looks like you are returning the transformed ID, along with some other
fields, in the deltaQuery command.   deltaQuery should only return the ID,
without the stk_ prefix, and then deltaImportQuery should retrieve the
transformed ID.   I'd suggest:

entity ...
 deltaQuery=SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'
 deltaImportQuery=SELECT CONCAT('stk_',id) AS id, part_no, name,
description FROM stock_items WHERE id='${dih.delta.id}'

I'm not sure which RDBMS you are using, but you probably don't need to work
around the column names at all.


On Thu, Feb 5, 2015 at 5:18 PM, willbrindle m...@willbrindle.com wrote:

 Hi,

 I am very new to Solr but I have been playing around with it a bit and my
 imports are all working fine. However, now I wish to perform a delta import
 on my query and I'm just getting nothing.

 I have the entity:

  entity name=stock
   query=SELECT CONCAT('stk_',id) AS id,part_no,name,description
 FROM
 stock_items
   deltaQuery=SELECT CONCAT('stk_',id) AS
>  id,part_no,name,description,updated_at FROM stock_items WHERE updated_at >
 '${dih.delta.last_index_time}'
   deltaImportQuery=SELECT CONCAT('stk_',id) AS id,id AS
 id2,part_no,name,description FROM stock_items WHERE id2='${dih.delta.id
 }'


 I am not too sure if ${dih.delta.id} is supposed to be id or id2 but I
 have
 tried both and neither work. My output is something along the lines of:

 {
   responseHeader: {
 status: 0,
 QTime: 0
   },
   initArgs: [
 defaults,
 [
   config,
   data-config.xml
 ]
   ],
   command: status,
   status: idle,
   importResponse: ,
   statusMessages: {
 Time Elapsed: 0:0:16.778,
 Total Requests made to DataSource: 2,
 Total Rows Fetched: 0,
 Total Documents Skipped: 0,
 Delta Dump started: 2015-02-05 16:17:54,
 Identifying Delta: 2015-02-05 16:17:54,
 Deltas Obtained: 2015-02-05 16:17:54,
 Building documents: 2015-02-05 16:17:54,
 Total Changed Documents: 0,
 Delta Import Failed: 2015-02-05 16:17:54
   },
   WARNING: This response format is experimental.  It is likely to change
 in the future.
 }

 My full import query is working fine.

 Thanks.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Delta import query not working

2015-02-05 Thread Dan Davis
It also should be ${dataimporter.last_index_time}

Also, that's two queries - an outer query to get the IDs that are modified,
and another query (done repeatedly) to get the data.   You can go faster
using a parameterized data import as described in the wiki:

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
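
The gist of that page is to fold the last_index_time test into the main
query and then run a full-import with clean=false, so the incremental run is
just this (core name assumed, and the WHERE pattern paraphrased from the wiki):

# the DIH query itself carries something like:
#   WHERE '${dataimporter.request.clean}' != 'false'
#      OR updated_at > '${dataimporter.last_index_time}'
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&commit=true"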

Hope this helps,

Dan

On Thu, Feb 5, 2015 at 9:30 PM, Dan Davis dansm...@gmail.com wrote:

 It looks like you are returning the transformed ID, along with some other
 fields, in the deltaQuery command.deltaQuery should only return the ID,
 without the stk_ prefix, and then deltaImportQuery should retrieve the
 transformed ID.   I'd suggest:

 entity ...
  deltaQuery=SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'
  deltaImportQuery=SELECT CONCAT('stk_',id) AS id, part_no, name,
 description FROM stock_items WHERE id='${dih.delta.id}'

 I'm not sure which RDBMS you are using, but you probably don't need to
 work around the column names at all.


 On Thu, Feb 5, 2015 at 5:18 PM, willbrindle m...@willbrindle.com wrote:

 Hi,

 I am very new to Solr but I have been playing around with it a bit and my
 imports are all working fine. However, now I wish to perform a delta
 import
 on my query and I'm just getting nothing.

 I have the entity:

  entity name=stock
   query=SELECT CONCAT('stk_',id) AS id,part_no,name,description
 FROM
 stock_items
   deltaQuery=SELECT CONCAT('stk_',id) AS
  id,part_no,name,description,updated_at FROM stock_items WHERE updated_at >
 '${dih.delta.last_index_time}'
   deltaImportQuery=SELECT CONCAT('stk_',id) AS id,id AS
 id2,part_no,name,description FROM stock_items WHERE id2='${dih.delta.id
 }'


 I am not too sure if ${dih.delta.id} is supposed to be id or id2 but I
 have
 tried both and neither work. My output is something along the lines of:

 {
   responseHeader: {
 status: 0,
 QTime: 0
   },
   initArgs: [
 defaults,
 [
   config,
   data-config.xml
 ]
   ],
   command: status,
   status: idle,
   importResponse: ,
   statusMessages: {
 Time Elapsed: 0:0:16.778,
 Total Requests made to DataSource: 2,
 Total Rows Fetched: 0,
 Total Documents Skipped: 0,
 Delta Dump started: 2015-02-05 16:17:54,
 Identifying Delta: 2015-02-05 16:17:54,
 Deltas Obtained: 2015-02-05 16:17:54,
 Building documents: 2015-02-05 16:17:54,
 Total Changed Documents: 0,
 Delta Import Failed: 2015-02-05 16:17:54
   },
   WARNING: This response format is experimental.  It is likely to
 change
 in the future.
 }

 My full import query is working fine.

 Thanks.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
Suresh and Meena,

I have solved this problem by taking the row number of a query, and adding its
modulo as another field called threadid. The base query is wrapped in a
query that selects a subset of the results for indexing.   The modulo on
the row number was intentional - you cannot rely on id columns to be well
distributed and you cannot rely on the number of rows to stay constant over
time.

To make it more concrete, I have a base DataImportHandler configuration
that looks something like what's below - your SQL may differ as we use
Oracle.

 entity name=medsite dataSource=oltp01_prod
rootEntity=true
query=SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%
transformer=TemplateTransformer
...

 /entity


To get it to be multi-threaded, I then copy it to 4 different configuration
files as follows:

echo "Medical Sites Configuration -> ${MEDSITES_CONF:=medical-sites-conf.xml}"
echo "Medical Sites Prototype -> ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
for tid in `seq 0 3`; do
   MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e s/%%d%%/$tid/`
   sed -e s/%%d%%/$tid/ $MEDSITES_CONF > $MEDSITES_OUT
done


Then, I have 4 requestHandlers in solrconfig.xml that point to each of
these files.   They are /import/medical-sites-0 through
/import/medical-sites-3.   Note that this wouldn't work with a single
Data Import Handler that was parameterized - a particular Data Import
Handler is either idle or busy, and can no longer be run with multiple
threads.   How this would work if the first entity weren't the root entity
is another question - you can usually structure it with the first SQL query
being the root entity if you are using SQL.   XML is another story, however.
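
To kick them all off, something like the following works (a sketch; the core
name is an assumption, and clean=false matters so that one handler's
full-import does not wipe out the documents loaded by the others):

for tid in `seq 0 3`; do
   curl "http://localhost:8983/solr/mycore/import/medical-sites-$tid?command=full-import&clean=false&commit=true" &
done
wait
curl "http://localhost:8983/solr/mycore/import/medical-sites-0?command=status"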

I did it this way because I wanted to stay with Solr out-of-the-box, since
this was an evaluation of what Data Import Handler could do.   If I
were doing this without some business requirement to evaluate whether Solr
out-of-the-box could do multithreaded database import, I'd probably write
a multi-threaded front-end that did the queries and transformations I
needed to do.   In this case, I was considering the best way to do all
our data imports from RDBMS, and Data Import Handler is the only good
solution that involves writing configuration, not code.   The distinction
is slight, I think.

Hope this helps,

Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Suresh,

 There are a few common workaround for such problem. But, I think that
 submitting more than maxIndexingThreads is not really productive. Also, I
 think that out-of-memory problem is caused not by indexing, but by opening
 searcher. Do you really need to open it? I don't think it's a good idea to
 search on the instance which cooks many T index at the same time. Are you
 sure you don't issue superfluous commit, and you've disabled auto-commit?

 let's nail down oom problem first, and then deal with indexing speedup. I
 like huge indices!

 On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh suresh.arumu...@emc.com
 wrote:

  We are also facing the same problem in loading 14 Billion documents into
  Solr 4.8.10.
 
  Dataimport is working in Single threaded, which is taking more than 3
  weeks. This is working fine without any issues but it takes months to
  complete the load.
 
  When we tried SolrJ with the below configuration in Multithreaded load,
  the Solr is taking more memory  at one point we will end up in out of
  memory as well.
 
  Batch Doc count  :  10 docs
  No of Threads  : 16/32
 
  Solr Memory Allocated : 200 GB
 
  The reason can be as below.
 
  Solr is taking the snapshot, whenever we open a SearchIndexer.
  Due to this more memory is getting consumed  solr is extremely
  slow while running 16 or more threads for loading.
 
  If anyone have already done the multithreaded data load into Solr in a
  quicker way, Can you please share the code or logic in using the SolrJ
 API?
 
  Thanks in advance.
 
  Regards,
  Suresh.A
 
  -Original Message-
  From: Dyer, James [mailto:james.d...@ingramcontent.com]
  Sent: Tuesday, February 03, 2015 1:58 PM
  To: solr-user@lucene.apache.org
  Subject: RE: Solr 4.9 Calling DIH concurrently
 
  DIH is single-threaded.  There was once a threaded option, but it was
  buggy and subsequently was removed.
 
  What I do is partition my data and run multiple dih request handlers at
  the same time.  It means redundant sections in solrconfig.xml and its not
  very elegant but it works.
 
  For instance, for a sql query, I add something like this: where mod(id,
 
 ${dataimporter.request.numPartitions})=${dataimporter.request.currentPartition}.
 
  I think, though, most users who want to make the most out of
  multithreading write their own program and use the solrj api to send the
  updates.
 
  James Dyer
  Ingram Content Group

Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
"Data Import Handler is the only good solution that involves writing
configuration, not code" - I also had a requirement not to look at
product-oriented enhancements to Solr, and there are many products I didn't
look at, or rejected, like django-haystack.   Perl, ruby, and python have
good handling of both databases and Solr, as does Java with JDBC and SolrJ.
  Pushing to Solr probably has more legs than Data Import Handler going
forward.
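
To make the push approach concrete, a minimal SolrJ sketch (Solr 4.x API; the
URL, core name, and the fetchBatch() helper are placeholders for your own
queries and transformations, not code from my project) might look like:

import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushIndexer {
    public static void main(String[] args) throws Exception {
        // ConcurrentUpdateSolrServer batches and sends updates on its own thread pool,
        // so the front-end only has to produce documents.
        ConcurrentUpdateSolrServer solr =
            new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4);
        List<SolrInputDocument> batch;
        while ((batch = fetchBatch()) != null) {   // placeholder: run your query and map rows to docs
            solr.add(batch);
        }
        solr.commit();     // one commit at the end; no auto-commit during the load
        solr.shutdown();
    }

    // Hypothetical helper - return null when there is nothing left to index.
    private static List<SolrInputDocument> fetchBatch() {
        return null;
    }
}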

On Wed, Feb 4, 2015 at 11:13 AM, Dan Davis dansm...@gmail.com wrote:

 Suresh and Meena,

 I have solved this problem by taking the row number of each record in a query,
 and adding its modulo as another field called threadid.   The base query is wrapped
 in a query that selects a subset of the results for indexing.   The modulo
 on the row number was intentional - you cannot rely on id columns to be
 well distributed, and you cannot rely on the number of rows to stay constant
 over time.

 To make it more concrete, I have a base DataImportHandler configuration
 that looks something like what's below - your SQL may differ as we use
 Oracle.

  <entity name="medsite" dataSource="oltp01_prod"
          rootEntity="true"
          query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
                 medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
          transformer="TemplateTransformer"
          ...>

  </entity>


 To get it to be multi-threaded, I then copy it to 4 different
 configuration files as follows:

 echo "Medical Sites Configuration - ${MEDSITES_CONF:=medical-sites-conf.xml}"
 echo "Medical Sites Prototype - ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
 for tid in `seq 0 3`; do
    MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e "s/%%d%%/$tid/"`
    sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
 done


 Then, I have 4 requestHandlers in solrconfig.xml that point to each of
 these files.They are /import/medical-sites-0 through
 /import/medical-sites-3.   Note that this wouldn't work with a single
 Data Import Handler that was parameterized - a particular Data Import
 Handler is either idle or busy, and so should not be run from multiple
 threads.   How this would work if the first entity weren't the root entity
 is another question - you can usually structure it with the first SQL query
 being the root entity if you are using SQL.   XML is another story, however.
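
 Kicking off the four handlers above concurrently is then just four HTTP
 requests; a rough Java sketch (host and core name are placeholders, and the
 handler paths are the ones configured above):

 import java.net.HttpURLConnection;
 import java.net.URL;

 public class StartImports {
     public static void main(String[] args) throws Exception {
         // Fire command=full-import at each per-partition handler.  DIH answers
         // immediately and runs the import in the background, so the four
         // partitions load in parallel.
         for (int tid = 0; tid <= 3; tid++) {
             URL url = new URL("http://localhost:8983/solr/collection1"
                     + "/import/medical-sites-" + tid + "?command=full-import");
             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
             System.out.println("/import/medical-sites-" + tid + " -> HTTP " + conn.getResponseCode());
             conn.disconnect();
         }
     }
 }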

 I did it this way because I wanted to stay with Solr out-of-the-box
 because it was an evaluation of what Data Import Handler could do.   If I
 were doing this without some business requirement to evaluate whether Solr
 out-of-the-box could do multithreaded database import, I'd probably write
 a multi-threaded front-end that did the queries and transformations I
 needed to do.   In this case, I was considering the best way to do all
 our data imports from RDBMS, and Data Import Handler is the only good
 solution that involves writing configuration, not code.   The distinction
 is slight, I think.

 Hope this helps,

 Dan Davis

 On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Suresh,

 There are a few common workarounds for such a problem. But, I think that
 submitting more than maxIndexingThreads is not really productive. Also, I
 think that the out-of-memory problem is caused not by indexing, but by opening
 a searcher. Do you really need to open it? I don't think it's a good idea to
 search on an instance which cooks a many-terabyte index at the same time. Are you
 sure you don't issue superfluous commits, and you've disabled auto-commit?

 let's nail down oom problem first, and then deal with indexing speedup. I
 like huge indices!

 On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh suresh.arumu...@emc.com
 
 wrote:

  We are also facing the same problem in loading 14 Billion documents into
  Solr 4.8.10.
 
   Dataimport is working single-threaded, which is taking more than 3
   weeks. This works fine without any issues, but it takes months to
   complete the load.
 
   When we tried SolrJ with the below configuration in a multithreaded load,
   Solr is taking more memory and at one point we will end up in out of
   memory as well.
 
  Batch Doc count  :  10 docs
  No of Threads  : 16/32
 
  Solr Memory Allocated : 200 GB
 
  The reason can be as below.
 
   Solr is taking the snapshot whenever we open a SearchIndexer.
   Due to this more memory is getting consumed and Solr is extremely
   slow while running 16 or more threads for loading.
 
  If anyone have already done the multithreaded data load into Solr in a
  quicker way, Can you please share the code or logic in using the SolrJ
 API?
 
  Thanks in advance.
 
  Regards,
  Suresh.A
 
  -Original Message-
  From: Dyer, James [mailto:james.d...@ingramcontent.com]
  Sent: Tuesday, February 03, 2015 1:58 PM
  To: solr-user@lucene.apache.org
  Subject: RE: Solr 4.9 Calling DIH concurrently
 
  DIH is single-threaded.  There was once a threaded option

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Dan Davis
Doesn't relevancy for that assume that the IDF and TF for user1 and user2
are not too different?   SolrCloud still doesn't use a distributed IDF,
correct?

On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum gilinac...@gmail.com wrote:

 Alright. So shard splitting and composite routing plays nicely together.
 Thank you Anshum.

 On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta ans...@anshumgupta.net
 wrote:

  In one line, shard splitting doesn't depend on the routing mechanism but
  just on the hash range, so you could have documents for the same
  prefix split up.
 
  Here's an overview of routing in SolrCloud:
  * Happens based on a hash value
  * The hash is calculated using the multiple parts of the routing key. In
  case of A!B, 16 bits are obtained from murmurhash(A) and the LSB 16 bits
 of
  the routing key are obtained from murmurhash(B). This sends the docs to
 the
  right shard.
  * When querying using A!, all shards whose hash ranges overlap the 16 bits
  obtained from murmurhash(A) are used.
 
  When you split a shard, its hash range is split
  from the middle (by default), and over multiple splits, docs for the same A!
  prefix might end up on different shards, but the request routing should
  take care of that.
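
  Concretely, the prefix is just part of the uniqueKey value; a small SolrJ
  sketch of indexing and prefix-scoped querying (URL and field names are
  assumptions, not from any particular setup):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CompositeIdExample {
      public static void main(String[] args) throws Exception {
          HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

          // "user42!" is the routing prefix; every id with that prefix hashes
          // into the same slice of the hash range.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "user42!event-0001");
          doc.addField("type_s", "event");
          solr.add(doc);
          solr.commit();

          // Restrict a query to the shard(s) covering that prefix.
          SolrQuery q = new SolrQuery("type_s:event");
          q.set("_route_", "user42!");
          System.out.println(solr.query(q).getResults().getNumFound());
          solr.shutdown();
      }
  }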
 
  You can read more about routing here:
  https://lucidworks.com/blog/solr-cloud-document-routing/
  http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/
 
  and shard splitting here:
  http://lucidworks.com/blog/shard-splitting-in-solrcloud/
 
 
  On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum gilinac...@gmail.com
 wrote:
 
   Hi, I'm also interested. When using the composite ID, the _route_
   information is not kept on the document itself, so to me it looks like
  it's
   not possible as the split API
   
  
 
 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
   
   doesn't have a relevant parameter to split correctly.
   Could report back once I try it in practice.
  
   On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose ianr...@fullstory.com
 wrote:
  
Howdy -
   
We are using composite IDs of the form user!event.  This ensures
  that
all events for a user are stored in the same shard.
   
I'm assuming from the description of how composite ID routing works,
  that
if you split a shard the split point of the hash range for that
 shard
   is
chosen to maintain the invariant that all documents that share a
  routing
prefix (before the !) will still map to the same (new) shard.  Is
  that
accurate?
   
A naive shard-split implementation (e.g. that chose the hash range
  split
point arbitrarily) could end up with child shards that split a
  routing
prefix.
   
Thanks,
Ian
   
  
 
 
 
  --
  Anshum Gupta
  http://about.me/anshumgupta
 



Re: role of the wiki and cwiki

2015-02-02 Thread Dan Davis
Hoss et. al,

I'm not intending on contributing documentation in any immediate sense (the
disclaimer), but I thank you all for the clarification.

It makes some sense to require a committer to review each suggested piece
of official documentation, but I wonder abstractly how a non-committer then
should contribute to the documentation.  I just did an evaluation of
several WCM systems, and it sounds almost like you need something more like
a WCM that supports some moderation workflow, rather than a wiki.

With current technology, possibilities include:

 * Make a comment within Confluence suggesting content or making a
clarification,
 * Create a blog post or MoinMoin edit with whatever content seems to be
needed,
 * Paste text and/or content into a JIRA ticket, or upload an attachment to
the JIRA ticket.

I think the JIRA ticket is the strongest, honestly, because it is true
moderation - nothing shows up until evaluated by a committer.

I also want to say that I value the very technical nature of the Solr
documentation, even as I welcome better organization.   Many products'
documentation is very much too abstracted, because it is written by a
technical writer not deeply familiar with either the technology or with
what users specifically want to do.   This is addressed by surfacing what
the users want to do, and then how-to specific documentation is written
that is still too vague on the technical details.   Sometimes a worked
example is very useful. I see a little, though not too much, of this
transition in the Data Import Handler documentation -
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
is more abstract, and moves too fast, relative to
http://wiki.apache.org/solr/DataImportHandler.   The ability to nest SQL
based entities is very key to understanding, and not covered in the former.
  One needs to see that an entity is not always a root entity.

So, I agree with the direction, but I hope the Solr Reference Guide can go
into more depth in some places, even as it continues to be better organized
if you are reading from scratch rather than starting with Solr In Action or
something like that.

Thanks again,

Dan


On Mon, Feb 2, 2015 at 11:57 AM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : Because they have different potential authors, the two systems now serve
 : different purposes.
 :
 : There are still some pages on the MoinMoin wiki that contain
 : documentation that should be in the reference guide, but isn't.
 :
 : The MoinMoin wiki is still useful, as a place where users can collect
 : information that is useful to others, but doesn't qualify as official
 : documentation, or perhaps simply hasn't been verified.  I believe this
 : means that a lot of information which has been migrated into the
 : reference guide will eventually be removed from MoinMoin.

 +1 ... it's just a matter of time/energy to clean things up...


 https://cwiki.apache.org/confluence/display/solr/Internal+-+Maintaining+Documentation#Internal-MaintainingDocumentation-WhatShouldandShouldNotbeIncludedinThisDocumentation


 FWIW: Emmanuel Stalling has started doing an audit of the wiki content
 vs the ref guide ... once more folks have a chance to review and dive
 in with edits it should be really helpful to cleaning all this up...

 https://wiki.apache.org/solr/WikiManualComparison



 -Hoss
 http://www.lucidworks.com/



Re: Calling custom request handler with data import

2015-01-30 Thread Dan Davis
The Data Import Handler isn't pushing data into the /update request
handler.   However, Data Import Handler can be extended with transformers.
  Two such transformers are the TemplateTransformer and the
ScriptTransformer.   It may be possible to get a script function to load
your custom Java code.   You could also just write a
StanfordNerTransformer.
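
A transformer of that sort is a small class; a rough sketch (the
extractEntities() call below is a stand-in for whatever Stanford NER
invocation you use - it is not a real API):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class StanfordNerTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object content = row.get("content");
        if (content != null) {
            // Whatever extractEntities() returns becomes new columns on the row,
            // which map to Solr fields like any other DIH column.
            List<String> people = extractEntities(content.toString(), "PERSON");
            row.put("person", people);
        }
        return row;
    }

    // Hypothetical stub - replace with a call into Stanford NER.
    private List<String> extractEntities(String text, String type) {
        return Collections.emptyList();
    }
}

It gets wired in with transformer="fully.qualified.StanfordNerTransformer" on
the entity, next to TemplateTransformer and friends.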

Hope this helps,

Dan

On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav vineet.yadav.i...@gmail.com
wrote:

 Hi,
 I am using the data import handler to import data from mysql, and I want to
 identify named entities from it. So I am using the following example (
 http://www.searchbox.com/named-entity-recognition-ner-in-solr/), where I am
 using stanford ner to identify named entities. I am using the following
 requesthandler

 requestHandler name=/dataimport
 class=org.apache.solr.handler.dataimport.DataImportHandler
 lst name=defaults
  str name=configdata-import.xml/str
  /lst
 /requestHandler

 for importing data from mysql and

 requestHandler name=/ner class=com.searchbox.ner.NerHandler /
   updateRequestProcessorChain name=mychain 
processor class=com.searchbox.ner.NerProcessorFactory 
  lst name=queryFields
str name=queryFieldcontent/str
  /lst
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
  /updateRequestProcessorChain
  requestHandler name=/update class=solr.UpdateRequestHandler
lst name=defaults
  str name=update.chainmychain/str
/lst
   /requestHandler

 for identifying named entities.   The NER request handler identifies named
 entities from the content field and stores the extracted entities in solr fields.

 The NER request handler was working when I was using nutch with solr. But when I
 am importing data from mysql, the ner request handler is not invoked. So
 entities are not stored in solr for imported documents. Can anybody tell me
 how to call a custom request handler in the data import handler.

 Otherwise if I can invoke ner request handler externally, so that it can
 index person, organization and location in solr for imported document. It
 is also fine. Any suggestion are welcome.

 Thanks
 Vineet Yadav



Re: Calling custom request handler with data import

2015-01-30 Thread Dan Davis
You know, another thing you can do is just write some Java/perl/whatever to
pull data out of your database and push it to Solr.    Not as convenient
for development perhaps, but it has more legs in the long run.   Data
Import Handler does not easily multi-thread.
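
For example, a bare-bones version of that approach (table, columns, and
connection details are made up for illustration) is just a JDBC loop feeding
SolrJ:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolr {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, content FROM articles")) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("content", rs.getString("content"));
                solr.add(doc);   // one request per doc; collect into a List and add in batches for speed
            }
        }
        solr.commit();
        solr.shutdown();
    }
}

Split the SELECT by mod(id, N) and run one such loop per thread, and you have
the multi-threading that Data Import Handler can't easily give you.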

On Sat, Jan 31, 2015 at 12:34 AM, Dan Davis dansm...@gmail.com wrote:

 The Data Import Handler isn't pushing data into the /update request
 handler.   However, Data Import Handler can be extended with transformers.
   Two such transformers are the TemplateTransformer and the
 ScriptTransformer.   It may be possible to get a script function to load
  your custom Java code.   You could also just write a
  StanfordNerTransformer.

 Hope this helps,

 Dan

 On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav vineet.yadav.i...@gmail.com
  wrote:

 Hi,
 I am using data import handler to import data from mysql, and I want to
 identify name entities from it. So I am using following example(
 http://www.searchbox.com/named-entity-recognition-ner-in-solr/). where I
 am
 using stanford ner to identify name entities. I am using following
 requesthandler

 requestHandler name=/dataimport
 class=org.apache.solr.handler.dataimport.DataImportHandler
 lst name=defaults
  str name=configdata-import.xml/str
  /lst
 /requestHandler

 for importing data from mysql and

 requestHandler name=/ner class=com.searchbox.ner.NerHandler /
   updateRequestProcessorChain name=mychain 
processor class=com.searchbox.ner.NerProcessorFactory 
  lst name=queryFields
str name=queryFieldcontent/str
  /lst
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
  /updateRequestProcessorChain
  requestHandler name=/update class=solr.UpdateRequestHandler
lst name=defaults
  str name=update.chainmychain/str
/lst
   /requestHandler

 for identifying name entities.NER request handler identifies name entities
 from content field, but store extracted entities in solr fields.

 NER request handler was working when I am using nutch with solr. But When
 I
 am importing data from mysql, ner request handler is not invoked. So
 entities are not stored in solr for imported documents. Can anybody tell
 me
 how to call custom request handler in data import handler.

 Otherwise if I can invoke ner request handler externally, so that it can
 index person, organization and location in solr for imported document. It
 is also fine. Any suggestion are welcome.

 Thanks
 Vineet Yadav





role of the wiki and cwiki

2015-01-30 Thread Dan Davis
I've been thinking of https://wiki.apache.org/solr/ as the Old Wiki and
https://cwiki.apache.org/confluence/display/solr as the New Wiki.

I guess that's the wrong way to think about it - Confluence is being used
for the Solr Reference Guide, and MoinMoin is being used as a wiki.

Is this the correct understanding?


Re: Cannot reindex to add a new field

2015-01-29 Thread Dan Davis
For this I prefer TemplateTransformer to RegexTransformer - it's not a
regex, just a pattern, so it should be more efficient to use
TemplateTransformer.   A script will also work, of course.

On Tue, Jan 27, 2015 at 5:54 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 On 27 January 2015 at 17:47, Carl Roberts carl.roberts.zap...@gmail.com
 wrote:
  field column=product sourceColName=vulnerable-software
  commonField=false regex=: replaceWith= /

 Yes, that works because the transformer copies it, not the
 EntityProcessor. So, no conflict on xpath.

 Regards,
Alex.

 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: Need help importing data

2015-01-26 Thread Dan Davis
Glad it worked out.

On Fri, Jan 23, 2015 at 9:50 PM, Carl Roberts carl.roberts.zap...@gmail.com
 wrote:

 NVM

 I figured this out.  The problem was this:  pk=link in
 rss-data-config.xml, but the unique id in schema.xml is not link - it is id.

 From rss-data-config.xml:

 entity name=cve-2002
 *pk=link*
 url=https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip;
 processor=XPathEntityProcessor
 forEach=/nvd/entry
 field column=id xpath=/nvd/entry/@id commonField=true /
 field column=cve xpath=/nvd/entry/cve-id
 commonField=true /
 field column=cwe xpath=/nvd/entry/cwe/@id
 commonField=true /
 !--
 field column=vulnerable-configuration
 xpath=/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name
 commonField=false /
 field column=vulnerable-software
 xpath=/nvd/entry/vulnerable-software-list/product commonField=false /
 field column=published xpath=/nvd/entry/published-datetime
 commonField=false /
 field column=modified xpath=/nvd/entry/last-modified-datetime
 commonField=false /
 field column=summary xpath=/nvd/entry/summary
 commonField=false /
 --
 /entity

 From schema.xml:

  <uniqueKey>id</uniqueKey>

  What really bothers me is that there were no errors output by Solr to
 indicate this type of misconfiguration error and all the messages that Solr
 gave indicated the import was successful.  This lack of appropriate error
 reporting is a pain, especially for someone learning Solr.

 Switching pk=link to pk=id solved the problem and I was then able to
 import the data.



 On 1/23/15, 9:39 PM, Carl Roberts wrote:

 Hi,

 I have set log4j logging to level DEBUG and I have also modified the code
 to see what is being imported and I can see the nextRow() records, and the
 import is successful, however I have no data. Can someone please help me
 figure this out?

 Here is the logging output:

 ow:  r1={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264,
 $forEach=/nvd/entry}}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r3={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 URL={url}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r1={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r3={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 URL={url}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r1={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r3={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 URL={url}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r1={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r3={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 URL={url}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r1={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 r3={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
 -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
 URL={url}
 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
 

Re: Need Help with custom ZIPURLDataSource class

2015-01-26 Thread Dan Davis
I have seen such errors by looking under Logging in the Solr Admin UI.
There is also the LogTransformer for Data Import Handler.

However, it is a design choice in Data Import Handler to skip fields not in
the schema.   I would suggest you always use Debug and Verbose to do the
first couple of documents through the GUI, and then look at the debugging
output with a fine toothed comb.

I'm not sure whether there's an option for it, but it would be nice if the
Data Import Handler could collect skipped fields into the status response.
  That would highlight your problem without forcing you to look in other
areas.


On Fri, Jan 23, 2015 at 9:51 PM, Carl Roberts carl.roberts.zap...@gmail.com
 wrote:

 NVM - I have this working.

  The problem was this:  pk=link in rss-data-config.xml, but the unique id
  in schema.xml is not link - it is id.

 From rss-data-config.xml:

 entity name=cve-2002
 *pk=link*
 url=https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.
 xml.zip
 processor=XPathEntityProcessor
 forEach=/nvd/entry
 field column=id xpath=/nvd/entry/@id commonField=true /
 field column=cve xpath=/nvd/entry/cve-id
 commonField=true /
 field column=cwe xpath=/nvd/entry/cwe/@id
 commonField=true /
 !--
 field column=vulnerable-configuration
 xpath=/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name
 commonField=false /
 field column=vulnerable-software
 xpath=/nvd/entry/vulnerable-software-list/product commonField=false /
 field column=published xpath=/nvd/entry/published-datetime
 commonField=false /
 field column=modified xpath=/nvd/entry/last-modified-datetime
 commonField=false /
 field column=summary xpath=/nvd/entry/summary
 commonField=false /
 --
 /entity

 From schema.xml:

  <uniqueKey>id</uniqueKey>

  What really bothers me is that there were no errors output by Solr to
 indicate this type of misconfiguration error and all the messages that Solr
 gave indicated the import was successful.  This lack of appropriate error
 reporting is a pain, especially for someone learning Solr.

 Switching pk=link to pk=id solved the problem and I was then able to
 import the data.

 On 1/23/15, 6:34 PM, Carl Roberts wrote:


 Hi,

 I created a custom ZIPURLDataSource class to unzip the content from an
 http URL for an XML ZIP file and it seems to be working (at least I have
 no errors), but no data is imported.

 Here is my configuration in rss-data-config.xml:

 dataConfig
 dataSource type=ZIPURLDataSource connectionTimeout=15000
 readTimeout=3/
 document
 entity name=cve-2002
 pk=link
 url=https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip;
 processor=XPathEntityProcessor
 forEach=/nvd/entry
 transformer=DateFormatTransformer
 field column=id xpath=/nvd/entry/@id commonField=true /
 field column=cve xpath=/nvd/entry/cve-id commonField=true /
 field column=cwe xpath=/nvd/entry/cwe/@id commonField=true /
 field column=vulnerable-configuration
 xpath=/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name
 commonField=false /
 field column=vulnerable-software
 xpath=/nvd/entry/vulnerable-software-list/product commonField=false
 /
 field column=published xpath=/nvd/entry/published-datetime
 commonField=false /
 field column=modified xpath=/nvd/entry/last-modified-datetime
 commonField=false /
 field column=summary xpath=/nvd/entry/summary commonField=false /
 /entity
 /document
 /dataConfig


 Attached is the ZIPURLDataSource.java file.

 It actually unzips and saves the raw XML to disk, which I have verified
 to be a valid XML file.  The file has one or more entries (here is an
 example):

 nvd xmlns:scap-core=http://scap.nist.gov/schema/scap-core/0.1;
 xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance;
 xmlns:patch=http://scap.nist.gov/schema/patch/0.1;
 xmlns:vuln=http://scap.nist.gov/schema/vulnerability/0.4;
 xmlns:cvss=http://scap.nist.gov/schema/cvss-v2/0.2;
 xmlns:cpe-lang=http://cpe.mitre.org/language/2.0;
 xmlns=http://scap.nist.gov/schema/feed/vulnerability/2.0;
 pub_date=2015-01-10T05:37:05
 xsi:schemaLocation=http://scap.nist.gov/schema/patch/0.1
 http://nvd.nist.gov/schema/patch_0.1.xsd
 http://scap.nist.gov/schema/scap-core/0.1
 http://nvd.nist.gov/schema/scap-core_0.1.xsd
 http://scap.nist.gov/schema/feed/vulnerability/2.0
 http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd; nvd_xml_version=2.0
 entry id=CVE-1999-0001
 vuln:vulnerable-configuration id=http://nvd.nist.gov/;
 cpe-lang:logical-test operator=OR negate=false
 cpe-lang:fact-ref name=cpe:/o:bsdi:bsd_os:3.1/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:1.0/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:1.1/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:1.1.5.1/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:1.2/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:2.0/
 cpe-lang:fact-ref name=cpe:/o:freebsd:freebsd:2.0.5/
 cpe-lang:fact-ref 

Re: Indexed epoch time in Solr

2015-01-26 Thread Dan Davis
I think copying to a new Solr date field is your best bet, because then you
have the flexibility to do date range facets in the future.

If you can re-index, and are using Data Import Handler, Jim Musil's
suggestion is just right.

If you can re-index, and are not using Data Import Handler:

   - This seems a job for an UpdateRequestProcessor
   https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors,
   but I don't see one for this.
   - This seems to be a good candidate for a standard, core
   UpdateRequestProcessor, but I haven't checked Jira for a bug report.

If the scale is too large to re-index, then there is surely still a way,
but I'm not sure I can advise you on the best one.  I'm not a Solr expert
yet... just someone on the list with an IR background.
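
For the UpdateRequestProcessor route, the skeleton would look roughly like
this (the field names epoch_l and date_dt are made up; you would hang the
factory on an update.chain in solrconfig.xml):

import java.io.IOException;
import java.util.Date;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class EpochToDateProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object epoch = doc.getFieldValue("epoch_l");   // unix time in seconds (assumed)
                if (epoch != null) {
                    long seconds = Long.parseLong(epoch.toString());
                    doc.setField("date_dt", new Date(seconds * 1000L));   // copy into a Solr date field
                }
                super.processAdd(cmd);   // hand the document to the rest of the chain
            }
        };
    }
}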

On Mon, Jan 26, 2015 at 12:35 AM, Ahmed Adel ahmed.a...@badrit.com wrote:

 Hi All,

 Is there a way to convert unix time field that is already indexed to
 ISO-8601 format in query response? If this is not possible on the query
 level, what is the best way to copy this field to a new Solr standard date
 field.

 Thanks,

 --
 *Ahmed Adel*
 http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F



Re: Solr admin Url issues

2015-01-26 Thread Dan Davis
Is Jetty actually running on port 80?    Do you have an Apache2 reverse proxy
in front?

On Mon, Jan 26, 2015 at 11:02 PM, Summer Shire shiresum...@gmail.com
wrote:

 Hi All,

 Running solr (4.7.2) locally and hitting the admin page like this works
 just fine: http://localhost:8983/solr/#

 But on my deployment server my path is
 http://example.org/jetty/MyApp/1/solr/#
 Or http://example.org/jetty/MyApp/1/solr/admin/cores or
 http://example.org/jetty/MyApp/1/solr/main/admin/

 the above request in a browser loads the admin page half way and then
 spawns another request at
 http://example.org/solr/admin/cores ....

 how can I maintain my other params such as jetty/MyApp/1/

 btw http://example.org/jetty/MyApp/1/solr/main/select?q=*:* or any other
 requesthandlers work just fine.

 What is going on here ? any idea ?

 thanks,
 Summer


Re: How to implement Auto complete, suggestion client side

2015-01-26 Thread Dan Davis
Cannot get any easier than jquery-ui's autocomplete widget -
http://jqueryui.com/autocomplete/

Basically, you set some classes and implement a javascript that calls the
server to get the autocomplete data.   I never would expose Solr to
browsers, so I would have the AJAX call go to a php script (or
function/method if you are using a web framework such as CakePHP or
Symfony).

Then, on the server, you make a request to Solr /suggest or /spell with
wt=json, and then you reformulate this into a simple JSON response that is
a simple array of options.

You can do this in stages:

   - Constant suggestions - you change your html and implement Javascript
   that shows constant suggestions after for instance 2 seconds.
   - Constant suggestions from the server - you change your JavaScript to
   call the server, and have the server return a constant list.
   - Dynamic suggestions from the server - you implement the server-side to
   query Solr and turn the return from /suggest or /spell into a JSON array.
   - Tuning, tuning, tuning - you work hard on tuning it so that you get
   high quality suggestions for a wide variety of inputs.

Note that the autocomplete I've described for you is basically the simplest
thing possible, as you suggest you are new to it.   It is not based on data
mining of query and click-through logs, which is a very common pattern
these days.   There is no bolding of the portion of the words that are new.
  It is just a basic autocomplete widget with a delay.
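
For the server-side step - the request to /suggest with wt=json - the shape of
it is below in Java just to show the flow; in PHP it is the same few lines
with curl and json_decode. The URL, core, and handler name are assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SuggestProxy {
    // Fetch the raw wt=json output from the /suggest handler for a user prefix;
    // the caller then reshapes it into a plain array before returning it to the browser.
    public static String fetchSuggestJson(String prefix) throws Exception {
        String q = URLEncoder.encode(prefix, "UTF-8");
        URL url = new URL("http://localhost:8983/solr/collection1/suggest?wt=json&q=" + q);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        conn.disconnect();
        return body.toString();
    }
}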

On Mon, Jan 26, 2015 at 5:11 PM, Olivier Austina olivier.aust...@gmail.com
wrote:

 Hi All,

 I would say I am new to web technology.

 I would like to implement auto complete/suggestion in the user search box
 as the user type in the search box (like Google for example). I am using
 Solr as database. Basically I am  familiar with Solr and I can formulate
 suggestion queries.

 But now I don't know how to implement suggestion in the User Interface.
 Which technologies should I need. The website is in PHP. Any suggestions,
 examples, basic tutorial is welcome. Thank you.



 Regards
 Olivier



Re: [MASSMAIL]Weighting of prominent text in HTML

2015-01-26 Thread Dan Davis
Helps lots.   Thanks, Jorge Luis.   Good point about different fields -
I'll just put the h1 and h2 (however deep I want to go) into fields, and we
can sort out weighting and whether we want it later with edismax.   The
blogs on adding plugins for that sort of thing look straightforward.

On Mon, Jan 26, 2015 at 12:47 AM, Jorge Luis Betancourt González 
jlbetanco...@uci.cu wrote:

 Hi Dan:

 Agreed, this question is more Nutch related than Solr ;)

  Nutch doesn't send any data into the /update/extract request handler; all the
  text and metadata extraction happens on the Nutch side rather than relying on
  the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the
  same technology as the ExtractingRequestHandler provided by Solr, so there shouldn't
  be any great difference.

  By default Nutch doesn't boost anything, as it is Solr's job to boost the
  different content in the different fields, which is what happens when you
  do a query against Solr. Nutch calculates the LinkRank, which is a variation
 of the famous PageRank (or the OPIC score, which is another scoring
 algorithm implemented in Nutch, which I believe is the default in Nutch
 2.x). What you can do is use the headings and map the heading tags into
 different fields and then apply different boosts to each field.

 The general idea with Nutch is to make pieces of the web page and store
 each piece in a different field in Solr, then you can tweak your relevance
  function using the values you see fit, so you don't need to write any plugin
 to accomplish this (at least for the h1, h2, etc. example you provided, if
 you want to extract other parts of the webpage you'll need to write your
 own plugin to do so).

 Nutch is highly customizable, you can write a plugin for almost any piece
 of logic, from parsers to indexers, passing from URL filters, scoring
 algorithms, protocols and a long long list, usually the plugins are not so
 difficult to write, but the problem comes to know which extension point you
 need to use, this comes with experience and taking a good dive in the
 source code.

 Hope this helps,

 - Original Message -
 From: Dan Davis dansm...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Monday, January 26, 2015 12:08:13 AM
 Subject: [MASSMAIL]Weighting of prominent text in HTML

 By examining solr.log, I can see that Nutch is using the /update request
 handler rather than /update/extract.   So, this may be a more appropriate
 question for the nutch mailing list.   OTOH, y'all know the answer off the
 top of your head.

 Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
 normal paragraph?Can this weighting be tuned without writing a plugin?
Is writing a plugin often needed because of the flexibility that is
 needed in practice?

 I wanted to call this post *Anatomy of a small scale search engine*, but
 lacked the nerve ;)

 Thanks, all and many,

 Dan Davis, Systems/Applications Architect
 National Library of Medicine


 ---
 XII Aniversario de la creación de la Universidad de las Ciencias
 Informáticas. 12 años de historia junto a Fidel. 12 de diciembre de 2014.




Weighting of prominent text in HTML

2015-01-25 Thread Dan Davis
By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the nutch mailing list.   OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph?Can this weighting be tuned without writing a plugin?
   Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine


Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
@Erick,

Problem space is not constant indexing.   I thought SolrCloud replicas were
replication, and you imply parallel indexing.  Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

 @Shawn: Cool table, thanks!

 @Dan:
 Just to throw a different spin on it, if you migrate to SolrCloud, then
 this question becomes moot as the raw documents are sent to each of the
 replicas so you very rarely have to copy the full index. Kind of a tradeoff
 between constant load because you're sending the raw documents around
 whenever you index and peak usage when the index replicates.

 There are a bunch of other reasons to go to SolrCloud, but you know your
 problem space best.

 FWIW,
 Erick

 On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org
 javascript:; wrote:

  On 1/24/2015 10:56 PM, Dan Davis wrote:
   When I polled the various projects already using Solr at my
  organization, I
   was greatly surprised that none of them were using Solr replication,
   because they had talked about replicating the data.
  
   But we are not Pinterest, and do not expect to be taking in changes one
   post at a time (at least the engineers don't - just wait until its used
  for
   a Crud app that wants full-text search on a description field!).
  Still,
   rsync can be very, very fast with the right options (-W for gigabit
   ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
  over
   GigE previously.
  
   Does anyone have any numbers for how fast Solr replication goes, and
 what
   to do to tune it?
  
   I'm not enthusiastic to give-up recently tested cluster stability for a
   home grown mess, but I am interested in numbers that are out there.
 
  Numbers are included on the Solr replication wiki page, both in graph
  and numeric form.  Gathering these numbers must have been pretty easy --
  before the HTTP replication made it into Solr, Solr used to contain an
  rsync-based implementation.
 
  http://wiki.apache.org/solr/SolrReplication#Performance_numbers
 
  Other data on that wiki page discusses the replication config.  There's
  not a lot to tune.
 
  I run a redundant non-SolrCloud index myself through a different method
  -- my indexing program indexes each index copy completely independently.
   There is no replication.  This separation allows me to upgrade any
  component, or change any part of solrconfig or schema, on either copy of
  the index without affecting the other copy at all.  With replication, if
  something is changed on the master or the slave, you might find that the
  slave no longer works, because it will be handling an index created by
  different software or a different config.
 
  Thanks,
  Shawn
 
 



Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
Thanks!

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

 @Shawn: Cool table, thanks!

 @Dan:
 Just to throw a different spin on it, if you migrate to SolrCloud, then
 this question becomes moot as the raw documents are sent to each of the
 replicas so you very rarely have to copy the full index. Kind of a tradeoff
 between constant load because you're sending the raw documents around
 whenever you index and peak usage when the index replicates.

 There are a bunch of other reasons to go to SolrCloud, but you know your
 problem space best.

 FWIW,
 Erick

 On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org
 javascript:; wrote:

  On 1/24/2015 10:56 PM, Dan Davis wrote:
   When I polled the various projects already using Solr at my
  organization, I
   was greatly surprised that none of them were using Solr replication,
   because they had talked about replicating the data.
  
   But we are not Pinterest, and do not expect to be taking in changes one
   post at a time (at least the engineers don't - just wait until its used
  for
   a Crud app that wants full-text search on a description field!).
  Still,
   rsync can be very, very fast with the right options (-W for gigabit
   ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
  over
   GigE previously.
  
   Does anyone have any numbers for how fast Solr replication goes, and
 what
   to do to tune it?
  
   I'm not enthusiastic to give-up recently tested cluster stability for a
   home grown mess, but I am interested in numbers that are out there.
 
  Numbers are included on the Solr replication wiki page, both in graph
  and numeric form.  Gathering these numbers must have been pretty easy --
  before the HTTP replication made it into Solr, Solr used to contain an
  rsync-based implementation.
 
  http://wiki.apache.org/solr/SolrReplication#Performance_numbers
 
  Other data on that wiki page discusses the replication config.  There's
  not a lot to tune.
 
  I run a redundant non-SolrCloud index myself through a different method
  -- my indexing program indexes each index copy completely independently.
   There is no replication.  This separation allows me to upgrade any
  component, or change any part of solrconfig or schema, on either copy of
  the index without affecting the other copy at all.  With replication, if
  something is changed on the master or the slave, you might find that the
  slave no longer works, because it will be handling an index created by
  different software or a different config.
 
  Thanks,
  Shawn
 
 



solr replication vs. rsync

2015-01-24 Thread Dan Davis
When I polled the various projects already using Solr at my organization, I
was greatly surprised that none of them were using Solr replication,
because they had talked about replicating the data.

But we are not Pinterest, and do not expect to be taking in changes one
post at a time (at least the engineers don't - just wait until it's used for
a CRUD app that wants full-text search on a description field!).    Still,
rsync can be very, very fast with the right options (-W for gigabit
ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s over
GigE previously.

Does anyone have any numbers for how fast Solr replication goes, and what
to do to tune it?

I'm not enthusiastic to give-up recently tested cluster stability for a
home grown mess, but I am interested in numbers that are out there.


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-15 Thread Dan Davis
Why re-write all the document conversion in Java ;)  Tika is very slow.   A 5
GB PDF is very big.

If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
mode.   The HTML mode captures some meta-data that would otherwise be lost.


If you need to go faster still, you can also write some stuff linked
directly against the poppler library.

Before you jump down my throat about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
setjmp/longjmp.   But fast...
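
For what it's worth, the pdftotext call I mean is roughly the following,
wrapped in Java just to show it in a pipeline; the flag names are from
poppler's pdftotext as I remember them, so check your man page:

import java.io.File;

public class PdfToHtml {
    // Run poppler's pdftotext in HTML/UTF-8 mode so basic metadata survives.
    public static int convert(File pdf, File htmlOut) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "pdftotext",
                "-htmlmeta",      // wrap the text in simple HTML including meta information
                "-enc", "UTF-8",  // force UTF-8 output
                pdf.getAbsolutePath(),
                htmlOut.getAbsolutePath());
        pb.inheritIO();           // let pdftotext's own messages show through
        Process p = pb.start();
        return p.waitFor();       // 0 on success
    }
}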



On Thu, Jan 15, 2015 at 1:54 PM, ganesh.ya...@sungard.com wrote:

 Siegfried and Michael Thank you for your replies and help.

 -Original Message-
 From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
 Sent: Thursday, January 15, 2015 3:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: OutOfMemoryError for PDF document upload into Solr

 Hi Ganesh,

 you can increase the heap size but parsing a 4 GB PDF document will very
 likely consume A LOT OF memory - I think you need to check if that large
 PDF can be parsed at all :-)

 Cheers,

 Siegfried Goeschl

 On 14.01.15 18:04, Michael Della Bitta wrote:
  Yep, you'll have to increase the heap size for your Tomcat container.
 
  http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
  -heap-size-correctly
 
  Michael Della Bitta
 
  Senior Software Engineer
 
  o: +1 646 532 3062
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  https://plus.google.com/u/0/b/112002776285509593336/11200277628550959
  3336/posts
  w: appinions.com http://www.appinions.com/
 
  On Wed, Jan 14, 2015 at 12:00 PM, ganesh.ya...@sungard.com wrote:
 
  Hello,
 
  Can someone pass on the hints to get around following error? Is there
  any Heap Size parameter I can set in Tomcat or in Solr webApp that
  gets deployed in Solr?
 
  I am running Solr webapp inside Tomcat on my local machine which has
  RAM of 12 GB. I have PDF document which is 4 GB max in size that
  needs to be loaded into Solr
 
 
 
 
  Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
   at java.util.AbstractCollection.toArray(Unknown Source)
   at java.util.ArrayList.<init>(Unknown Source)
   at
  org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
   at
 org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
   at
 org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
   at
 org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
   at
 org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
   at
 org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
   at
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at
  org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at
 
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
   at
 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at
 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
   at
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
   at
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
   at
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
   at
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
   at
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
   at
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
   at
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
   at
 
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
   at
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
   at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
   at
 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
   at
 
 

Improved suggester question

2015-01-13 Thread Dan Davis
The suggester is not working for me with Solr 4.10.2

Can anyone shed light over why I might be getting the exception below when
I build the dictionary?

<response>
<lst name="responseHeader">
<int name="status">500</int>
<int name="QTime">26</int>
</lst>
<lst name="error">
<str name="msg">len must be &lt;= 32767; got 35680</str>
<str name="trace">
java.lang.IllegalArgumentException: len must be <= 32767; got 35680 at
org.apache.lucene.util.OfflineSorter$ByteSequencesWriter.write(OfflineSorter.java:479)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:493)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190) at
org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:160)
at
org.apache.solr.handler.component.SuggestComponent.prepare(SuggestComponent.java:165)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200) at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
</str>
<int name="code">500</int>
</lst>
</response>

Thank you.

I've configured my suggester as follows:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">text</str>
    <str name="weightField">medsite_id</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <str name="buildOnCommit">true</str>
    <str name="threshold">0.1</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">on</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>


Re: Logging in Solr's DataImportHandler

2015-01-13 Thread Dan Davis
Mikhail,

Thanks - it works now.   The script transformer was really not needed, a
template transformer is clearer, and the log transformer is now working.

On Mon, Dec 8, 2014 at 1:56 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Hello Dan,

 Usually it works well. Can you describe how you run it particularly, eg
 what you download exactly and what's the command line ?

 On Fri, Dec 5, 2014 at 11:37 PM, Dan Davis dansm...@gmail.com wrote:

 I have a script transformer and a log transformer, and I'm not seeing the
 log messages, at least not where I expect.
  Is there any way I can simply log a custom message from within my script?
  Can the script easily interact with its container's logger?




 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Suggester questions

2015-01-13 Thread Dan Davis
I am having some trouble getting the suggester to work.   The spell
requestHandler is working, but I didn't like the results I was getting from
the word breaking dictionary and turned them off.
So some basic questions:

   - How can I check on the status of a dictionary?
   - How can I see what is in that dictionary?
   - How do I actually manually rebuild the dictionary - all attempts to
   set spellcheck.build=on or suggest.build=on have led to nearly instant
   results (0 suggestions for the latter), indicating something is wrong.


Thanks,

Daniel Davis


Re: Best way to implement Spotlight of certain results

2015-01-13 Thread Dan Davis
Maybe I can use grouping, but my understanding of the feature is not up to
figuring that out :)

I tried something like

http://localhost:8983/solr/collection/select?q=childhood+cancer&group=on&group.query=childhood+cancer
Because the group.limit=1, I get a single result, and no other results.
If I add group.field=title, then I get each result, in a group of 1
member...

Erick's re-ranking I do understand - I can re-rank the top-N to make sure
the spotlighted result is always first, avoiding the potential problem of
having to overweight the title field.   In practice, I may not ever need
to use the reranking, but it's there if I need it.   This is enough,
because it gives me talking points.
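
For the record, the re-ranking variant I have in mind looks roughly like this
from SolrJ (the title_exact field and the weights are placeholders):

import org.apache.solr.client.solrj.SolrQuery;

public class SpotlightQuery {
    // Re-rank the top 100 edismax results so an exact title match floats to the top.
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.add("rq", "{!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=1000}");
        q.add("rqq", "title_exact:\"" + userQuery.replace("\"", "\\\"") + "\"");
        return q;
    }
}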


On Fri, Jan 9, 2015 at 3:05 PM, Michał B. . m.bienkow...@gmail.com wrote:

 Maybe I understand you badly, but I think that you could use grouping to
 achieve such an effect. If you could prepare two group queries, one with an exact
 match and the other, let's say, default, then you will be able to extract
 matches from the grouping results, i.e. (using the default solr example collection)


 http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10

 this query will return two groups, one with the exact match, the second with
 the rest of the standard results.

 Regards,
 Michal


 2015-01-09 20:44 GMT+01:00 Erick Erickson erickerick...@gmail.com:

  Hmm, I wonder if the RerankingQueryParser might help here?
  See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
 
  Best,
  Erick
 
  On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis dansm...@gmail.com wrote:
   I have a requirement to spotlight certain results if the query text
  exactly
   matches the title or see reference (indexed by me as alttitle_t).
   What that means is that these matching results are shown above the
   top-10/20 list with different CSS and fields.   Its like feeling lucky
 on
   google :)
  
   I have considered three ways of implementing this:
  
  1. Assume that edismax qf/pf will boost these results to be first
 when
  there is an exact match on these important fields.   The downside
  then is
  that my relevancy is constrained and I must maintain my
 configuration
  with
  title and alttitle_t as top search fields (see XML snippet below).
  I may
  have to overweight them to achieve the always first criteria.
   Another
  less major downside is that I must always return the spotlight
 summary
  field (for display) and the image to display on each search.   These
  could
  be got from a database by the id, however, it is convenient to get
  them
  from Solr.
  2. Issue two searches for every user search, and use a second set of
  parameters (change the search type and fields to search only by
 exact
  matching a specific string field spottitle_s).   The search for the
  spotlight can then have its own configuration.   The downside here
 is
  that
  I am using Django and pysolr for the front-end, and pysolr is both
  synchronous and tied to the requestHandler named select.
   Convention.
  Of course, running in parallel is not a fix-all - running a search
  takes
  some time, even if run in parallel.
  3. Automate the population of elevate.xml so that all these 959
  queries
  are here.   This is probably best, but forces me to restart/reload
  when
  there are changes to this components.   The elevation can be done
  through a
  query.
  
   What I'd love to do is to configure the select requestHandler to run
  both
   searches and return me both sets of results.   Is there anyway to do
  that -
   apply the same q= parameter to two configured way to run a search?
   Something like sub queries?
  
   I suspect that approach 1 will get me through my demo and a brief
   evaluation period, but that either approach 2 or 3 will be the winner.
  
   Here's a snippet from my current qf/pf configuration:
  <str name="qf">
    title^100
    alttitle_t^100
    ...
    text
  </str>
  <str name="pf">
    title^1000
    alttitle_t^1000
    ...
    text^10
  </str>
  
   Thanks,
  
   Dan Davis
 



 --
 Michał Bieńkowski



Re: Occasionally getting error in solr suggester component.

2015-01-13 Thread Dan Davis
Related question -

I see mention of needing to rebuild the spellcheck/suggest dictionary after
solr core reload.   I see spellcheckIndexDir in both the old wiki entry and
the solr reference guide
https://cwiki.apache.org/confluence/display/solr/Spell+Checking.  If this
parameter is provided, it sounds like the index is stored on the filesystem
and need not be rebuilt each time the core is reloaded.

Is this a correct understanding?


On Tue, Jan 13, 2015 at 2:17 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 I think you are probably getting bitten by one of the issues addressed in
 LUCENE-5889

 I would recommend against using buildOnCommit=true - with a large index
 this can be a performance-killer.  Instead, build the index yourself using
 the Solr spellchecker support (spellcheck.build=true)
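
 For example, with the SuggestComponent configuration quoted below, the
 one-time build is just a request with suggest.build=true (the
 spellcheck-based handlers use spellcheck.build=true instead); the URL and
 core name here are placeholders:

 import java.net.HttpURLConnection;
 import java.net.URL;

 public class BuildSuggestIndex {
     public static void main(String[] args) throws Exception {
         // Trigger a one-time dictionary build instead of rebuilding on every commit.
         URL url = new URL("http://localhost:8983/solr/collection1/suggest"
                 + "?suggest.build=true&suggest.dictionary=haSuggester");
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         System.out.println("build request returned HTTP " + conn.getResponseCode());
         conn.disconnect();
     }
 }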

 -Mike


 On 01/13/2015 10:41 AM, Dhanesh Radhakrishnan wrote:

 Hi all,

 I am experiencing a problem in Solr SuggestComponent
 Occasionally solr suggester component throws an  error like

 Solr failed:
 {responseHeader:{status:500,QTime:1},error:{msg:suggester was
 not built,trace:java.lang.IllegalStateException: suggester was not
 built\n\tat
 org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
 lookup(AnalyzingInfixSuggester.java:368)\n\tat
 org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
 lookup(AnalyzingInfixSuggester.java:342)\n\tat
 org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)\n\tat
 org.apache.solr.spelling.suggest.SolrSuggester.
 getSuggestions(SolrSuggester.java:199)\n\tat
 org.apache.solr.handler.component.SuggestComponent.
 process(SuggestComponent.java:234)\n\tat
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(
 SearchHandler.java:218)\n\tat
 org.apache.solr.handler.RequestHandlerBase.handleRequest(
 RequestHandlerBase.java:135)\n\tat
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.
 handleRequest(RequestHandlers.java:246)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.execute(
 SolrDispatchFilter.java:777)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:418)\n\tat
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(
 SolrDispatchFilter.java:207)\n\tat
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
 ApplicationFilterChain.java:243)\n\tat
 org.apache.catalina.core.ApplicationFilterChain.doFilter(
 ApplicationFilterChain.java:210)\n\tat
 org.apache.catalina.core.StandardWrapperValve.invoke(
 StandardWrapperValve.java:225)\n\tat
 org.apache.catalina.core.StandardContextValve.invoke(
 StandardContextValve.java:123)\n\tat
 org.apache.catalina.core.StandardHostValve.invoke(
 StandardHostValve.java:168)\n\tat
 org.apache.catalina.valves.ErrorReportValve.invoke(
 ErrorReportValve.java:98)\n\tat
 org.apache.catalina.valves.AccessLogValve.invoke(
 AccessLogValve.java:927)\n\tat
 org.apache.catalina.valves.RemoteIpValve.invoke(
 RemoteIpValve.java:680)\n\tat
 org.apache.catalina.core.StandardEngineValve.invoke(
 StandardEngineValve.java:118)\n\tat
 org.apache.catalina.connector.CoyoteAdapter.service(
 CoyoteAdapter.java:407)\n\tat
 org.apache.coyote.http11.AbstractHttp11Processor.process(
 AbstractHttp11Processor.java:1002)\n\tat
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.
 process(AbstractProtocol.java:579)\n\tat
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.
 run(JIoEndpoint.java:312)\n\tat
 java.util.concurrent.ThreadPoolExecutor.runWorker(
 ThreadPoolExecutor.java:1145)\n\tat
 java.util.concurrent.ThreadPoolExecutor$Worker.run(
 ThreadPoolExecutor.java:615)\n\tat
 java.lang.Thread.run(Thread.java:745)\n,code:500}}

 This is not happening frequently, but when indexing and the suggester
 component are working together this error will occur.




 In solr config

 <searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
     <str name="name">haSuggester</str>
     <str name="lookupImpl">AnalyzingInfixLookupFactory</str>  <!--
 org.apache.solr.spelling.suggest.fst -->
     <str name="suggestAnalyzerFieldType">textSpell</str>
     <str name="dictionaryImpl">DocumentDictionaryFactory</str>   <!--
 org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
     <str name="field">name</str>
     <str name="weightField">packageWeight</str>
     <str name="buildOnCommit">true</str>
   </lst>
 </searchComponent>

 <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
   <lst name="defaults">
     <str name="suggest">true</str>
     <str name="suggest.count">10</str>
   </lst>
   <arr name="components">
     <str>suggest</str>
   </arr>
 </requestHandler>

 Can any one suggest where to look to figure out this error and why these
 errors are occurring?



 Thanks,
 dhanesh s.r




 --





Best way to implement Spotlight of certain results

2015-01-09 Thread Dan Davis
I have a requirement to spotlight certain results if the query text exactly
matches the title or see reference (indexed by me as alttitle_t).
What that means is that these matching results are shown above the
top-10/20 list with different CSS and fields.   It's like feeling lucky on
Google :)

I have considered three ways of implementing this:

   1. Assume that edismax qf/pf will boost these results to be first when
   there is an exact match on these important fields.   The downside then is
   that my relevancy is constrained and I must maintain my configuration with
   title and alttitle_t as top search fields (see XML snippet below).I may
   have to overweight them to achieve the always first criteria.   Another
   less major downside is that I must always return the spotlight summary
   field (for display) and the image to display on each search.   These could
   be got from a database by the id, however, it is convenient to get them
   from Solr.
   2. Issue two searches for every user search, and use a second set of
   parameters (change the search type and fields to search only by exact
   matching a specific string field spottitle_s).   The search for the
   spotlight can then have its own configuration.   The downside here is that
   I am using Django and pysolr for the front-end, and pysolr is both
   synchronous and tied to the requestHandler named select.   Convention.
   Of course, running in parallel is not a fix-all - running a search takes
   some time, even if run in parallel.
   3. Automate the population of elevate.xml so that all these 959 queries
   are here.   This is probably best, but forces me to restart/reload when
   there are changes to this component.   The elevation can be done through a
   query.

What I'd love to do is to configure the select requestHandler to run both
searches and return me both sets of results.   Is there any way to do that -
apply the same q= parameter to two configured ways to run a search?
Something like sub queries?
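(A rough sketch of option 2, assuming a second requestHandler - here called
/spotlight - configured to search only spottitle_s; since pysolr is tied to
/select, the sketch just uses the Python requests library and a thread pool,
which only hides part of the extra latency as noted above.  Field names in fl
are placeholders.)

    import requests
    from concurrent.futures import ThreadPoolExecutor

    SOLR = "http://localhost:8983/solr/collection1"   # core name is an assumption

    def run(handler, q, **extra):
        params = {"q": q, "wt": "json"}
        params.update(extra)
        return requests.get(SOLR + handler, params=params).json()

    def search_with_spotlight(q):
        with ThreadPoolExecutor(max_workers=2) as pool:
            main = pool.submit(run, "/select", q)
            # /spotlight is assumed to have qf=spottitle_s and its own defaults
            spot = pool.submit(run, "/spotlight", q,
                               fl="id,title,spotlight_summary,image", rows=1)
        return main.result(), spot.result()

    results, spotlight = search_with_spotlight("heart attack")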

I suspect that approach 1 will get me through my demo and a brief
evaluation period, but that either approach 2 or 3 will be the winner.
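(And a sketch of option 3 - generating elevate.xml from the 959 query/id
pairs, say pulled from the database; the file still has to land in conf/ and
the core reloaded, which is the restart/reload downside above.  The ids here
are made up.)

    from xml.etree import ElementTree as ET

    def write_elevate(pairs, path="elevate.xml"):
        """pairs: iterable of (query text, document id) for the spotlight titles."""
        root = ET.Element("elevate")
        for query_text, doc_id in pairs:
            q = ET.SubElement(root, "query", text=query_text)
            ET.SubElement(q, "doc", id=str(doc_id))
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

    write_elevate([("Heart Attack", "TOPIC-1024"), ("Diabetes", "TOPIC-2048")])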

Here's a snippet from my current qf/pf configuration:
  <str name="qf">
    title^100
    alttitle_t^100
    ...
    text
  </str>
  <str name="pf">
    title^1000
    alttitle_t^1000
    ...
    text^10
  </str>

Thanks,

Dan Davis


Re: Spellchecker delivers far too few suggestions

2014-12-17 Thread Dan Davis
What about the frequency comparison - I haven't used the spellchecker
heavily, but it seems that if bnak is in the database, but bank is much
more frequent, then bank should be a suggestion anyway...
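(If it helps to test that theory, here is a small sketch that hits the
spellcheck handler directly with onlyMorePopular toggled, so you can see
whether frequency is what suppresses the suggestion; the handler name
standardWithSpell comes from the config quoted below, and the core name is an
assumption.)

    import requests

    SOLR = "http://localhost:8983/solr/collection1"

    def suggestions(word, only_more_popular):
        params = {"q": word, "wt": "json",
                  "spellcheck": "true",
                  "spellcheck.q": word,
                  "spellcheck.count": 10,
                  "spellcheck.onlyMorePopular": str(only_more_popular).lower()}
        resp = requests.get(SOLR + "/standardWithSpell", params=params).json()
        return resp.get("spellcheck", {}).get("suggestions", [])

    print(suggestions("bnak", True))   # only terms more frequent than "bnak"
    print(suggestions("bnak", False))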

On Wed, Dec 17, 2014 at 10:41 AM, Erick Erickson erickerick...@gmail.com
wrote:

 First, I'd look in your corpus for bnak. The problem with index-based
 suggestions is that if your index contains garbage, they're correctly
 spelled since they're in the index. TermsComponent is very useful for
 this.

 You can also loosen up the match criteria, and as I remember the collations
 parameter does some permutations of the word (but my memory of how that
 works is shaky).

 Best,
 Erick

 On Wed, Dec 17, 2014 at 9:13 AM, Martin Dietze mdie...@gmail.com wrote:
  I recently upgraded to SOLR 4.10.1 and after that set up the spell
  checker which I use for returning suggestions after searches with few
  or no results.
  When the spellchecker is active, this request handler is used (most of
  which is taken from examples I found in the net):
 
   <requestHandler name="standardWithSpell" class="solr.SearchHandler"
                   default="false">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <str name="spellcheck">true</str>
       <str name="spellcheck.onlyMorePopular">false</str>
       <str name="spellcheck.count">10</str>
       <str name="spellcheck.collate">false</str>
       <str name="q.alt">*:*</str>
       <str name="echoParams">explicit</str>
       <int name="rows">50</int>
       <str name="fl">*,score</str>
     </lst>
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
   </requestHandler>
 
  The search component is configured as follows (again most of it copied
  from examples in the net):
 
   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
     <str name="queryAnalyzerFieldType">text</str>
     <lst name="spellchecker">
       <str name="name">default</str>
       <str name="field">text</str>
       <str name="classname">solr.DirectSolrSpellChecker</str>
       <str name="distanceMeasure">internal</str>
       <float name="accuracy">0.3</float>
       <int name="maxEdits">2</int>
       <int name="minPrefix">1</int>
       <int name="maxInspections">5</int>
       <int name="minQueryLength">4</int>
       <float name="maxQueryFrequency">0.01</float>
       <float name="maxQueryFrequency">.01</float>
     </lst>
   </searchComponent>
 
  With this setup I can get suggestions for misspelled words. The
  results on my developer machine were mostly fine, but on the test
  system (much larger database, much larger search index) I found it
  very hard to get suggestions at all. If for instance I misspell “bank”
  as “bnak” I’d expect to get a suggestion for “bank” (since that word
  can be found in the index very often).
 
  I’ve played around with maxQueryFrequency and maxQueryFrequency with
  no success.
 
  Does anyone see any obvious misconfiguration? Anything that I could try?
 
  Any way I can debug this? (The problem is that my application uses the
  core API, which means trying out requests through the web interface
  does not work.)
 
  Any help would be greatly appreciated!
 
  Cheers,
 
  Martin
 
 
  --
  -- mdie...@gmail.com --/-- mar...@the-little-red-haired-girl.org
 
  - / http://herbert.the-little-red-haired-girl.org /
 -



Re: Tika HTTP 400 Errors with DIH

2014-12-08 Thread Dan Davis
I would say that you could determine a row that gives a bad URL, and then
run it in the DIH admin interface (or the command line) with debug enabled.
The url parameter going into Tika should be present in its transformed form
before the next entity gets going.   This works in a similar scenario for
me.
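(Concretely, something like this against the DIH handler shows what the url
ends up as before the Tika entity runs; a sketch with the Python requests
library, where the handler path and core name are assumptions.)

    import requests

    # DIH handler path and core name are assumptions - use whatever the
    # requestHandler is called in solrconfig.xml.
    DIH = "http://localhost:8983/solr/collection1/dataimport"

    # Run a single row through DIH in debug mode, without cleaning or committing,
    # and inspect the echoed documents/queries for the transformed DownloadURL.
    resp = requests.get(DIH, params={
        "command": "full-import",
        "debug": "true",
        "verbose": "true",
        "clean": "false",
        "commit": "false",
        "rows": "1",
        "wt": "json",
    })
    print(resp.json())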

On Tue, Dec 2, 2014 at 1:19 PM, Teague James teag...@insystechinc.com
wrote:

 Hi all,

 I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
 field. In the DIH Tika uses that field to fetch and parse the documents.
 The
 URL from the field is valid and will download the document in the browser
 just fine. But Tika is getting HTTP response code 400. Any ideas why?

 ERROR
 BinURLDataSource
 java.io.IOException: Server returned HTTP response code: 400 for URL:

 EntityProcessorWrapper
 Exception in entity :
 tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException:
 Exception in invoking url

 DIH
 <dataConfig>
   <dataSource type="JdbcDataSource"
               name="ds-1"
               driver="net.sourceforge.jtds.jdbc.Driver"
               url="jdbc:jtds:sqlserver://1.2.3.4/database;instance=INSTANCE;user=USER;password=PASSWORD" />

   <dataSource type="BinURLDataSource" name="ds-2" />

   <document>
     <entity name="db_content" dataSource="ds-1"
             transformer="ClobTransformer, RegexTransformer"
             query="SELECT ContentID, DownloadURL FROM DATABASE.VIEW">
       <field column="ContentID" name="id" />
       <field column="DownloadURL" clob="true" name="DownloadURL" />

       <entity name="tika_content"
               processor="TikaEntityProcessor" url="${db_content.DownloadURL}"
               onError="continue" dataSource="ds-2">
         <field column="TikaParsedContent" />
       </entity>

     </entity>
   </document>
 </dataConfig>

 SCHEMA - Fields
 <field name="DownloadURL" type="string" indexed="true" stored="true" />
 <field name="TikaParsedContent" type="text_general" indexed="true"
        stored="true" multiValued="true"/>






DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
When I have a forEach attribute like the following:


forEach=/medical-topics/medical-topic/health-topic[@language='English']

And then need to match an attribute of that, is there any alternative to
spelling it all out:

 field column=url
xpath=/medical-topics/medical-topic/health-topic[@language='English']/@url/

I suppose I could do //health-topic/@url since the document should then
have a single health-topic (as long as I know they don't nest).


Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
In experimentation with a much simpler and smaller XML file, it doesn't
look like '//health-topic/@url' will work, nor will '//@url', etc.   So
far, only spelling it all out will work.
With child elements, such as title, an xpath of //title works fine, but
it is beginning to seem dangerous.

Is there any short-hand for the current node or the match?

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:

 When I have a forEach attribute like the following:


 forEach=/medical-topics/medical-topic/health-topic[@language='English']

 And then need to match an attribute of that, is there any alternative to
 spelling it all out:

  field column=url
 xpath=/medical-topics/medical-topic/health-topic[@language='English']/@url/

 I suppose I could do //health-topic/@url since the document should then
 have a single health-topic (as long as I know they don't nest).




Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
The problem is that XPathEntityProcessor implements XPath on its own, and
implements only a subset of XPath.  So, if the input document is small enough,
it makes no sense to fight it.   One possibility is to apply an XSLT to the
file before processing it.

This blog post
http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
shows a worked example.   The XSL transform takes place before the forEach
or field specifications, which is the principal question I had about it
from the documentation.  This is also illustrated in the initQuery()
private method of XPathEntityProcessor.You can see the transformation
being applied before the forEach.  This will not scale to extremely large
XML documents including millions of rows - that is why they have the
stream=true argument there, so that you don't preprocess the document.
In my case, the entire XML file is 29M, and so I think I could do the XSL
transformation and then do the forEach per document.

This potentially shortens my time frame for moving to Apache Solr
substantially, because the common case with our previous indexer is to run
XSLT to transform to the document format desired by the indexer.
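(For what it's worth, the same transform can also be done completely outside
DIH; a sketch with lxml, where transform.xsl and the file names are
placeholders for whatever you already run with your current indexer.  At 29M
the whole file fits in memory comfortably.)

    from lxml import etree  # third-party: pip install lxml

    # Apply the XSLT ahead of time so the DIH forEach/xpath only has to deal
    # with a flat, per-record document.
    transform = etree.XSLT(etree.parse("transform.xsl"))
    flattened = transform(etree.parse("medical-topics.xml"))

    with open("medical-topics-flat.xml", "w") as out:
        out.write(str(flattened))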

On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 I don't believe there are any alternatives. At least I could not get
 anything but the full path to work.

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 8 December 2014 at 17:01, Dan Davis dansm...@gmail.com wrote:
  In experimentation with a much simpler and smaller XML file, it doesn't
  look like '//health-topic/@url will not work, nor will '//@url' etc.
 So
  far, only spelling it all out will work.
  With child elements, such as title, an xpath of //title works fine,
 but
  it  is beginning to same dangerous.
 
  Is there any short-hand for the current node or the match?
 
  On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:
 
  When I have a forEach attribute like the following:
 
 
 
 forEach=/medical-topics/medical-topic/health-topic[@language='English']
 
  And then need to match an attribute of that, is there any alternative to
  spelling it all out:
 
   field column=url
 
 xpath=/medical-topics/medical-topic/health-topic[@language='English']/@url/
 
  I suppose I could do //health-topic/@url since the document should
 then
  have a single health-topic (as long as I know they don't nest).
 
 



Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
Yes, that worked quite well.   I still need the //tagname but that is the
only DIH incantation I need.   This will substantially accelerate things.

On Mon, Dec 8, 2014 at 5:37 PM, Dan Davis d...@danizen.net wrote:

 The problem is that XPathEntityProcessor implements Xpath on its own, and
 implements a subset of XPath.  So, if the input document is small enough,
 it makes no sense to fight it.   One possibility is to apply an XSLT to the
 file before processing ite

 This blog post
 http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx
 shows a worked example.   The XSL transform takes place before the forEach
 or field specifications, which is the principal question I had about it
 from the documentation.  This is also illustrated in the initQuery()
 private method of XPathEntityProcessor.You can see the transformation
 being applied before the forEach.  This will not scale to extremely large
 XML documents including millions of rows - that is why they have the
 stream=true argument there, so that you don't preprocess the document.
 In my case, the entire XML file is 29M, and so I think I could do the XSL
 transformation and then do for each document.

 This potentially shortens my time frame of moving to Apache Solr
 substantially, because the common case with our previous indexer is to run
 XSLT to trasform to the document format desired by the indexer.

 On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

 I don't believe there are any alternatives. At least I could not get
 anything but the full path to work.

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 8 December 2014 at 17:01, Dan Davis dansm...@gmail.com wrote:
  In experimentation with a much simpler and smaller XML file, it doesn't
  look like '//health-topic/@url will not work, nor will '//@url' etc.
   So
  far, only spelling it all out will work.
  With child elements, such as title, an xpath of //title works fine,
 but
  it  is beginning to same dangerous.
 
  Is there any short-hand for the current node or the match?
 
  On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis dansm...@gmail.com wrote:
 
  When I have a forEach attribute like the following:
 
 
 
 forEach=/medical-topics/medical-topic/health-topic[@language='English']
 
  And then need to match an attribute of that, is there any alternative
 to
  spelling it all out:
 
   field column=url
 
 xpath=/medical-topics/medical-topic/health-topic[@language='English']/@url/
 
  I suppose I could do //health-topic/@url since the document should
 then
  have a single health-topic (as long as I know they don't nest).
 
 





Logging in Solr's DataImportHandler

2014-12-05 Thread Dan Davis
I have a script transformer and a log transformer, and I'm not seeing the
log messages, at least not where I expect.
Is there any way I can simply log a custom message from within my script?
Can the script easily interact with its container's logger?


Fwd: Best Practices for open source pipeline/connectors

2014-11-10 Thread Dan Davis
The volume and influx rate in my scenario are very modest.  Our largest
collections with existing indexing software is about 20 million objects,
second up is about 5 million, and more typical collections are in the tens
of thousands.   Aside from the 20 million object corpus, we re-index and
replicate nightly.

Note that I am not responsible for any specific operation, only for
advising my organization on how to go.   My organization wants to
understand how much programming will be involved using Solr rather than
higher level tools.   I have to acknowledge that our current solution
involves less programming, even as I urge them to think of programming as
not a bad thing ;)   From my perspective, 'programming', that is,
configuration files in a git archive (with internal comments and commit
comments) is much, much more productive than using form-based configuration
software.  So, my organization's needs and mine may be different...

-- Forwarded message --
From: Jürgen Wagner (DVT) juergen.wag...@devoteam.com
Date: Tue, Nov 4, 2014 at 4:48 PM
Subject: Re: Best Practices for open source pipeline/connectors
To: solr-user@lucene.apache.org


 Hello Dan,
  ManifoldCF is a connector framework, not a processing framework.
Therefore, you may try your own lightweight connectors (which usually are
not really rocket science and may take less time to write than time to
configure a super-generic connector of some sort), any connector out there
(including Nutch and others), or even commercial offerings from some
companies. That, however, won't make you very happy all by itself - my
guess. Key to really creating value out of data dragged into a search
platform is the processing pipeline. Depending on the scale of data and the
amount of processing you need to do, you may have a simplistic approach
with just some more or less configurable Java components massaging your
data until it can be sent to Solr (without using Tika or any other
processing in Solr), or you can employ frameworks like Apache Spark to
really heavily transform and enrich data before feeding them into Solr.

I prefer to have a clear separation between connectors, processing,
indexing/querying and front-end visualization/interaction. Only the
indexing/querying task I grant to Solr (or naked Lucene or Elasticsearch).
Each of the different task types has entirely different scaling
requirements and computing/networking properties, so you definitely don't
want them depend on each other too much. Addressing the needs of several
customers, one needs to even swap one or the other component in favour of
what a customer prefers or needs.

So, my answer is YES. But we've also tried Nutch, our own specialized
crawlers and a number of elaborate connectors for special customer
applications. In any case, the result of that connector won't go into Solr.
It will go into processing. From there it will go into Solr. I suspect that
connectors won't be the challenge in your project. Solr requires a bit of
tuning and tweaking, but you'll be fine eventually. Document processing
will be the fun part. As you come to scaling the zoo of components, this
will become evident :-)

What is the volume and influx rate in your scenario?

Best regards,
--Jürgen



On 04.11.2014 22:01, Dan Davis wrote:

I'm trying to do research for my organization on the best practices for
open source pipeline/connectors.   Since we need Web Crawls, File System
crawls, and Databases, it seems to me that Manifold CF might be the best
case.

Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or
DataImportHandler?   It would be nice to decide in ManifoldCF which
resultHandler should receive a document or id; barring that, you can post
some fields including a URL and have Data Import Handler handle it - it
already supports scripts whereas ManifoldCF may not at this time.

Suggestions and ideas?

Thanks,

Dan




-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center Intelligence
 Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
--
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071


Re: Tika Integration problem with DIH and JDBC

2014-11-04 Thread Dan Davis
All,

The problem here was that I gave driver=BinURLDataSource rather than
type=BinURLDataSource.   Of course, saying driver=BinURLDataSource
caused it not to be able to find it.


Best Practices for open source pipeline/connectors

2014-11-04 Thread Dan Davis
I'm trying to do research for my organization on the best practices for
open source pipeline/connectors.   Since we need Web Crawls, File System
crawls, and Databases, it seems to me that Manifold CF might be the best
case.

Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or
DataImportHandler?   It would be nice to decide in ManifoldCF which
resultHandler should receive a document or id; barring that, you can post
some fields including a URL and have Data Import Handler handle it - it
already supports scripts whereas ManifoldCF may not at this time.

Suggestions and ideas?

Thanks,

Dan


Re: Best Practices for open source pipeline/connectors

2014-11-04 Thread Dan Davis
We are looking at LucidWorks, but also want to see what we can do on our
own so we can evaluate the value-add of Lucid Works among other products.

On Tue, Nov 4, 2014 at 4:13 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 And, just to get the stupid question out of the way, you prefer to pay
 in developer integration time rather than in purchase/maintenance
 fees?

 Because, otherwise, I would look at LucidWorks commercial offering
 first, even to just have a comparison.

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 4 November 2014 16:01, Dan Davis dansm...@gmail.com wrote:
  I'm trying to do research for my organization on the best practices for
  open source pipeline/connectors.   Since we need Web Crawls, File System
  crawls, and Databases, it seems to me that Manifold CF might be the best
  case.
 
  Has anyone combined ManifestCF with Solr UpdateRequestProcessors or
  DataImportHandler?   It would be nice to decide in ManifestCF which
  resultHandler should receive a document or id, barring that, you can post
  some fields including an URL and have Data Import Handler handle it - it
  already supports scripts whereas ManifestCF may not at this time.
 
  Suggestions and ideas?
 
  Thanks,
 
  Dan



Re: javascript form data save to XML in server side

2014-10-22 Thread Dan Davis
I always, always have a web application running that accepts the JavaScript
AJAX call and then forwards it on to the Apache Solr request handler.  Even
if you don't control the web application, and can only add JavaScript, you
can put up an API-oriented webapp somewhere that only protects Solr for a
couple of posts.  Then, you can use CORS or JSONP to facilitate interaction
between the main web application and the ancillary webapp providing APIs
for Solr integration.

Of course, this only applies if you don't control the primary
application.   If you can use a Drupal or Typo3 to front-end Solr, then
this is a great way to solve the problem.
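(As a sketch of what that ancillary webapp can look like - here a Django view
that only forwards a whitelisted set of parameters to the /select handler; the
Solr URL, core name and parameter whitelist are placeholders.)

    import requests
    from django.http import JsonResponse

    SOLR_SELECT = "http://localhost:8983/solr/collection1/select"
    ALLOWED = {"q", "fq", "start", "rows", "sort"}  # keep Solr itself off the wire

    def solr_proxy(request):
        # Forward only whitelisted params from the browser to Solr.
        params = {k: request.GET.getlist(k) for k in ALLOWED if k in request.GET}
        params["wt"] = "json"
        upstream = requests.get(SOLR_SELECT, params=params, timeout=10)
        return JsonResponse(upstream.json(), safe=False)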

On Mon, Oct 20, 2014 at 11:02 PM, LongY zhangyulin8...@hotmail.com wrote:

 thank you very much. Alex. You reply is very informative and I really
 appreciate it. I hope I would be able to help others in this forum like you
 are in the future.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/javascript-form-data-save-to-XML-in-server-side-tp4165025p4165066.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problem with DIH

2014-10-16 Thread Dan Davis
This seems a little abstract.   What I'd do is double check that the SQL is
working correctly by running the stored procedure outside of Solr and see
what you get.   You should also be able to look at the corresponding
.properties file and see the inputs used for the delta import.  If the data
import XML is called dih-example.xml, then the properties file should be
called dih-example.properties and be in the same conf directory (for the
collection).Example contents are:

#Fri Oct 10 14:53:44 EDT 2014
last_index_time=2014-10-10 14\:53\:44
healthtopic.last_index_time=2014-10-10 14\:53\:44

Again, I'm suggesting you double check that the SQL is working correctly.
If that isn't the problem, provide more details on your data import
handler, e.g. the XML with some modifications (no passwords).
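(One way to double-check the SQL outside Solr: read last_index_time out of
that .properties file and run the same stored procedure with it by hand, then
compare the row count with what DIH reports.  A small sketch for the
properties part; the file path is just an example.)

    import configparser

    def last_index_time(path="conf/dih-example.properties"):
        # DIH writes a plain Java .properties file; fake a section header
        # so configparser will read it.
        cp = configparser.ConfigParser()
        with open(path) as f:
            cp.read_string("[dih]\n" + f.read())
        return cp["dih"]["last_index_time"].replace("\\:", ":")

    since = last_index_time()
    print("DIH thinks the last import was at", since)
    # Now run the delta stored procedure by hand with this timestamp
    # (pyodbc/cx_Oracle/etc.) and compare the row count to what DIH processed.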

On Thu, Oct 16, 2014 at 2:11 AM, Jay Potharaju jspothar...@gmail.com
wrote:

 Hi
 I 'm using DIH for updating my core. I 'm using store procedure for doing a
 full/ delta imports. In order to avoid running delta imports for a long
 time, i limit the rows returned to a max of 100,000 rows at a given time.
 On an average the delta import runs for less than 1 minute.

 For the last couple of days I have been noticing that my delta imports has
 been running for couple of hours and tries to update all the records in the
 core. I 'm not sure why that has been happening. I cant reproduce this
 event all the time, it happens randomly.

 Has anyone noticed this kind of behavior. And secondly are there any solr
 logs that will tell me what is getting updated or what exactly is happening
 at the DIH ?
 Any suggestion appreciated.

 Document size: 20 million
 Solr 4.9
 3 Nodes in the solr cloud.


 Thanks
 J



Re: import solr source to eclipse

2014-10-16 Thread Dan Davis
I had a problem with the ant eclipse answer - it was unable to resolve
javax.activation for the Javadoc.  Updating
solr/contrib/dataimporthandler-extras/ivy.xml
as follows did the trick for me:

-  <dependency org="javax.activation" name="activation"
       rev="${/javax.activation/activation}" conf="compile->*"/>
+  <dependency org="javax.activation" name="activation"
       rev="${/javax.activation/activation}" conf="compile->default"/>

What I'm trying to do is to construct a failing Unit test for something
that I think is a bug.   But the first thing is to be able to run tests,
probably in eclipse, but the command-line might be good enough although not
ideal.


On Tue, Oct 14, 2014 at 10:38 AM, Erick Erickson erickerick...@gmail.com
wrote:

 I do exactly what Anurag mentioned, but _only_ when what
 I want to debug is, for some reason, not accessible via unit
 tests. It's very easy to do.

 It's usually much faster though to use unit tests, which you
 should be able to run from eclipse without starting a server
 at all. In IntelliJ, you just ctrl-click on the file and the menu
 gives you a choice of running or debugging the unit test, I'm
 sure Eclipse does something similar.

 There are zillions of units to choose from, and for new development
 it's a Good Thing to write the unit test first...

 Good luck!
 Erick

 On Tue, Oct 14, 2014 at 1:37 AM, Anurag Sharma anura...@gmail.com wrote:
  Another alternative is launch the jetty server from outside and attach it
  remotely from eclipse.
 
  java -Xdebug
 -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=7666
  -jar start.jar
  The above command waits until the application attach succeed.
 
 
  On Tue, Oct 14, 2014 at 12:56 PM, Rajani Maski rajinima...@gmail.com
  wrote:
 
  Configure eclipse with Jetty plugin. Create a Solr folder under your
  Solr-Java-Project and Run the project [Run as] on Jetty Server.
 
  This blog[1] may help you to configure Solr within eclipse.
 
 
  [1]
 
 http://hokiesuns.blogspot.in/2010/01/setting-up-apache-solr-in-eclipse.html
 
  On Tue, Oct 14, 2014 at 12:06 PM, Ali Nazemian alinazem...@gmail.com
  wrote:
 
   Thank you very much for your guides but how can I run solr server
 inside
   eclipse?
   Best regards.
  
   On Mon, Oct 13, 2014 at 8:02 PM, Rajani Maski rajinima...@gmail.com
   wrote:
  
Hi,
   
The best tutorial for setting up Solr[solr 4.7] in
 eclipse/intellij  is
documented in Solr In Action book, Apendix A, *Working with the Solr
codebase*
   
   
On Mon, Oct 13, 2014 at 6:45 AM, Tomás Fernández Löbbe 
tomasflo...@gmail.com wrote:
   
 The way I do this:
 From a terminal:
 svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/
 lucene-solr-trunk
 cd lucene-solr-trunk
 ant eclipse

 ... And then, from your Eclipse import existing java project,
 and
select
 the directory where you placed lucene-solr-trunk

 On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian 
 alinazem...@gmail.com
  
 wrote:

  Hi,
  I am going to import solr source code to eclipse for some
  development
  purpose. Unfortunately every tutorial that I found for this
 purpose
   is
  outdated and did not work. So would you please give me some hint
   about
 how
  can I import solr source code to eclipse?
  Thank you very much.
 
  --
  A.Nazemian
 

   
  
  
  
   --
   A.Nazemian
  
 



Tika Integration problem with DIH and JDBC

2014-10-10 Thread Dan Davis
What I want to do is to pull an URL out of an Oracle database, and then use
TikaEntityProcessor and BinURLDataSource to go fetch and process that
URL.   I'm having a problem with this that seems general to JDBC with Tika
- I get an exception as follows:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query:
http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document
# 14
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
...

Steps to reproduce any problem should be:


   - Try it with the XML and verify you get two documents and they contain
   text (schema browser with the text field)
   - Try it with a JDBC sqlite3 dataSource and verify that you get an
   exception, and advise me what may be the problem in my configuration ...

Now, I've tried this 3 ways:


   - My Oracle database - fails as above
   - An SQLite3 database to see if it is Oracle specific - fails with
   Unable to execute query, but doesn't have the URL as part of the message.
   - An XML file listing two URLs - succeeds without error.

For the SQL attempts, setting onError=skip leads the data from the
database to be indexed, but the exception is logged for each root entity.
I can tell that nothing is indexed from the text extraction by browsing the
text field from the schema browser and seeing how few terms there are.
The exceptions also sort of give it away, but it is good to be careful :)

This is using:

   - Tomcat 7.0.55
   - Solr 4.10.1
   - and JDBC drivers
  - ojdbc7.jar
  - sqlite-jdbc-3.7.2.jar

Excerpt of solrconfig.xml:

  <!-- Data Import Handler for Health Topics -->
  <requestHandler name="/dih-healthtopics" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-healthtopics.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that imports a single URL via Tika -->
  <requestHandler name="/dih-smallxml" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallxml.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that imports a single URL via Tika -->
  <requestHandler name="/dih-smallsqlite" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallsqlite.xml</str>
    </lst>
  </requestHandler>


The data import handlers and a copy-paste from Solr logging are attached.
Exception in entity : 
extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable 
to execute query:  Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:283)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
at 
org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:502)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at 
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:189)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at 

Re: Tika Integration problem with DIH and JDBC

2014-10-10 Thread Dan Davis
Thanks, Alexandre.   My role is to kick the tires on this.   We're trying
it a couple of different ways.   So, I'm going to assume this could be
resolved and move on to trying ManifoldCF and see whether it can do similar
things for me, e.g. what it adds for free to our bag of tricks.

On Fri, Oct 10, 2014 at 3:16 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 I would concentrate on the stack traces and try reading them. They
 often provide a lot of clues. For example, you original stack trace
 had


 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:283)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
 2) at
 org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
 at
 org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
 1) at
 org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
 at
 org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)

 I added 1) and 2) to show the lines of importance. You can see in 1)
 that your TikaEntityProcessor is calling 2) JdbcDataSource, which was
 not what you wanted as you specified BinDataSource. So, you focus on
 that until it gets resolved.

 Sometimes these happens when the XML file says 'datasource' instead of
 'dataSource' (DIH is case-sensitive), but it does not seem to be the
 case in your situation.

 Regards,
 Alex.
 P.s. If you still haven't figure it out, mention the Solr version on
 the next email. Sometimes it makes difference, though DIH has been
 largely unchanged for a while.

 -- Forwarded message --
 From: Dan Davis d...@danizen.net
 Date: 10 October 2014 15:00
 Subject: Re: Tika Integration problem with DIH and JDBC
 To: Alexandre Rafalovitch arafa...@gmail.com


 The definition of dataSource name=bin type=BinURLDataSource is in
 each of the dih-*.xml files.
 But only the xml version has the definition at the top, above the document.

 Moving the dataSource definition to the top does change the behavior,
 now I get the following error for that entity:

 Exception in entity :
 extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
 JDBC URL or JNDI name has to be specified Processing Document # 30

 When I changed it to specify url=, it then reverted to form:

 Exception in entity :
 extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
 Unable to execute query: http://www.cdc.gov/flu/swineflu/ Processing
 Document # 1
 at
 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)

 It does seem to be a problem resolving the dataSource in some way.   I
 did double check another part of solrconfig.xml therefore.   Since the
 XML example still works, I guess I know it has to be there.

    <lib dir="${solr.solr.home:}/dist/"
         regex="solr-dataimporthandler-.*\.jar" />

    <lib dir="${solr.solr.home:}/contrib/extraction/lib" regex=".*\.jar" />
    <lib dir="${solr.solr.home:}/dist/" regex="solr-cell-\d.*\.jar" />

    <lib dir="${solr.solr.home:}/contrib/clustering/lib/" regex=".*\.jar" />
    <lib dir="${solr.solr.home:}/dist/" regex="solr-clustering-\d.*\.jar" />

    <lib dir="${solr.solr.home:}/contrib/langid/lib/" regex=".*\.jar" />
    <lib dir="${solr.solr.home:}/dist/" regex="solr-langid-\d.*\.jar" />

    <lib dir="${solr.solr.home:}/contrib/velocity/lib" regex=".*\.jar" />
    <lib dir="${solr.solr.home:}/dist/" regex="solr-velocity-\d.*\.jar" />


 On Fri, Oct 10, 2014 at 2:37 PM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
 
  You say dataSource='bin' but I don't see you defining that datasource.
 E.g.:
 
   <dataSource type="BinURLDataSource" name="bin"/>
 
  So, there might be some weird default fallback that's just causes
  strange problems.
 
  Regards,
  Alex.
 
  Personal: http://www.outerthoughts.com/ and @arafalov
  Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
  Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
 
 
  On 10 October 2014 14:17, Dan Davis dansm...@gmail.com wrote:
  
   What I want to do is to pull an URL out of an Oracle database, and
 then use
   TikaEntityProcessor and BinURLDataSource to go fetch and process that
 URL.
   I'm having a problem with this that seems general to JDBC with Tika -
 I get
   an exception as follows:
  
   Exception in entity :
   extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
   Unable to execute query:
 http://www.cdc.gov/healthypets/pets/wildlife.html
   Processing Document # 14
 at
  
 org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
   ...
  
   Steps to reproduce any problem should be:
  
   Try it with the XML and verify you get two documents and they contain
 text

Re: Nagle's Algorithm

2013-09-29 Thread Dan Davis
I don't keep up with this list well enough to know whether anyone else
answered.  I don't know how to do it in jetty.xml, but you can certainly
tweak the code.   java.net.Socket has a method setTcpNoDelay() that
corresponds with the standard Unix system calls.

Long-time past, my suggestion of this made Apache Axis 2.0 250ms faster per
call (1).   Now I want to know whether Apache Solr sets it.

One common way to test the overhead portion of latency is to project the
latency for a zero size request based on larger requests.   What you do is
to warm requests (all in memory) for progressively fewer and fewer
rows.   You can make requests for 100, 90, 80, 70 ... 10 rows each more
than once so that all is warmed.   If you plot this, it should look like a
linear function latency(rows) = m*rows + b since all is cached in
memory.   You have to control what else is going on on the server to get
the linear plot of course - it can be quite hard to get this to work right
on modern Linux.   But once you have it, you can simply calculate latency(0)
and you have the latency for a theoretical zero-size request.
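(A sketch of that measurement with Python and the requests library - time
/select for 10..100 rows, fit latency(rows) = m*rows + b by least squares, and
read off the intercept b as the per-request overhead.  URL and query are
placeholders, and the box should otherwise be idle.)

    import time, statistics, requests

    SOLR = "http://localhost:8983/solr/collection1/select"

    def timed(rows, repeats=5):
        samples = []
        for _ in range(repeats):              # repeat so everything is warm
            t0 = time.time()
            requests.get(SOLR, params={"q": "*:*", "rows": rows, "wt": "json"})
            samples.append(time.time() - t0)
        return min(samples)                   # best case ~ served from memory

    xs = list(range(10, 101, 10))
    ys = [timed(r) for r in xs]

    # least-squares fit: latency(rows) = m*rows + b; b is the zero-row overhead
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    print("projected latency for a zero-size request: %.1f ms" % (b * 1000))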

This is a tangential answer at best - I wish I just knew a setting to give
you.

(1) Latency Performance of SOAP Implementations,
http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.21.8556&type=ab


On Sun, Sep 29, 2013 at 9:22 PM, William Bell billnb...@gmail.com wrote:

 How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?

 Is there an option in jetty.xml ?

 /* Create new stream socket */

 sock = *socket*( AF_INET, SOCK_STREAM, 0 );



 /* Disable the Nagle (TCP No Delay) algorithm */

 flag = 1;

 ret = *setsockopt*( sock, IPPROTO_TCP, TCP_NODELAY, (char *)flag,
 sizeof(flag) );




 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Excluding a facet's constraint to exclude a facet

2013-09-24 Thread Dan Davis
Summary - when constraining a search using filter query, how can I exclude
the constraint for a particular facet?

Detail - Suppose I have the following facet results for a query q=*mainquery*:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="foo">
      <int name="A">491</int>
      <int name="B">111</int>
      <int name="C">103</int>
      ...
    </lst>
    ...

I understand from http://people.apache.org/~hossman/apachecon2010/facets/
and Wiki documentation that I can limit results to category A as follows:

fq={!raw f=foo}A

But I cannot seem to (Solr 3.6.1) exclude that way:

fq={!raw f=foo}-A

And the simpler test (with edismax) doesn't work either:

fq=foo:A# works
fq=foo:-A   # doesn't work

Do I need to be using facet.method=enum to get this to work?   What else
could be the problem here?
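(For the record, the pure negative filter form is fq=-foo:A - the minus goes
in front of the field, not the value - and the tag/ex local params let a facet
ignore its own filter so the counts stay unconstrained.  A sketch with the
Python requests library; core and field names as in the example above.)

    import requests

    SOLR = "http://localhost:8983/solr/collection1/select"

    params = {
        "q": "mainquery", "wt": "json",
        "facet": "true",
        # tag the filter so the facet can ignore it
        "fq": "{!tag=fooTag}-foo:A",        # exclude category A (note leading -)
        "facet.field": "{!ex=fooTag}foo",   # counts as if the fq were not applied
    }
    print(requests.get(SOLR, params=params).json()["facet_counts"])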


Re: Storing query results

2013-08-28 Thread Dan Davis
You could copy the existing core to a new core every once in awhile, and
then do your delta indexing into a new core once the copy is complete.  If
a Persistent URL for the search results included the name of the original
core, the results you would get from a bookmark would be stable.  However,
if you went to the site and did a new search, you would be searching the
newest core.

This I think applies whether the site is Intranet or not.

Older cores could be aged out gracefully, and the search handler for an old
core could be replaced by a search on the new core via sharding.


On Fri, Aug 23, 2013 at 11:57 AM, jfeist jfe...@llminc.com wrote:

 I completely agree.  I would prefer to just rerun the search each time.
 However, we are going to be replacing our rdb based search with something
 like Solr, and the application currently behaves this way.  Our users
 understand that the search is essentially a snapshot (and I would guess
 many
 prefer this over changing results) and we don't want to change existing
 behavior and confuse anyone.  Also, my boss told me it unequivocally has to
 be this way :p

 Thanks for your input though, looks like I'm going to have to do something
 like you've suggested within our application.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182p4086349.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to Manage RAM Usage at Heavy Indexing

2013-08-28 Thread Dan Davis
This could be an operating system problem rather than a Solr problem.
CentOS 6.4 (Linux kernel 2.6.32) may have some issues with page flushing,
and I would read up on that.
The VM parameters can be tuned in /etc/sysctl.conf.
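(The dirty-page knobs are the ones I had in mind; for example, in
/etc/sysctl.conf, with values that need tuning for your hardware rather than
copying verbatim:)

    # flush dirty pages sooner and in smaller batches
    vm.dirty_background_ratio = 5
    vm.dirty_ratio = 40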


On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.comwrote:

 Hi Erick;

 I wanted to get a quick answer that's why I asked my question as that way.

 Error is as follows:

 INFO  - 2013-08-21 22:01:30.978;
 org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
 webapp=/solr path=/update params={wt=javabinversion=2}
 {add=[com.deviantart.reachmeh
 ere:http/gallery/, com.deviantart.reachstereo:http/,
 com.deviantart.reachstereo:http/art/SE-mods-313298903,
 com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/,
 co
 m.deviantart.reachthegoddess:http/art/retouched-160219962,
 com.deviantart.reachthegoddess:http/badges/,
 com.deviantart.reachthegoddess:http/favourites/,
 com.deviantart.reachthetop:http/
 art/Blue-Jean-Baby-82204657 (1444006227844530177),
 com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
 ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException;
 java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
 early EOF
 at

 com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
 at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
 at

 com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
 at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
 at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
 at
 org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
 at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
 at

 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
 at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
 at

 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at

 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at

 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at

 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at

 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at

 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
 at

 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at

 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: org.eclipse.jetty.io.EofException: early EOF
 at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
 at java.io.InputStream.read(InputStream.java:101)
 at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
 at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
 at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
 at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
 at

 

Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht p...@hoplahup.net wrote:

 Dan,

 if you're bound to federated search then I would say that you need to work
 on the service guarantees of each of the nodes and, maybe, create
 strategies to cope with bad nodes.

 paul


+1

I'll think on that.


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Years ago when Federated Search was a buzzword we did some development
 and
 testing with Lucene, FAST Search, Google and several other Search Engines
 according Federated Search in Library context.
 The results can be found here
 http://pub.uni-bielefeld.de/download/2516631/2516644
 Some minor parts are in German most is written in English.
 It also gives you an idea where to keep an eye on, where are the pitfalls
 and so on.
 We also had a tool called unity (written in Python) which did Federated
 Search on any Search Engine and
 Database, like Google, Gigablast, FAST, Lucene, ...
 The trick with Federated Search is to combine the results.
 We offered three options to the users search surface:
 - RoundRobin
 - Relevancy
 - PseudoRandom



Thanks much - Andrzej B. suggested I read Comparing top-k lists in
addition to his Berlin Buzzwords presentation.

I will know soon whether we are intent on this direction, right now I'm
still trying to think on how hard it will be.


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote:

 Would you like to create something like
 http://knimbus.com


I work at the National Library of Medicine.   We are moving our library
catalog to a newer platform, and we will probably include articles.   The
article's content and meta-data are available from a number of web-scale
discovery services such as PRIMO, Summon, EBSCO's EDS, EBSCO's traditional
API.   Most libraries use open source solutions to avoid the cost of
purchasing an expensive enterprise search platform.   We are big; we
already have a closed-source enterprise search engine (and our own home
grown Entrez search used for PubMed).Since we can already do Federated
Search with the above, I am evaluating the effort of adding such to Apache
Solr.   Because NLM data is used in the open relevancy project, we actually
have the relevancy decisions to decide whether we have done a good job of
it.

I obviously think it would be Fun to add Federated Search to Apache Solr.

*Standard disclosure* - my opinions do not represent the opinions of NIH
or NLM.   Fun is no reason to spend tax-payer money.   Enhancing Apache
Solr would reduce the risk of putting all our eggs in one basket, and
there may be some other relevant benefits.

We do use Apache Solr here for more than one other project... so keep up
the good work even if my working group decides to go with the closed-source
solution.


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
I have now come to the task of estimating man-days to add Blended Search
Results to Apache Solr.   The argument has been made that this is not
desirable (see Jonathan Rochkind's blog entries on Bento search with
blacklight).   But the estimate remains.   No estimate is worth much
without a design.   So, I have come to the difficulty of estimating this
without having an in-depth knowledge of the Solr core.   Here is my
design, likely imperfect, as it stands.

   - Configure a core specific to each search source (local or remote)
   - On cores that index remote content, implement a periodic delete query
   that deletes documents whose timestamp is too old
   - Implement a custom requestHandler for the remote cores that goes out
   and queries the remote source.   For each result in the top N
   (configurable), it computes an id that is stable (e.g. it is based on the
   remote resource URL, doi, or hash of data returned).   It uses that id to
   look-up the document in the lucene database.   If the data is not there, it
   updates the lucene core and sets a flag that commit is required.   Once it
   is done, it commits if needed.
   - Configure a core that uses a custom SearchComponent to call the
   requestHandler that goes and gets new documents and commits them.   Since
   the cores for remote content are different cores, they can restart their
   searcher at this point if any commit is needed.   The custom
   SearchComponent will wait for commit and reload to be completed.   Then,
   search continues uses the other cores as shards.
   - Auto-warming on this will assure that the most recently requested data
   is present.

It will, of course, be very slow a good part of the time.
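(For the stable id in the third bullet, a small sketch of one way to derive it -
prefer a DOI, then the remote URL, then a hash of the returned record:)

    import hashlib

    def stable_id(doi=None, url=None, raw_record=b""):
        """Stable id for a remote result, so a refresh updates rather than duplicates."""
        if doi:
            return "doi:" + doi.lower()
        if url:
            return "url:" + hashlib.sha1(url.encode("utf-8")).hexdigest()
        return "rec:" + hashlib.sha1(raw_record).hexdigest()

    print(stable_id(url="http://example.org/article/42"))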

Erik and others, I need to know whether this design has legs and what other
alternatives I might consider.



On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson erickerick...@gmail.comwrote:

 The lack of global TF/IDF has been answered in the past,
 in the sharded case, by usually you have similar enough
 stats that it doesn't matter. This pre-supposes a fairly
 evenly distributed set of documents.

 But if you're talking about federated search across different
 types of documents, then what would you rescore with?
 How would you even consider scoring docs that are somewhat/
 totally different? Think magazine articles an meta-data associated
 with pictures.

 What I've usually found is that one can use grouping to show
 the top N of a variety of results. Or show tabs with different
 types. Or have the app intelligently combine the different types
 of documents in a way that makes sense. But I don't know
 how you'd just get the right thing to happen with some kind
 of scoring magic.

 Best
 Erick


 On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote:

 I've thought about it, and I have no time to really do a meta-search
 during
 evaluation.  What I need to do is to create a single core that contains
 both of my data sets, and then describe the architecture that would be
 required to do blended results, with liberal estimates.

 From the perspective of evaluation, I need to understand whether any of
 the
 solutions to better ranking in the absence of global IDF have been
 explored?I suspect that one could retrieve a much larger than N set of
 results from a set of shards, re-score in some way that doesn't require
 IDF, e.g. storing both results in the same priority queue and *re-scoring*
 before *re-ranking*.

 The other way to do this would be to have a custom SearchHandler that
 works
 differently - it performs the query, retries all results deemed relevant
 by
 another engine, adds them to the Lucene index, and then performs the query
 again in the standard way.   This would be quite slow, but perhaps useful
 as a way to evaluate my method.

 I still welcome any suggestions on how such a SearchHandler could be
 implemented.





Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
First answer:

My employer is a library and does not have a license to harvest everything
indexed by a web-scale discovery service such as PRIMO or Summon.  If
our design automatically relays searches entered by users, and then
periodically purges results, I think it is reasonable from a licensing
perspective.

Second answer:

What if you wanted your Apache Solr powered search to include all results
from Google Scholar for any query?  Do you think you could easily or
cheaply configure a ZooKeeper cluster large enough to harvest and index all
of Google Scholar?  Would that violate robot rules?  Is it even possible
to do this from an API perspective?  Wouldn't Google notice?

Third answer:

On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
other Enterprise Search firm based on Apache Solr were dinged for the lack
of Federated Search.  I do not have the hubris to think I can fix that, and
it is not really my role to try, but something that works without
harvesting and local indexing is obviously desirable to Enterprise Search
users.



On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht p...@hoplahup.net wrote:


 Why not simply create a meta search engine that indexes everything from each
 of the nodes?
 (I think one calls this harvesting.)

 I believe that this is the way to avoid all sorts of performance bottlenecks.
 As far as I could analyze, the performance of a federated search is the
 performance of the slowest node, which can turn out to be quite bad if you
 cannot enforce guarantees on the remote sources.

 Or are the remote cores below actually things that you manage on your
 side? If yes, guarantees are easy to manage.

 Paul



Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
One more question here - is this topic more appropriate to a different list?








Re: Flushing cache without restarting everything?

2013-08-22 Thread Dan Davis
be careful with drop_caches - make sure you sync first


On Thu, Aug 22, 2013 at 1:28 PM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:

 I was afraid someone would tell me that... thanks for your input

  -Original Message-
  From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
  Sent: August-22-13 9:56 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Flushing cache without restarting everything?
 
  On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote:
   Is there a way to flush the cache of all nodes in a Solr Cloud (by
   reloading all the cores, through the collection API, ...) without
   having to restart all nodes?
 
  As MMapDirectory shares data with the OS disk cache, flushing of
  Solr-related caches on a machine should involve
 
  1) Shut down all Solr instances on the machine
  2) Clear the OS read cache ('sudo echo 1 > /proc/sys/vm/drop_caches' on
  a Linux box)
  3) Start the Solr instances
 
  I do not know of any Solr-supported way to do step 2. For our
  performance tests we use custom scripts to perform the steps.
 
  - Toke Eskildsen, State and University Library, Denmark
 
 



Removing duplicates during a query

2013-08-22 Thread Dan Davis
Suppose I have two documents with different ids, and there is another field,
for instance content-hash, which is something like a 16-byte hash of the
content.

Can Solr be configured to return just one copy, and drop the other if both
are relevant?

If Solr does drop one result, do you get any indication in the document
that was kept that there was another copy?


Re: How to avoid underscore sign indexing problem?

2013-08-22 Thread Dan Davis
Ah, but what is the definition of punctuation in Solr?


On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky j...@basetechnology.com wrote:

 I thought that the StandardTokenizer always split on punctuation, 

 Proving that you haven't read my book! The section on the standard
 tokenizer details the rules that the tokenizer uses (in addition to
 extensive examples.) That's what I mean by deep dive.

 -- Jack Krupansky

 -Original Message- From: Shawn Heisey
 Sent: Wednesday, August 21, 2013 10:41 PM
 To: solr-user@lucene.apache.org
 Subject: Re: How to avoid underscore sign indexing problem?


 On 8/21/2013 7:54 PM, Floyd Wu wrote:

 When using StandardAnalyzer to tokenize the string Pacific_Rim, I get:

 ST
 text: pacific_rim
 raw_bytes: [70 61 63 69 66 69 63 5f 72 69 6d]
 start: 0   end: 11   type: ALPHANUM   position: 1

 How can I make this string be tokenized into the two tokens Pacific and
 Rim?
 Should I set _ as a stopword?
 Please kindly help on this.
 Many thanks.


 Interesting.  I thought that the StandardTokenizer always split on
 punctuation, but apparently that's not the case for the underscore
 character.

 You can always use the WordDelimiterFilter after the StandardTokenizer.

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

 Thanks,
 Shawn
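
For what it's worth, here is a minimal Java sketch of that analysis chain
using Lucene's CustomAnalyzer (a newer API than the one current when this
thread was written); "standard", "wordDelimiterGraph", and "lowercase" are the
analysis factory names registered in recent Lucene releases.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class UnderscoreSplitDemo {
      public static void main(String[] args) throws Exception {
        // StandardTokenizer keeps "pacific_rim" whole; the word-delimiter
        // filter then splits it on the underscore.
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("wordDelimiterGraph", "generateWordParts", "1")
            .addTokenFilter("lowercase")
            .build();
        try (TokenStream ts = analyzer.tokenStream("text", "Pacific_Rim")) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.println(term);   // expect: pacific, rim
          }
          ts.end();
        }
      }
    }

In schema.xml the equivalent is a fieldType whose analyzer chains
solr.WordDelimiterFilterFactory (or its graph variant) after
solr.StandardTokenizerFactory, as the wiki page above describes.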



Re: More on topic of Meta-search/Federated Search with Solr

2013-08-22 Thread Dan Davis
You are right, but here's my null hypothesis for studying the impact on
relevance.  Hash the query to deterministically seed a random number
generator.  Pick one result from column A or column B at random.
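
A minimal sketch of that coin-flip null hypothesis, just to make the idea
concrete (the blend() helper and its arguments are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class CoinFlipBlender {
      // Hash the query to seed the RNG, so the same query always interleaves
      // corpus A and corpus B the same way.
      public static <T> List<T> blend(String query, List<T> a, List<T> b, int n) {
        Random rng = new Random(query.hashCode());   // deterministic per query
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (out.size() < n && (i < a.size() || j < b.size())) {
          boolean pickA = rng.nextBoolean();
          if (pickA && i < a.size())   out.add(a.get(i++));
          else if (j < b.size())       out.add(b.get(j++));
          else if (i < a.size())       out.add(a.get(i++));
        }
        return out;
      }
    }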

This is of course wrong - a query might find two non-relevant results in
corpus A and lots of relevant results in corpus B, leading to poor
precision because the two non-relevant documents are likely to show up on
the first page.  You can weight by the size of each corpus, but that
weighting is probably wrong for any specific query.

It was an interesting thought experiment though.

Erick,

Since LucidWorks was dinged in the 2013 Magic Quadrant on Enterprise Search
due to a lack of Federated Search, the for-profit Enterprise Search
companies must be doing it some way.  Maybe relevance suffers (a lot),
but you can do it if you want to.

I have read very little of the IR literature - enough to sound like I know
a little, but it is a very little.  If there is literature on this, it
would be an interesting read.







Re: Removing duplicates during a query

2013-08-22 Thread Dan Davis
OK - I see that this can be done with Field Collapsing/Grouping.  I also
see the mentions in the Wiki for avoiding duplicates using a 16-byte hash.

So, question withdrawn...
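
For anyone who lands here later, a minimal SolrJ sketch of the grouping
approach; the core URL and the content_hash field name are assumptions:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DedupQueryDemo {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/mycore").build()) {
          SolrQuery q = new SolrQuery("some query");
          q.set("group", true);
          q.set("group.field", "content_hash");  // the 16-byte hash field
          q.set("group.limit", 1);               // keep one document per hash value
          q.set("group.main", true);             // flatten back to a normal result list
          QueryResponse rsp = solr.query(q);
          for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));  // one id per distinct hash
          }
        }
      }
    }

Grouping drops the extra copies silently; if you need to know that a duplicate
existed, leave group.main off and read the per-group numFound instead.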


On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis dansm...@gmail.com wrote:

 Suppose I have two documents with different id, and there is another
 field, for instance content-hash which is something like a 16-byte hash
 of the content.

 Can Solr be configured to return just one copy, and drop the other if both
 are relevant?

 If Solr does drop one result, do you get any indication in the document
 that was kept that there was another copy?




Re: Prevent Some Keywords at Analyzer Step

2013-08-19 Thread Dan Davis
This is an interesting topic - my employer is a medical library, and there
are many keywords that may need to be aliased in various ways, and two- or
three-word phrases that perhaps should be treated specially.  Jack, can you
give me an example of how to do that sort of thing?  Perhaps I need to buy
your almost-released Deep Dive book...
Sorry to be too tangential - it is my strange way.


On Mon, Aug 19, 2013 at 12:32 PM, Jack Krupansky j...@basetechnology.com wrote:

 Okay, but what is it that you are trying to prevent??

 And, diet follower is a phrase, not a keyword or term.

 So, I'm still baffled as to what you are really trying to do. Try
 explaining it in plain English.

 And given this same input, how would it be queried?


 -- Jack Krupansky

 -Original Message- From: Furkan KAMACI
 Sent: Monday, August 19, 2013 11:22 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Prevent Some Keywords at Analyzer Step


 Let's assume that my sentence is that:

 *Alice is a diet follower*

 My special keyword = *diet follower*

 Tokens will be:

 Token 1) Alice
 Token 2) is
 Token 3) a
 Token 4) diet
 Token 5) follower
 Token 6) *diet follower*


 2013/8/19 Jack Krupansky j...@basetechnology.com

  Your example doesn't prevent any keywords.

 You need to elaborate the specific requirements with more detail.

 Given a long stream of text, what tokenization do you expect in the index?

 -- Jack Krupansky

 -Original Message-
 From: Furkan KAMACI
 Sent: Monday, August 19, 2013 8:07 AM
 To: solr-user@lucene.apache.org
 Subject: Prevent Some Keywords at Analyzer Step
 Hi;

 I want to write an analyzer that will prevent some special words. For
 example, the sentence to be indexed is:

 diet follower

 it will tokenize it like this:

 token 1) diet
 token 2) follower
 token 3) diet follower

 How can I do that with Solr?
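
One generic way to get a phrase token alongside the word tokens is a shingle
filter; note that it emits every adjacent pair ("Alice is", "is a", ...), not
only special keywords such as "diet follower", so restricting it to a fixed
phrase list would still take a keep-word or custom filter on top.  A minimal
Lucene sketch (CustomAnalyzer is a newer API; the names are factory names):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PhraseTokenDemo {
      public static void main(String[] args) throws Exception {
        // Unigrams plus adjacent-pair shingles.
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("shingle",
                "minShingleSize", "2",
                "maxShingleSize", "2",
                "outputUnigrams", "true")
            .build();
        try (TokenStream ts = analyzer.tokenStream("text", "Alice is a diet follower")) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.println(term);  // Alice, "Alice is", is, "is a", ... "diet follower"
          }
          ts.end();
        }
      }
    }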





More on topic of Meta-search/Federated Search with Solr

2013-08-16 Thread Dan Davis
I've thought about it, and I have no time to really do a meta-search during
evaluation.  What I need to do is to create a single core that contains
both of my data sets, and then describe the architecture that would be
required to do blended results, with liberal estimates.

From the perspective of evaluation, I need to understand whether any of the
solutions to better ranking in the absence of global IDF have been
explored.  I suspect that one could retrieve a much larger than N set of
results from a set of shards, re-score in some way that doesn't require
IDF, e.g. storing both result sets in the same priority queue and *re-scoring*
before *re-ranking*.
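
A sketch of that priority-queue idea, for concreteness; the IDF-free score
here is only a placeholder, and the "tf" and "source_boost" field names are
hypothetical:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    import org.apache.solr.common.SolrDocument;

    public class BlendedRescorer {
      // Whatever IDF-free score you settle on goes here; this proxy just
      // multiplies a stored term-frequency value by a per-source boost.
      static double rescore(SolrDocument doc) {
        Number tf = (Number) doc.getFieldValue("tf");
        Number boost = (Number) doc.getFieldValue("source_boost");
        return (tf == null ? 0.0 : tf.doubleValue())
             * (boost == null ? 1.0 : boost.doubleValue());
      }

      // Pull a larger-than-N candidate set from each shard, push everything
      // through one priority queue ordered by the new score, then take N.
      public static List<SolrDocument> blend(List<SolrDocument> shardA,
                                             List<SolrDocument> shardB, int n) {
        PriorityQueue<SolrDocument> pq =
            new PriorityQueue<>((d1, d2) -> Double.compare(rescore(d2), rescore(d1)));
        pq.addAll(shardA);
        pq.addAll(shardB);
        List<SolrDocument> top = new ArrayList<>(n);
        while (!pq.isEmpty() && top.size() < n) {
          top.add(pq.poll());
        }
        return top;
      }
    }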

The other way to do this would be to have a custom SearchHandler that works
differently - it performs the query, retrieves all results deemed relevant by
another engine, adds them to the Lucene index, and then performs the query
again in the standard way.   This would be quite slow, but perhaps useful
as a way to evaluate my method.

I still welcome any suggestions on how such a SearchHandler could be
implemented.


Meta-search by subclassing SearchHandler

2013-08-15 Thread Dan Davis
I am considering enabling a true Federated Search, or meta-search, using
the following basic configuration (this configuration is only for
development and evaluation):

Three Solr cores:

   - One to search data I have indexed locally
   - One with a custom SearchHandler that is a facade, e.g. it performs a
   meta-search (aka Federated Search)
   - One that queries and merges the above cores as shards

Lest I seem completely like Sauron, I read
http://2011.berlinbuzzwords.de/sites/2011.berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf
and am familiar with evaluating precision at 10, etc. although I am no
doubt less familiar with IR than many.

I think that it is much, much better for performance and relevancy to index
it all on a level playing field.  But my employer cannot do that, because
we do not have a license to all the data we may wish to search in the
future.

My questions are simple - has anybody implemented such a SearchHandler that
is a facade for another search engine?   How would I get started with that?
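
To make the question concrete, here is a minimal sketch of what such a facade
might look like; the RemoteHit type and fetchRemote() are placeholders for
whatever client the remote engine requires, and the field names are
assumptions:

    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.handler.component.SearchHandler;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;

    // Instead of searching the local index, forward the query to a remote
    // engine and report its hits as if they were local results.
    public class RemoteFacadeSearchHandler extends SearchHandler {

      @Override
      public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
          throws Exception {
        String q = req.getParams().get("q", "*:*");

        SolrDocumentList docs = new SolrDocumentList();
        for (RemoteHit hit : fetchRemote(q)) {       // call the remote engine
          SolrDocument doc = new SolrDocument();
          doc.setField("id", hit.id());
          doc.setField("title", hit.title());
          doc.setField("url", hit.url());
          docs.add(doc);
        }
        docs.setNumFound(docs.size());
        docs.setStart(0);
        rsp.add("response", docs);                   // looks like a normal result list
      }

      // Hypothetical remote result type and fetch method; not part of Solr.
      record RemoteHit(String id, String title, String url) {}

      private Iterable<RemoteHit> fetchRemote(String q) {
        throw new UnsupportedOperationException("wire up the remote engine here");
      }
    }

Registered in solrconfig.xml like any other requestHandler, the merging core
from the third bullet above could point at it as a shard, though distributed
two-phase querying and score comparability are exactly where the hard parts
start.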

I have made a similar post on the Blacklight developers Google group.