Re: Increase Physical Memory in Solr

2020-01-13 Thread Terry Steichen
Maybe solr isn't using enough of your available memory (a rough check is 
produced by 'solr status'). Do you realize you can start solr with a  
'-m xx' parameter? (for me, xx = 1g)
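
For example (a quick sketch - the 1g value is just what happens to work
on my machine):

    bin/solr stop
    bin/solr start -m 1g    # fixed 1GB heap
    bin/solr status         # the "memory" line shows usage vs. allocation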


Terry

On 1/13/20 3:12 PM, rhys J wrote:

On Mon, Jan 13, 2020 at 3:11 PM Gael Jourdan-Weil <
gael.jourdan-w...@kelkoogroup.com> wrote:


Hello,

If you are talking about "physical memory" as the bar displayed in the Solr
UI, that is the actual RAM your host has.
If you need more, you need more RAM, it's not related to Solr.



Thanks, that helped me understand what is going on.

I am going to ask to increase the RAM of the machine.

Rhys



Re: Newbie permissions problem running solr

2019-05-30 Thread Terry Steichen
For what it's worth - after not using it for some time, I just started
up my solr system (6.6.0) and made a mistake in the command line.  I
mistakenly used 'bin/solr start -c -m 1gb' and got precisely the same
error message as Bernard did (other than the '..' part).

When I changed it to the correct command ('bin/solr start -c -m 1g')
everything worked just fine.  Not sure how this fits with Bernard's
question, but it's interesting that I got the same error message (so
maybe that could in some way be related?)

Terry

On 5/30/19 3:19 PM, Joe Doupnik wrote:
>     One day I will learn to type. In the meanwhile the command, as
> root, is  chown -R solr:users solr. That means creating that username
> if it is not present.
>     Thanks,
>     Joe D.
>
> On 30/05/2019 20:12, Joe Doupnik wrote:
>> On 30/05/2019 20:04, Bernard T. Higonnet wrote:
>>> Hello,
>>>
>>> I have installed solr from ports under FreeBSD 12.0 and I am trying
>>> to run solr as described in the Solr Quick Start tutorial.
>>>
>>> I keep getting permission errors:
>>>
>>> /usr/local/solr/example/cloud/node2/solr/../logs  could not be
>>> created. Exiting
>>>
>>> Apart from the fact that I find it bizarre that it doesn't put its
>>> logs in some 'standard' writable place, the ".." perturbs me. Does
>>> it mean there's stuff there which I don't know what it is (but it
>>> doesn't want to tell me?). He knows how to write long messages so
>>> what's the problem?
>>>
>>> I have tried making various places writable, but clearly I don't
>>> know what the ".." means...
>>>
>>> Any help appreciated.
>>>
>>> TIA
>>> Bernard Higonnet
>> ---
>>     In my own work, now and then I encounter exactly that problem. I
>> then recall that the Solr material expects to be owned by user solr,
>> and group users on Linux. Thus a  chmod -R solr:users solr command
>> would take care of the problem.
>>     Thanks,
>>     Joe D.
>>
>
>
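
A minimal sketch of the ownership fix Joe describes (Linux syntax - on
FreeBSD you'd create the user with pw useradd - and the install path is
just an example):

    useradd -g users -s /bin/false solr   # create the solr user if missing
    chown -R solr:users /usr/local/solr   # hand the Solr tree to that user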


Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Terry Steichen
Using 6.6.0, I am able to index EML files just fine.  The trick is, when
indexing directories containing .eml files, add "-filetypes eml" to the
command line (note the plural "filetypes").
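
For example (a sketch - the collection name and path are placeholders):

    bin/post -c mymail -filetypes eml /data/mail/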

Terry Steichen

On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files, whereby
> the content is being extracted from the Content-type=text/html instead of
> Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains a lot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered,
> and it also affects the search: when we search for words like "font", all
> the contents get returned because of this.
>
> Would like to enquire on the following:
> 1. Why didn't Tika get the text part (text/plain)? Is there any way to
> configure Tika in Solr to change the priority to get the text part
> (text/plain) instead of the html part (text/html)?
> 2. If that is not possible, as you can see, the content is not clean, which
> is not right. How can we get this to be clean when Tika is extracting text?
>
> Regards,
> Edwin
>


Re: How to access the Solr Admin GUI

2019-01-01 Thread Terry Steichen
I think a better approach to tunneling would be:

ssh -p <SSH_PORT> -L <LOCAL_PORT>:localhost:8983 use...@myremoteserver.example.com

This requires you to set up a different port (<SSH_PORT>) rather than use the
standard port 22 (on your router and in your sshd config).  I've been
running something like this for about a year and have rarely if ever had
it attacked.  Prior to changing the port, however, I was under
constant hacking attacks - they find port 22 too attractive to ignore.

Also, regarding my use of a local port (<LOCAL_PORT>): if you have the server
running on several local machines (as I do), the use of a non-default local
port may help prevent confusion (as to whether your browser is accessing a
local - defaulted to 8983 - or a remote solr server).

Note: you might find that the ssh connection will drop out after some
inactivity, and need to be restarted occasionally.  Pretty simple to do
- just run the ssh line above again.

Note: I also add authorization controls to the AdminUI (and its functions)
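
For example, with made-up port numbers (2222 for the remote sshd, 8984
locally so it can't be confused with a local Solr on 8983):

    ssh -p 2222 -L 8984:localhost:8983 use...@myremoteserver.example.com

and then browse to http://localhost:8984/solr on the local machine.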


On 1/1/19 1:02 PM, Kay Wrobel wrote:
> You can use ssh to tunnel in.
>
> ssh -L8983:localhost:8983 use...@myremoteserver.example.com
>
> This will only require port 22 to be exposed to the public.
>
>
> Sent from my iPhone
>
>> On Jan 1, 2019, at 11:43 AM, Gus Heck  wrote:
>>
>> Why would you want to expose the administration gui on the web? This is a
>> very hazardous thing to do. Never mind that it normally also runs on 8983
>> and all its functionality relies on the ability to interact with 8983
>> hosted api end points.
>>
>> What are you actually trying to solve?
>>
>> On Dec 31, 2018 6:04 PM, "Jörn Franke"  wrote:
>>
>> Reverse proxy?
>>
>>
>>> Am 31.12.2018 um 22:48 schrieb s...@cid.is:
>>>
>>> Hi all,
>>>
>>> is there a way, or better a solution, to access the Solr Admin GUI from
>> outside the server (via public web) while the Solr port 8983 is closed by a
>> firewall and only available inside the server via localhost?
>>> Thanks in advance
>>> Walter Claassen
>>>
>>> Alexandraweg 32
>>> D 64287 Darmstadt
>>> Fon +49-6151-4937961
>>> Fax +49-6151-4937969
>>> c...@cid.is
>>>


Resolved Authorization Issue

2018-12-31 Thread Terry Steichen
Thanks, Dominique.  This appears to explain a LOT of past confusion.

Terry

On 12/31/18 5:26 AM, Dominique Bejean wrote:
> So in Solr standalone mode, only authentication is fully functional, not
> authorization !


Re: Basic Auth Permission

2018-12-08 Thread Terry Steichen
What Noble Paul says is true: Solr can't - directly - restrict access to
static files.

However, if you set your file repository's permissions to a minimal
level (so, for example, users can't do a directory search), then they
must know the precise name and location of the file they're trying to
retrieve.  And, depending on your system implementation, that
information may be only available via a Solr search result (the access
to which can be restricted).
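
A rough sketch of what I mean by minimal permissions (assuming the
repository lives at /srv/docs):

    find /srv/docs -type d -exec chmod 711 {} +   # traversable, not listable
    find /srv/docs -type f -exec chmod 644 {} +   # readable only via exact path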

Terry Steichen

On 12/8/18 12:06 AM, Noble Paul wrote:
> You can't restrict access to static files.
>
> You can only restrict access to Solr content.
>
> However you can use the "blockUnknown" property in your security.json
> to restrict access to all files
>
> https://lucene.apache.org/solr/guide/7_5/basic-authentication-plugin.html
> --Noble
> On Sat, Jun 9, 2018 at 2:43 AM Antony A  wrote:
>> Hello,
>>
>> I am trying to get the path/params restricted to users of individual
>> collection through Solr UI.
>>
>> Here is the permission that I have for an user.
>>
>> {"collection": "collection_name", "path": "/admin/file", "role": ["
>> collection_user"]}
>>
>> I am still not able to restrict another user from accessing other
>> collection files like solrconfig, solr-data-config etc.
>>
>> Is it possible to define permission at collection-level for this path?
>>
>> Thanks,
>> Antony
>
>


Re: Basic Auth Permission

2018-12-04 Thread Terry Steichen
I think there's been some confusion on which standalone versions support
authentication.  I'm using 6.6 in cloud mode (purely so the
authentication will work).  Some of the documentation seems to say that
only cloud implementations support it, but others (like the experts on
this forum) say that later versions (including yours) support it in
standalone mode.

On 12/4/18 4:14 PM, yydpkm wrote:
> I am using standalone Solr 7.4.0. Are you using cloud or standalone? Not sure
> if that cause the problem or not.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Basic Auth Permission

2018-12-04 Thread Terry Steichen
What Solr version are you using?

On 12/4/18 2:47 PM, yydpkm wrote:
> Thank you for your replay. I use your format and failed. User2 can still
> visit collection "name"
> Could that be because I am using standalone Solr, not SolrCloud?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Basic Auth Permission

2018-12-04 Thread Terry Steichen
In setting his permission, Antony said he set "path": "/admin/file".  I
use "path":"/*" - that may be too restrictive for you, but it works fine
(for me).
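
For reference, a sketch of setting such a permission through the
Authorization API (the credentials, collection and role names here are
placeholders):

    curl --user admin:password http://localhost:8983/solr/admin/authorization \
      -H 'Content-type:application/json' \
      -d '{"set-permission": {"name": "name-access",
                              "collection": "name",
                              "path": "/*",
                              "role": "name_user"}}'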

On 12/4/18 9:55 AM, yydpkm wrote:
> Hi Antony, 
>
> Have you solved this? I am facing the same thing. Other users can still do
> /select after I set the permission path and collection. 
>
> Best,
> Rick
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


RE: Solr OCR Support

2018-11-04 Thread Terry Steichen
+1
My experience is that you can't easily tell ahead of time whether your PDF is 
searchable or not. If it is, you may not even retrieve it because there's no 
text to index.  Also, if you blindly OCR a file that has already been OCR'd, it 
can create a mess.  Most higher end PDF editors have a batch mode to do OCR 
processing, if that works better for you.
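
If a batch PDF editor isn't handy, the open-source ocrmypdf tool can do
much the same thing from a script (a sketch - its --skip-text option
leaves pages that already have text alone, which avoids the double-OCR
mess):

    ocrmypdf --skip-text scanned.pdf searchable.pdf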

On November 4, 2018 5:20:41 PM EST, Phil Scadden  wrote:
>I would strongly consider OCR offline, BEFORE loading the documents
>into Solr. The  advantage of this is that you convert your OCRed PDF
>into searchable PDF. Consider someone using Solr and they have found a
>document that matches their search criteria. Once they retrieve the
>document, they will discover it is has not been OCRed and they cannot
>use a text search within a document. If the document that you are
>feeding Solr is large, then this is major pain. Setting up Tesseract
>(or whatever engine - tesseract involves a bit of a tool chain) to OCR
>and save as searchable PDF, means you can provide a much more useful
>document as the result of Solr search. Feed that searchable PDF to
>SolrJ with OCR turned off.
>
>   PDFParserConfig pdfConfig = new PDFParserConfig();
>   pdfConfig.setExtractInlineImages(false);
>   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
>   context.set(PDFParserConfig.class, pdfConfig);
>   context.set(Parser.class, parser);
>
>-Original Message-
>From: Furkan KAMACI 
>Sent: Saturday, 3 November 2018 03:30
>To: solr-user@lucene.apache.org
>Subject: Solr OCR Support
>
>Hi All,
>
>I want to index images and pdf documents which have images into Solr. I
>test it with my Solr 6.3.0.
>
>I've installed tesseract at my computer (Mac). I verify that Tesseract
>works fine to extract text from an image.
>
>I index image into Solr but it has no content. However, as far as I
>know, I don't need to do anything else to integrate Tesseract with
>Solr.
>
>I've checked these but they were not useful for me:
>
>http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
>http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
>
>My question is, how can I support OCR with Solr?
>
>Kind Regards,
>Furkan KAMACI
>Notice: This email and any attachments are confidential and may not be
>used, published or redistributed without the prior written consent of
>the Institute of Geological and Nuclear Sciences Limited (GNS Science).
>If received in error please destroy and immediately notify GNS Science.
>Do not copy or disclose the contents.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: ManagedIndexSchema Bad version when trying to persist schema

2018-10-11 Thread Terry Steichen
Erick,

I don't get any such message when I start solr - could you share what
that curl command should be?

You suggest modifying solrconfig.xml - could you be more explicit on
what changes to make?

Terry


On 10/11/2018 11:52 AM, Erick Erickson wrote:
> bq: Also why solr updates and persists the managed-schema while ingesting 
> data?
>
> I'd guess you are using "schemaless mode", which is expressly
> recommended _against_ for production systems. See "Schemaless Mode" in
> the reference guide.
>
> I'd disable schemaless mode (when you start Solr there should be a
> message telling you how to disable it via curl, but I'd modify my
> solrconfig.xml file to remove it permanently)
>
> Best,
> Erick
> On Thu, Oct 11, 2018 at 8:02 AM Mikhail Ibraheem
>  wrote:
>> Hi, we upgraded to Solr 7.5. We try to ingest into Solr using SolrJ with
>> concurrent updates (many threads). We are getting this exception:
>>
>> o.a.s.s.ManagedIndexSchema Bad version when trying to persist schema
>> using 1 due to:
>> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
>> BadVersion for /configs/my-core/managed-schema
>> o.a.s.s.ManagedIndexSchema Failed to persist managed schema at
>> /configs/my-core/managed-schema - version mismatch
>>
>> Also, why does Solr update and persist the managed-schema while ingesting
>> data? I believe managed-schema shouldn't be affected by data updates.
>> Thanks
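
For reference, the property behind schemaless mode is
update.autoCreateFields; a sketch of the curl toggle Erick mentions
(collection name is a placeholder):

    curl http://localhost:8983/solr/my-core/config \
      -H 'Content-type:application/json' \
      -d '{"set-user-property": {"update.autoCreateFields": "false"}}'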



Re: Solr JVM Memory settings

2018-10-11 Thread Terry Steichen
Don't know if this directly affects what you're trying to do.  But I
have an 8GB server and when I run "solr status" I can see what % of the
automatic memory allocation is being used.  As it turned out, solr would
occasionally exceed that (and crash).

I then began starting solr with the additional parameter "-m 1g".  Now
the solr consumption is almost always 50% or less, and I have had no
further problems.
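
The same limit can be made permanent in bin/solr.in.sh instead of passing
-m every time (a sketch; pick whatever value fits your box):

    SOLR_HEAP="1g"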


On 10/11/2018 12:08 AM, Sourav Moitra wrote:
> Hello,
>
> We have a Solr server with 8gb of memory. We are using solr in cloud
> mode, solr version is 7.5, Java version is Oracle Java 9, and the settings
> for the Xmx and Xms values are 2g, but we are observing that RAM usage
> reaches 98% when doing indexing.
>
> How can I ensure that SolrCloud doesn't use more than N GB of memory ?
>
> Sourav Moitra
> https://souravmoitra.com
>



Re: Nutch+Solr

2018-10-03 Thread Terry Steichen
Bineesh,

I don't use Nutch, so don't know if this is relevant, but I've had
similar-sounding failures in doing and restoring backups.  The solution
for me was to deactivate authentication while the backup was being done,
and then activate it again afterwards.  Then everything was restored
correctly.  Otherwise, I got a whole bunch of errors (if I left
authentication active when doing the backup).

Terry
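
One scriptable approximation of that on/off switch uses the
Authentication API (this relaxes, rather than fully removes, the auth
requirement; the credentials here are the stock example ones, and recent
Solr versions support set-property):

    # let unauthenticated requests through while the backup/restore runs
    curl --user solr:SolrRocks http://localhost:8983/solr/admin/authentication \
      -H 'Content-type:application/json' \
      -d '{"set-property": {"blockUnknown": false}}'
    # ... back up or restore ...
    curl --user solr:SolrRocks http://localhost:8983/solr/admin/authentication \
      -H 'Content-type:application/json' \
      -d '{"set-property": {"blockUnknown": true}}'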


On 10/03/2018 10:21 AM, Bineesh wrote:
> Hello,
>
> We use Solr 7.3.1 and Nutch 1.15
>
> We've placed the authentication for our solr cloud setup using the basic
> auth plugin ( login details -> solr/SolrRocks)
>
> For Nutch to index data to Solr, the properties below were added to the
> nutch-site.xml file
>
> <property>
>   <name>solr.auth</name>
>   <value>true</value>
>   <description>
>   Whether to enable HTTP basic authentication for communicating with Solr.
>   Use the solr.auth.username and solr.auth.password properties to configure
>   your credentials.
>   </description>
> </property>
>
>
> <property>
>   <name>solr.auth.username</name>
>   <value>solr</value>
>   <description>
>   Username
>   </description>
> </property>
>
>
> <property>
>   <name>solr.auth.password</name>
>   <value>SolrRocks</value>
>   <description>
>   Password
>   </description>
> </property>
>
> While Nutch indexes data to Solr, it's failing due to authentication. Am I
> doing something wrong? Pls help
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
Alex,

Please look at my embedded responses to your questions.

Terry


On 09/26/2018 04:57 PM, Alexandre Rafalovitch wrote:
> The challenge here is to figure out exactly what you are doing,
> because the original description could have been 10 different things.
>
> So:
> 1) You are using bin/post command (we just found this out)
No, I said that at the outset.  And repeated it.
> 2) You are indexing a bunch of files (what format? all same or different?)
I also said I was indexing a mixture of pdf and doc files
> 3) You are indexing them into a Schema supposedly ready for those
> files (which one?)
I'm using the managed-schema, the data-driven approach
> 4) You think some of them are not in in Solr (how do you know that?
> how do you know that some are? why do you not know _which_ of the
> files are not indexed?)
I thought I made it very clear (twice) that I find that the list of
indexed files is 10% fewer than those in the directory holding the files
being indexed.  And I said that I don't know which are not getting
indexed because I am not getting error messages.
> 5) You are asking whether the error message should have told you if
> there is a problem with indexing (normally yes, but maybe there are
> some edge cases).
That's my question - why am I not getting error messages.  That's the
whole point of my query to the list.
>
> I've put the questions in brackets. I would focus on looking at
> questions in 4) first as they roughly bisect the problem. But other
> things are important too.
>
> I hope this helps,
> Alex.
>
>
> On 26 September 2018 at 16:39, Terry Steichen  wrote:
>> Shawn,
>>
>> To the best of my knowledge, I'm not using SolrJ at all.  Just
>> Solr-out-of-the-box.  In this case, if I understand you below, it
>> "should indicate an error status"
>>
>> But it doesn't.
>>
>> Let me try to clarify a bit - I'm just using bin/post to index the files
>> in a directory.  That indexing process produces a lengthy screen display
>> of files that were indexed.  (I realize this isn't production-quality,
>> but I'm not ready for production just yet, so that should be OK.)
>>
>> But no errors are shown (even though there have to be, because the total
>> indexed is less than the directory total).
>>
>> Are you saying I can't use post (to verify correct indexing), but that I
>> have to write custom software to accomplish that?
>>
>> And that there's no solr variable I can define that will do a kind of
>> "verbose" to show that?
>>
>> And that such errors will not show up in any of solr's log files?
>>
>> Hard to believe (but what is, is, I guess).
>>
>> Terry
>>
>> On 09/26/2018 03:49 PM, Shawn Heisey wrote:
>>> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>>>> I'm pretty sure this was covered earlier.  But I can't find references
>>>> to it.  The question is how to make indexing errors clear and obvious.
>>> If there's an indexing error and you're NOT using the concurrent
>>> client in SolrJ, the response that Solr returns should indicate an
>>> error status.  ConcurrentUpdateSolrClient gets those errors and
>>> swallows them so the calling program never knows they occurred.
>>>
>>>> (I find that there are maybe 10% more files in a directory than end up
>>>> in the index.  I presume they were indexing errors, but I have no idea
>>>> which ones or what might have caused the error.)  As I recall, Solr's
>>>> post tool doesn't give any errors when indexing.  I (vaguely) recall
>>>> that there's a way (through the logs?) to overcome this and show the
>>>> errors.  Or maybe it's that you have to do the indexing outside of Solr?
>>> The simple post tool is not really meant for production use.  It is a
>>> simple tool for interactive testing.
>>>
>>> I don't see anything in SimplePostTool for changing the program's exit
>>> status when an error is encountered during program operation.  If an
>>> error is encountered during the upload, a message would be logged to
>>> stderr, but you wouldn't be able to rely on the program's exit status
>>> to indicate an error.  To get that, you will need to write the
>>> indexing software.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>
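
Absent better tooling, a crude way to spot silently skipped files is to
keep bin/post's error output and compare counts (paths and collection
name are placeholders):

    find /data/docs -type f | wc -l                      # files on disk
    bin/post -c mycoll /data/docs 2> post-errors.log     # capture stderr
    curl 'http://localhost:8983/solr/mycoll/select?q=*:*&rows=0'   # numFound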



Re: Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
Shawn,

To the best of my knowledge, I'm not using SolrJ at all.  Just
Solr-out-of-the-box.  In this case, if I understand you below, it
"should indicate an error status" 

But it doesn't.

Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory.  That indexing process produces a lengthy screen display
of files that were indexed.  (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)

But no errors are shown (even though there have to be, because the total
indexed is less than the directory total).

Are you saying I can't use post (to verify correct indexing), but that I
have to write custom software to accomplish that? 

And that there's no solr variable I can define that will do a kind of
"verbose" to show that?

And that such errors will not show up in any of solr's log files?

Hard to believe (but what is, is, I guess).

Terry

On 09/26/2018 03:49 PM, Shawn Heisey wrote:
> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>> I'm pretty sure this was covered earlier.  But I can't find references
>> to it.  The question is how to make indexing errors clear and obvious.
>
> If there's an indexing error and you're NOT using the concurrent
> client in SolrJ, the response that Solr returns should indicate an
> error status.  ConcurrentUpdateSolrClient gets those errors and
> swallows them so the calling program never knows they occurred.
>
>> (I find that there are maybe 10% more files in a directory than end up
>> in the index.  I presume they were indexing errors, but I have no idea
>> which ones or what might have caused the error.)  As I recall, Solr's
>> post tool doesn't give any errors when indexing.  I (vaguely) recall
>> that there's a way (through the logs?) to overcome this and show the
>> errors.  Or maybe it's that you have to do the indexing outside of Solr?
>
> The simple post tool is not really meant for production use.  It is a
> simple tool for interactive testing.
>
> I don't see anything in SimplePostTool for changing the program's exit
> status when an error is encountered during program operation.  If an
> error is encountered during the upload, a message would be logged to
> stderr, but you wouldn't be able to rely on the program's exit status
> to indicate an error.  To get that, you will need to write the
> indexing software.
>
> Thanks,
> Shawn
>
>



Making Solr Indexing Errors Visible

2018-09-26 Thread Terry Steichen
I'm pretty sure this was covered earlier.  But I can't find references
to it.  The question is how to make indexing errors clear and obvious. 
(I find that there are maybe 10% more files in a directory than end up
in the index.  I presume they were indexing errors, but I have no idea
which ones or what might have caused the error.)  As I recall, Solr's
post tool doesn't give any errors when indexing.  I (vaguely) recall
that there's a way (through the logs?) to overcome this and show the
errors.  Or maybe it's that you have to do the indexing outside of Solr?

Terry Steichen


Re: copy field

2018-07-12 Thread Terry Steichen
Gus,

Perhaps you might try the technique described in the forwarded exchange
below.  It has been working very nicely for me.

Terry


 Forwarded Message 
Subject:Re: Changing Field Assignments
Date:   Tue, 12 Jun 2018 12:21:16 +0900
From:   Yasufumi Mizoguchi 
Reply-To:   solr-user@lucene.apache.org
To: solr-user@lucene.apache.org



Hi,

You can do that via adding the following lines in managed-schema.

  <fieldType name="date_range" class="solr.DateRangeField"/>
  <field name="meta_creation_date_range" type="date_range" indexed="true" stored="true"/>
  <copyField source="meta_creation_date" dest="meta_creation_date_range"/>

After adding the above and re-indexing docs, you will get a result like
the following.

{ "responseHeader":{ "status":0, "QTime":0, "params":{ "q":"*:*", "indent":
"on", "wt":"json", "_":"1528772599296"}}, "response":{"numFound":2,"start":0
,"docs":[ { "id":"test2", "meta_creation_date":["2018-04-30T00:00:00Z"], "
meta_creation_date_range":"2018-04-30T00:00:00Z", "_version_":
1603034044781559808}, { "id":"test", "meta_creation_date":[
"1944-04-02T00:00:00Z"], "meta_creation_date_range":"1944-04-02T00:00:00Z",
"_version_":1603034283921899520}] }}

thanks,
Yasufumi


Jun 12, 2018 (Tue) 5:04 Terry Steichen :

> I am using Solr (6.6.0) in the automatic mode (where it discovers
> fields).  It's working fine with one exception.  The problem is that
> the discovered "meta_creation_date" field is assigned the type
> TrieDateField.
>
> Unfortunately, that type is limited in a number of ways (like sorting,
> abbreviated forms and etc.).  What I'd like to do is have that
> ("meta_creation_date") field assigned to a different type, like
> DateRangeField.
>
> Is it possible to accomplish this (during indexing) by creating a copy
> field to a different type, and using the copy field in the query?  Or
> via some kind of function operation (which I've never understood)?
>
>


On 07/12/2018 02:43 PM, Gus Heck wrote:
> XY question notwithstanding, this is exactly the sort of thing one might
> want to do in their indexing pipeline. For example:
>
> https://github.com/nsoft/jesterj/blob/master/code/ingest/src/main/java/org/jesterj/ingest/processors/SimpleDateTimeReformatter.java
>
> On Thu, Jul 12, 2018 at 1:34 PM, Erick Erickson 
> wrote:
>
>> This seems like an XY problem, you've asked how to do X without
>> explaining _why_ (the Y).
>>
>> If this is just because you want to search the field without having
>> to specify the full string, consider a DateRangeField.
>>
>> Best,
>> Erick
>>
>> On Thu, Jul 12, 2018 at 10:22 AM, Anil  wrote:
>>> HI,
>>>
>>> I have a date field which needs to be copied to a different field with a
>>> different format/value. Is there any way to achieve this using copy
>>> field? Or does it need to be done when creating the Solr document itself?
>>>
>>> Let's say createdDate is 10-23-2017 10:15:00; it needs to be copied to
>>> the transformedDate field as 10-23-2017.
>>>
>>> please help. thanks.
>>>
>>> Regards,
>>> Anil
>
>



Re: Regarding pdf indexing issue

2018-07-11 Thread Terry Steichen
Walter,

Well said.  (And I love the hamburger conversion analogy - very apt.)

The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries to respect
internal structures within the documents.  If all/most of your documents
have a unique line like "subject:", you might be able to be selective.

Also, if your documents are organized on disk in some categorical way,
you can include in your query, a reference to that categorical
information (via the id:*pattern* field).

Finally, there *might* be useful information in the metadata that you
can use in refining your searches.

Terry


On 07/11/2018 11:42 AM, Walter Underwood wrote:
> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a 
> little bit. Extracting paragraphs from that is a difficult pattern recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the 
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between 
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 11, 2018, at 8:07 AM, Erick Erickson  wrote:
>>
>> Solr will not do this automatically, the Extracting Request Handler
>> simply indexes the entire contents of the doc without regard to things
>> like paragraphs etc. Ditto with HTML. This is actually a task that
>> requires getting into Tika and using all the bells and whistles there.
>>
>> I'd recommend two things:
>>
>> 1> Take the PDF parsing offline, i.e. in a separate client. There are
>> many reasons for this, in particular you can attempt to do what you're
>> asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> 2> Talk to the Tika folks about the best ways to make Tika return the
>> information such that you can index them and get what you'd like.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
>>  wrote:
>>> Hello Team,
>>>
>>> I am using Solr for indexing and searching pdf documents.
>>>
>>> I have gone through your website documentation and installed Solr, but I am
>>> unable to index and search the documents.
>>>
>>> For example: Suppose we have a PDF file which has a number of paragraphs,
>>> each with a separate heading.
>>>
>>> So if I search for a heading in the indexed pdf, the result should contain
>>> the paragraph to which the heading belongs.
>>>
>>> I am unable to perform this task.
>>>
>>> I have run the below command to upload the pdf:
>>>
>>> bin/post -c gettingstarted pdf-sample.pdf
>>>
>>> and for searching I am running the command:
>>>
>>> curl http://localhost:8983/solr/gettingstarted/select?q='
>>
>>> Please suggest me anything and let me know if I am missing anything
>>>
>>> Thanks,
>>>
>>> Rahul
>



Re: Solr basic auth

2018-06-15 Thread Terry Steichen
"When authentication is enabled ALL requests must carry valid
credentials."  I believe this behavior depends on the value you set for
the *blockUnknown* authentication parameter.


On 06/15/2018 06:25 AM, Jan Høydahl wrote:
> When authentication is enabled ALL requests must carry valid credentials.
>
> Are you asking for a feature where a request is authenticated based on IP 
> address of the client, not username/password?
>
> Jan
>
> Sendt fra min iPhone
>
>> 14. jun. 2018 kl. 22:24 skrev Dinesh Sundaram :
>>
>> Hi,
>>
>> I have configured basic auth for SolrCloud. It works well when I access the
>> Solr url directly. I have integrated this Solr with the test.com domain. Now if
>> I access the Solr url like test.com/solr, it prompts for credentials, but I
>> don't want it to prompt this time since it is a known domain. Is there any way to
>> achieve this? Much appreciate your quick response.
>>
>> My security.json is below. I'm using the default security; I want to allow my
>> domain through by default without prompting for any credentials.
>>
>> {"authentication":{
>>   "blockUnknown": true,
>>   "class":"solr.BasicAuthPlugin",
>>   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
>> Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
>> },"authorization":{
>>   "class":"solr.RuleBasedAuthorizationPlugin",
>>   "permissions":[{"name":"security-edit",
>>  "role":"admin"}],
>>   "user-role":{"solr":"admin"}
>> }}



Re: Changing Field Assignments

2018-06-14 Thread Terry Steichen
Shawn,

I don't disagree at all, but have a basic question: How do you easily
transition from a system using a dynamic schema to one using a fixed one?

I'm running 6.6.0 in cloud mode (only because it's necessary, as I
understand it, to be in cloud mode for the authentication/authorization
to work).  In my server/solr/configsets subdirectory there are
directories "data_driven_schema_configs" and "basic_configs".  Both
contain a file named "managed_schema."  Which one is the active one?

From the AdminUI, each collection has an associated "managed_schema"
(under the "Files" option).  I'm guessing that this collection-specific
managed_schema is the result of the automated field discovery process,
presumably using some baseline version (in configsets) to start with.

If that's true, then it would presumably make sense to save this
collection-specific managed_schema to disk as schema.xml.  I further
presume I'd create a config subdirectory for each of said collections
and put schema.xml there.  Is that right?

And I have to do this for each collection, right?

Every time I read (and reread, and reread, ...) the Solr docs they seem
to be making certain (very basic) assumptions that I'm unclear about, so
your help in the preceding would be most appreciated.

Thanks.

Terry


On 06/14/2018 01:51 PM, Shawn Heisey wrote:
> On 6/11/2018 2:02 PM, Terry Steichen wrote:
>> I am using Solr (6.6.0) in the automatic mode (where it discovers
>> fields).  It's working fine with one exception.  The problem is that
>> the discovered "meta_creation_date" field is assigned the type
>> TrieDateField. 
>>
>> Unfortunately, that type is limited in a number of ways (like sorting,
>> abbreviated forms and etc.).  What I'd like to do is have that
>> ("meta_creation_date") field assigned to a different type, like
>> DateRangeField. 
>>
>> Is it possible to accomplish this (during indexing) by creating a copy
>> field to a different type, and using the copy field in the query?  Or
>> via some kind of function operation (which I've never understood)?
> What you are describing is precisely why I never use the mode where Solr
> automatically adds unknown fields.
>
> If the field does not exist in the schema before you index the document,
> then the best Solr can do is precisely what is configured in the update
> processor that adds unknown fields.  You can adjust that config, but it
> will always be a general purpose guess.
>
> What is actually needed for multiple unknown fields is often outside
> what that update processor is capable of detecting and configuring
> automatically.  For that reason, I set up the schema manually, and I
> want indexing to fail if the input documents contain fields that I
> haven't defined.  Then whoever is doing the indexing can contact me with
> their error details, and I can add new fields with the exact required
> definition.
>
> Thanks,
> Shawn
>
>



Changing Field Assignments

2018-06-11 Thread Terry Steichen
I am using Solr (6.6.0) in the automatic mode (where it discovers
fields).  It's working fine with one exception.  The problem is that
the discovered "meta_creation_date" field is assigned the type
TrieDateField. 

Unfortunately, that type is limited in a number of ways (like sorting,
abbreviated forms and etc.).  What I'd like to do is have that
("meta_creation_date") field assigned to a different type, like
DateRangeField. 

Is it possible to accomplish this (during indexing) by creating a copy
field to a different type, and using the copy field in the query?  Or
via some kind of function operation (which I've never understood)?



Date Query Confusion

2018-05-17 Thread Terry Steichen
To me, one of the more frustrating things I've encountered in Solr is
working with date fields.  Supposedly, according to the documentation,
this is straightforward.  But in my experience, it is anything but
that.  In particular, I've found that the abbreviated forms of date
queries don't work as described.

If I create a query like creation_date:[2016-10-01 TO 2016-11-01], it
will produce the set of documents created in the month of October 2016. 
That's the good news.

But the abbreviated date queries (described in the Solr documentation)
don't work.  Tried creation_date:2016-11.  That's supposed to match
documents with any November 2016 date.  But it actually produces: 
"Invalid Date String:'2016-11'"

And Solr doesn't seem to let me sort on a date field.  Tried
"creation_date asc".  Produced: "can not sort on multivalued field:
creation_date"

In the AdminUI, if you go to the schema option for my collection, and
examine creation_date, it shows it to be:
org.apache.solr.schema.TrieDateField  (This was automatically chosen by
the managed-schema)

In that same AdminUI display, if I click "Load Term Info" I get a list
of dates, but when I click on one, it transforms it into a different
query form: {!term f=creation_date}2016-10-26T07:59:09.824Z  But this
query still produces 0 hits (even though the listing says it should
produce dozens of hits).

I imagine that I'm missing something basic here.  But I have no idea
what.  Any thoughts would be MOST welcome.

PS: I'm using Solr 6.6.0.
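
For what it's worth, the abbreviated forms (creation_date:2016-11) only
work against a DateRangeField; a TrieDateField needs full ISO instants,
e.g.:

    creation_date:[2016-10-01T00:00:00Z TO 2016-11-01T00:00:00Z]

and sorting needs a single-valued field (or, in recent versions, a
function sort such as field(creation_date,min)).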


Re: Techniques for Retrieving Hits

2018-05-14 Thread Terry Steichen
Shawn,

As noted in my embedded comments below, I don't really see the problem
you apparently do. 

Maybe I'm missing something important (which certainly wouldn't  be the
first - or last -  time that happened).

I posted this note because I've not seen list comments pertaining to the
job of actually locating and retrieving hitlist documents. 

My way "seems" to work, and it is quite simple and compact.  I just
threw it out seeking a sanity check from others.

Terry


On 05/14/2018 11:32 AM, Shawn Heisey wrote:
> On 5/14/2018 6:46 AM, Terry Steichen wrote:
>> In order to allow users to retrieve the documents that match a query, I
>> make use of the embedded Jetty container to provide file server
>> functionality.  To make this happen, I provide a symbolic link between
>> the actual document archive, and the Jetty file server.  This seems
>> somewhat of a kludge, and I'm wondering if others have a better way to
>> retrieve the desired documents?  (I'm not too concerned about security
>> because I use ssh port forwarding to connect to remote authenticated
>> clients.)
>
> This is not a recommended usage for the servlet container where Solr
> runs.
But if the retrieval traffic is light, what's the problem?
>
> Solr is a search engine.  It is not designed to be a data store,
> although some people do use it that way.
Perhaps I didn't explain it right, but I'm not using it as a datastore
(other than the fact that I keep the actual file repository on the same
machine on which Solr runs).  I've got plenty of storage, so that's not
an issue, and, as I mentioned above, traffic is quite light.
>
> If systems running Solr clients want to access all the information for
> a document when the search results do not contain all the information,
> they should use what IS in the search results to access that data from
> the system where it is stored -- that could be a database, a file
> server, a webserver, or similar.
Perhaps I'm missing something, but search results cannot "contain all
the information" can they?  I use highlighting but that's just showing a
few snippets - not a substitute for the document itself.
>
> Thanks,
> Shawn
>
>



Techniques for Retrieving Hits

2018-05-14 Thread Terry Steichen
In order to allow users to retrieve the documents that match a query, I
make use of the embedded Jetty container to provide file server
functionality.  To make this happen, I provide a symbolic link between
the actual document archive, and the Jetty file server.  This seems
somewhat of a kludge, and I'm wondering if others have a better way to
retrieve the desired documents?  (I'm not too concerned about security
because I use ssh port forwarding to connect to remote authenticated
clients.)
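
For the curious, the symbolic-link kludge amounts to something like this
(install and repository paths are just examples):

    ln -s /srv/docs /opt/solr/server/solr-webapp/webapp/docs

so that a hit's file can be fetched from the same Jetty that serves Solr.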



Re: Specialized Solr Application

2018-04-19 Thread Terry Steichen
Thanks, Tim.  A couple of quick comments and a couple of questions:

1) the toughest pdfs to identify are those that are partly
searchable (text) and partly not (image-based text).  However, I've
found that such documents tend to exist in clusters.

    2) email documents (.eml) are no problem, provided "-filetypes eml"
    is included in the indexing command.  Otherwise the indexing is not
    recursive and you'll completely (and silently) miss all such
    documents in lower subdirectories.

3) I have indexed other repositories and noticed some silent
failures (mostly for large .doc documents).  Wish there was some way
to log these errors so it would be obvious what documents have been
excluded.

    4) I still don't understand the use of tika-eval - is that an
    application that you run against a collection or what?

5) I've seen reference to tika-server - but I have no idea on how
that tool might be usefully applied.

6) Adobe Acrobat Pro apparently has a batch mode suitable for
flagging unsearchable (that is, image-based) pdf files and fixing them.

7) Another problem I've encountered is documents that are themselves
a composite of other documents (like an email thread).  The problem
is that a hit on such a document doesn't tell you much about the
true relevance of each contained document.  You have to do a
laborious manual search to figure it out.

8) Is there a way to return the size of a matching document (which,
I think, would help identify non-searchable/image documents)?

Regards,

Terry




On 04/18/2018 12:50 PM, Allison, Timothy B. wrote:
> To be Waldorf to Erick's Statler (if I may), lots of things can go wrong 
> during content extraction.[1]  I had two big concerns when I heard of your 
> task:
>
>
>
> 1) image only pdfs, which can parse without problem, but which might yield 0 
> content.
>
> 2) emails (see, e.g. SOLR-12048)
>
>
>
> It sounds like you're taking care of 1), and 2) doesn't apply because you're 
> using Tika (although note that we've made some major changes to our RFC822 
> parsing in the upcoming Tika 1.18).  So, no need to read further! 
>
>
>
> In general, surprising things can happen during the content extraction phase, 
> and unless you are monitoring/measuring/evaluating what's extracted, your 
> search system can yield results that are downright dangerous if you assume 
> that the full stack is actually working.
>
>
>
> I worked with one batch of documents where HALF of the Excel files weren't 
> being parsed.  They all had the same quirk which caused an exception in POI, 
> and because they were inside zip files, and Tika's legacy/default behavior is 
> to silently ignore embedded exceptions -- the owners of the search system had 
> _no idea_ that they'd never be able to find those documents.  At one point, 
> Tika wasn't extracting sdt form fields in docx or form fields in pdf...at 
> all...imagine if your document set was a bunch docx with sdts or pdfs with 
> form fields...  We just fixed a bug to pull text from joined shapes in 
> ppt...we've been missing that text for years!
>
>
>
> Those are a few horror stories, I have many, and there are countless more yet 
> to be discovered!
>
>
>
> The goal of tika-eval[2] is to allow you to see if things don't look right 
> based on your expectations.[3]  It doesn't help with indexing at all per se, 
> but it can allow you to see odd things and 1) change your processing pipeline 
> (add OCR where necessary or use an alternate parser for some file formats) or 
> 2) raise an issue to fix bugs in the content extraction libraries, or at 
> least 3) recognize that you aren't getting reliable content out of ~x% of 
> your documents.  If manually checking PDFs to determine whether or not to run 
> OCR is a hassle, run tika-eval and identify those docs that have a low word 
> count/page ratio.
>
>
>
> Couple of handfuls of Welsh documents; I thought we only had English...what?! 
>  No, that's just bad content extraction (character mapping failure in the PDF 
> or other mojibake).  Average token length in this document is 1, and it is 
> supposed to be English...what?  No, that's the spacing problem that Erick 
> Mentioned.  Average words per page in some pdfs = 2?  No, that's an 
> image-only pdf...that needs to go through OCR.  Ratio of out of vocabulary 
> words = 90%...no that's character encoding mojibake.
>
>
>
>
>
>> I was recently indexing a set of about
> 13,000 documents and at one point, a document caused solr to crash.  I had to 
> restart it.  I removed the offending document, and restarted the indexing.  
> It then eventually happened again, so I did the same thing.
>
>
>
> Crash, crash like OOM?  If you're able to share that with Tika or PDFBox, we 
> can _try_ to fix the underlying bug if there is one.  Sometimes, though, our 
> parsers require far more memory than is ideal. 
>
>
>
> If you have questions about tika-eval, please ask 
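
Regarding tika-eval usage (question 4 above): it's a standalone runnable
jar from the Tika project. A minimal sketch, with the version and paths
as placeholders:

    java -jar tika-eval-1.19.jar Profile -extracts /path/to/extracts -db profile_db

It profiles a directory of extracted text and writes tables - tokens per
page, out-of-vocabulary ratios, detected languages - that you can then query.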

Re: Specialized Solr Application

2018-04-18 Thread Terry Steichen
Thanks, Erick.  What I don't understand is that "rich text documents" (aka
PDF and DOC) lack any internal structure (unlike JSON, XML, etc.), so
there's not much potential in trying to get really precise in parsing
them.  Or am I overlooking something here?

And, as you say, the metadata of such documents is somewhat variable
(some PDFs have a field and others don't), which suggests that you may
not want the parser to be rigid.

Moreover, as I noted earlier, most of the metadata fields of such
documents seem to be of little value (since many document authors are
not consistent in creating that information). 

I take your point about non-optimum Tika workload distribution - but I
am only occasionally doing indexing so I don't think that would be a
significant factor (for me, at least).

A point of possible interest: I was recently indexing a set of about
13,000 documents and at one point, a document caused solr to crash.  I
had to restart it.  I removed the offending document, and restarted the
indexing.  It then eventually happened again, so I did the same thing. 
It then completed indexing successfully.  IOW, out of 13,000 documents
there were two that caused a crash, but once they were removed, the
other 12,998 were parsed/indexed fine.

On OCRs, I presume you're referring to PDFs that are images?  Part of
our team uses Acrobat Pro to screen and convert such documents (which
are very common in legal circles) so they can be searched.  Or did you
mean something else?

Thanks for the insights.  And the long answers (from you, Tim and
Charlie).  These are helping me (and I hope others on the list) to
better understand some of the nuances of effectively implementing
(small-scale) solr.


On 04/17/2018 10:35 PM, Erick Erickson wrote:
> Terry:
>
> Tika has a horrible problem to deal with and it's approaching a
> miracle that it does so well ;)
>
> Let's take a PDF file. Which vendor's version? From what _decade_? Did
> that vendor adhere
> to the spec? Every spec has gray areas so even good-faith efforts can
> result in some version/vendor
> behaving slightly differently from the other.
>
> And what about Word .vs. PDF? One might have "last_modified" and the
> other might have
> "last_edited" to mean the same thing. You mentioned that you're aware
> of this, you can make
> it more useful if you have finer-grained control over the ETL process.
>
> You say "As I understand it, Tika is integrated with Solr"  which is
> correct, you're talking about
> the "Extracting Request Handler". However that has a couple of
> important caveats:
>
> 1> It does the best it can. But Tika has a _lot_ of tuning options
> that allow you to get down-and-dirty
> with the data you're indexing. You mentioned that precision is
> important. You can do some interesting
> things with extracting specific fields from specific kinds of
> documents and making use of them. The
> "last_modified" and "last_edited" fields above are an example.
>
> 2> It loads the work on a single Solr node. So the very expensive
> process of extracting data from the
> semi-structure document is all on the Solr node. If you use Tika in a
> client-side program you can
> parallelize the extraction and get through your indexing much more quickly.
>
> 3> Tika can occasionally get its knickers in a knot over some
> particular document. That'll also bring
> down the Solr instance.
>
> Here's a blog that can get you started doing client-side parsing,
> ignore the RDBMS bits.
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> I'll leave Tim to talk about tika-eval ;) But the general problem is
> that the extraction process can
> result in garbage, lots of garbage. OCR is particularly prone to
> nonsense. PDFs can be tricky,
> there's this spacing parameter that, depending on it's setting can
> render e r i c k as 5 separate
> letters or my name.
>
> Hey, you asked! Don't complain about long answers ;)
>
> Best,
> Erick
>
> On Tue, Apr 17, 2018 at 1:50 PM, Terry Steichen <te...@net-frame.com> wrote:
>> Hi Timothy,
>>
>> As I understand it, Tika is integrated with Solr.  All my indexed
>> documents declare that they've been parsed by Tika.  For the eml files
>> it's org.apache.tika.parser.mail.RFC822Parser.  Word docs show they
>> were parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser.  PDF
>> files show org.apache.tika.parser.pdf.PDFParser.
>>
>> What do you mean by improving the output with "tika-eval?"  I confess I
>> don't completely understand how documents should be prepared for
>> indexing.  But with the eml docs, solr/tika seems to properly pull out
>> things l

Re: Specialized Solr Application

2018-04-17 Thread Terry Steichen
Hi Timothy,

As I understand it, Tika is integrated with Solr.  All my indexed
documents declare that they've been parsed by Tika.  For the eml files
it's org.apache.tika.parser.mail.RFC822Parser.  Word docs show they
were parsed by org.apache.tika.parser.microsoft.ooxml.OOXMLParser.  PDF
files show org.apache.tika.parser.pdf.PDFParser.

What do you mean by improving the output with "tika-eval?"  I confess I
don't completely understand how documents should be prepared for
indexing.  But with the eml docs, solr/tika seems to properly pull out
things like date, subject, to and from.  For other (so-called 'rich text')
documents (like pdfs and Word-type), the metadata is not so useful, but
on the other hand, there's not much consistent structure to the
documents I have to deal with.

I may be missing something - am I?

Regards,

Terry


On 04/17/2018 09:38 AM, Allison, Timothy B. wrote:
> +1 to Charlie's guidance.
>
> And...
>
>> 60,000 documents, mostly pdfs and emails.
>> However, there's a premium on precision (and recall) in searches.
> Please, oh, please, no matter what you're using for content/text extraction 
> and/or OCR, run tika-eval[1] on the output to ensure that you are 
> getting mostly language-y content out of your documents.  Ping us on the Tika 
> user's list if you have any questions.
>
> Bad text, bad search. 
>
> [1] https://wiki.apache.org/tika/TikaEval
>
> -Original Message-
> From: Charlie Hull [mailto:char...@flax.co.uk] 
> Sent: Tuesday, April 17, 2018 4:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Specialized Solr Application
>
> On 16/04/2018 19:48, Terry Steichen wrote:
>> I have from time-to-time posted questions to this list (and received 
>> very prompt and helpful responses).  But it seems that many of you are 
>> operating in a very different space from me.  The problems (and
>> lessons-learned) which I encounter are often very different from those 
>> that are reflected in exchanges with most other participants.
> Hi Terry,
>
> Sounds like a fascinating use case. We have some similar clients - small 
> scale law firms and publishers - who have taken advantage of Solr.
>
> One thing I would encourage you to do is to blog and/or talk about what 
> you've built. Lucene Revolution is worth applying to talk at and if you do 
> manage to get accepted - or if you go anyway - you'll meet lots of others 
> with similar challenges and come away with a huge amount of useful 
> information and contacts. Otherwise there are lots of smaller Meetup events 
> (we run the London, UK one).
>
> Don't assume just because some people here are describing their 350 billion 
> document learning-to-rank clustered monster that the small applications don't 
> matter - they really do, and the fact that they're possible to build at all 
> is a testament to the open source model and how we share information and tips.
>
> Cheers
>
> Charlie
>> So I thought it would be useful to describe what I'm about, and see if 
>> there are others out there with similar implementations (or interest 
>> in moving in that direction).  A sort of pay-forward.
>>
>> We (the Lakota Peoples Law Office) are a small public interest, pro 
>> bono law firm actively engaged in defending Native American North 
>> Dakota Water Protector clients against (ridiculously excessive) criminal 
>> charges.
>>
>> I have a small Solr (6.6.0) implementation - just one shard.  I'm 
>> using the cloud mode mainly to be able to implement access controls.  
>> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 
>> 8GB of RAM and 4 cpu processors.  We presently have 8 collections with 
>> a total of about 60,000 documents, mostly pdfs and emails.  The 
>> indexed documents are partly our own files and partly those we obtain 
>> through legal discovery (which, surprisingly, is allowed in ND for 
>> criminal cases).  We only have a few users (our lawyers and a couple 
>> of researchers mostly), so traffic is minimal.  However, there's a 
>> premium on precision (and recall) in searches.
>>
>> The document repository is local to the server.  I piggyback on the 
>> embedded Jetty httpd in order to serve files (selected from the 
>> hitlists).  I just use a symbolic link to tie the repository to 
>> Solr/Jetty's "webapp" subdirectory.
>>
>> We provide remote access via ssh with port forwarding.  It provides 
>> very snappy performance, with fully encrypted links.  Appears quite stable.
>>
>> I've had some bizarre behavior apparently caused by an interaction 
>> between repository permissions, solr permissions and the ssh link.  I 

Specialized Solr Application

2018-04-16 Thread Terry Steichen
I have from time-to-time posted questions to this list (and received
very prompt and helpful responses).  But it seems that many of you are
operating in a very different space from me.  The problems (and
lessons-learned) which I encounter are often very different from those
that are reflected in exchanges with most other participants.

So I thought it would be useful to describe what I'm about, and see if
there are others out there with similar implementations (or interest in
moving in that direction).  A sort of pay-forward.

We (the Lakota Peoples Law Office) are a small public interest, pro bono
law firm actively engaged in defending Native American North Dakota
Water Protector clients against (ridiculously excessive) criminal charges. 

I have a small Solr (6.6.0) implementation - just one shard.  I'm using
the cloud mode mainly to be able to implement access controls.  The
server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 8GB of
RAM and 4 cpu processors.  We presently have 8 collections with a total
of about 60,000 documents, mostly pdfs and emails.  The indexed
documents are partly our own files and partly those we obtain through
legal discovery (which, surprisingly, is allowed in ND for criminal
cases).  We only have a few users (our lawyers and a couple of
researchers mostly), so traffic is minimal.  However, there's a premium
on precision (and recall) in searches. 

The document repository is local to the server.  I piggyback on the
embedded Jetty httpd in order to serve files (selected from the
hitlists).  I just use a symbolic link to tie the repository to
Solr/Jetty's "webapp" subdirectory.

We provide remote access via ssh with port forwarding.  It provides very
snappy performance, with fully encrypted links.  Appears quite stable. 

I've had some bizarre behavior apparently caused by an interaction
between repository permissions, solr permissions and the ssh link.  It
seems "solved" for the moment, but time will tell for how long.

If there are any folks out there who have similar requirements, I'd be
more than happy to share the insights I've gained and problems I've
encountered and (I think) overcome.  There are so many unique parts of
this small scale, specialized application (many dimensions of which are
not strictly internal to Solr) that it probably won't be appreciated to
dump them on this (excellent) Solr list.  So, if you encounter problems
peculiar to this kind of setup, we can perhaps help handle them off-list
(although if they have more general Solr application, we should, of
course, post them to the list).

Terry Steichen



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Terry Steichen
OK, I guess this means this change has been included in 7.3.0.  I really
appreciate what all of the committers do, so please don't take this the
wrong way.

Even with this and the preceding comment, I find it difficult to clearly
follow these changes.  Perhaps, as Shawn suggests, any such
consolidation and/or early release might be reflected back in the
original change (11622).

Anyway, I'm a happy camper now.  Thanks to all.


On 04/05/2018 11:37 AM, Shawn Heisey wrote:
> On 4/5/2018 9:05 AM, Terry Steichen wrote:
>> I'm a bit confused because of the issue I was concerned about earlier:
>> https://issues.apache.org/jira/browse/SOLR-11622
>> It was supposed to be fixed and included in (the then-future) 7.3, but I
>> don't see it there in the listed 7.3.0 changes/bug-fixes.
>> Am I missing something?
>
> One of the final comments in that issue says "Fixed as part of
> SOLR-11701".  That issue is listed in the CHANGES.txt.
>
> Perhaps the changelog entry for SOLR-11701 should have mentioned any
> other issues that were also fixed by the commit.  In Erick's defense,
> I'll say this:  Making sure that everything for one issue gets handled
> correctly in a decent timeframe can be a little overwhelming.  Details
> like the fact that the commit for one issue also solves another issue
> are easy to miss until later.
>
> Thanks,
> Shawn
>
>



Re: [ANNOUNCE] Solr Reference Guide for Solr 7.3 released

2018-04-05 Thread Terry Steichen
I'm a bit confused because of the issue I was concerned about earlier: 
https://issues.apache.org/jira/browse/SOLR-11622
It was supposed to be fixed and included in (the then-future) 7.3, but I
don't see it there in the listed 7.3.0 changes/bug-fixes.
Am I missing something?


On 04/05/2018 10:05 AM, Cassandra Targett wrote:
> The Lucene PMC is pleased to announce that the Solr Reference Guide for
> Solr 7.3 is now available.
>
> This 1,295 page PDF is the definitive guide to using Apache Solr, the
> search server built on Apache Lucene.
>
> The PDF Guide can be downloaded from:
> https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-7.3.pdf
>
> It is also available online at https://lucene.apache.org/solr/guide/7_3.
>



Re: Resetting Authentication/Authorization

2018-03-30 Thread Terry Steichen

On 03/29/2018 11:07 PM, Shawn Heisey wrote:
> On 3/29/2018 8:28 PM, Terry Steichen wrote:
>> When I set up the initial authentications and authorizations (I'm using
>> 6.6.0 and running in cloud mode.), I call "bin/solr auth enable
>> -credentials xxx:yyy".
>
> What does this command output?  There should definitely be something
> output when that command is run.  I don't know if it will be a lot of
> output or a little bit, but whatever it is, can you provide it?
*The output resembles the contents of security.json, except that there's
only one authenticated user, which is the one whose credentials are
supplied.  And there are only two permissions.*
>
>> I then use a series of additional API calls ( to
>> create additional users and permissions).  This creates my desired
>> security environment (and, BTW, it seems to function as it should).
>
> Can you elaborate on exactly what you did when you say "a series of
> additional API calls"?
*I issued the well-documented curl-based commands to create a user and
to create a permission.  Multiple times as needed.*
>
>> If I restart solr, it appears I must reactivate it with the same
>> 'bin/solr auth enable -credentials xxx:yyy' command.  But, it seems that
>> when solr is restarted this way, only the authorizations are retained
>> persistently.  But the authentications have to be created again from
>> scratch.
>
> Enabling the authentication when running in cloud mode should upload a
> "security.json" file to zookeeper.  It should also write some
> variables to your solr.in.sh file, so that future usage of the
> bin/solr tool can provide the authentication that is required.
*That's the essence of my question: yes, I think it should logically do
what you say, but I don't know if or how it does that.  I don't think it
loads security.json because I have to start from scratch no matter
what's in security.json, and no matter where I place that file.  I would
be happy if it did that because I could prepare a fine-tuned set of
authentications and permissions and reuse it each time.  I simply don't
know how to do that (or even if it can be done).*
>
> Thanks,
> Shawn
>
>



Resetting Authentication/Authorization

2018-03-29 Thread Terry Steichen
When I set up the initial authentications and authorizations (I'm using
6.6.0 and running in cloud mode.), I call "bin/solr auth enable
-credentials xxx:yyy".  I then use a series of additional API calls ( to
create additional users and permissions).  This creates my desired
security environment (and, BTW, it seems to function as it should).

If I restart solr, it appears I must reactivate it with the same
'bin/solr auth enable -credentials xxx:yyy' command.  But, it seems that
when solr is restarted this way, only the authorizations are retained
persistently.  But the authentications have to be created again from
scratch.

I would like to (somehow) capture the authentication/authorization
information (probably in a security.json file?) and then (somehow)
reload it when there's a restart. 

Can that be done?
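
What I have in mind is pushing a prepared file into ZooKeeper with zkcli;
a sketch, assuming the embedded ZooKeeper on port 9983 and a saved copy in
my home directory:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
  -cmd putfile /security.json ~/security.json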


Three Indexing Questions

2018-03-29 Thread Terry Steichen
First question: When indexing content in a directory, Solr's normal
behavior is to recursively index all the files found in that directory
and its subdirectories.  However, it turns out that when the files are of
the form *.eml (email), solr won't do that.  I can use a wildcard to get
it to index the current directory, but it won't recurse.

I note this message that's displayed when I begin indexing: "Entering
auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log"

Is there a way to get it to recurse through files with different
extensions, for example, like .eml?  When I manually add all the
subdirectory content, solr seems to parse the content very well,
recognizing all the standard email metadata.  I just can't get it to do
the indexing recursively.
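
From the post tool's usage text, its -filetypes option looks relevant
here; a sketch of the invocation I mean (collection name and path are
illustrative):

bin/post -c emails1 -filetypes eml /data/mail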

Second question: if I want to index files from many different source
directories, is there a way to specify these different sources in one
command? (Right now I have to issue a separate indexing command for each
directory - which means I have to sit around and wait till each is
finished.)
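
Something along these lines, if the post tool will accept several
directory arguments in a single call (paths illustrative):

bin/post -c emails1 /data/mail /data/discovery /data/filings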

Third question: I have a very large directory structure that includes a
couple of subdirectories I'd like to exclude from indexing.  Is there a
way to index recursively, but exclude specified directories?
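
The only workaround I can think of is building the file list outside Solr
and handing it to the post tool; a sketch, with the excluded directory
names made up:

find /data/docs -type d \( -name drafts -o -name tmp \) -prune -o \
  -type f -print0 | xargs -0 bin/post -c emails1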



Re: Continuing Saga of Authorization on 6.6.0

2018-03-13 Thread Terry Steichen
So, Shawn, every time zookeeper gets shut down (intentionally or
otherwise), I have to recreate the credentials and permissions via a set
of API calls?  Is there some way to have it save and read that stuff
from disk?
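
Ideally something like pulling the live copy down to disk with zkcli; a
sketch, assuming the embedded ZooKeeper on port 9983:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
  -cmd getfile /security.json ~/security.json.backup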

Terry

On 03/13/2018 01:51 PM, Shawn Heisey wrote:
> On 3/13/2018 11:25 AM, Terry Steichen wrote:
>> What also puzzles me is that I can't find any "security.json" file. 
>> Clearly, solr is persistently keeping track of the
>> authentication/authorization information, but I don't see where.  I
>> suppose it might be kept in zookeeper (which perhaps survives solr
>> restarts - but I don't know).  Any insights on that?
> Yes, with SolrCloud, the security.json file is kept in zookeeeper. 
> Almost all of the configuration for SolrCloud is in zookeeper, so it can
> affect any server in the cloud.  The only usual exception is solr.xml,
> and even that file CAN be in zookeeper.
>
> Thanks,
> Shawn
>
>



Re: Continuing Saga of Authorization on 6.6.0

2018-03-13 Thread Terry Steichen
Chris, many, many thanks.  From a quick check, those changes seem to
work.  I think I'm getting too old to differentiate between brackets and
curly braces.  I'll get back on track and see if I can (finally) set
this up right. 
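
In case it helps someone searching the archives, these are the forms that
worked for me (same placeholder credentials as before):

curl --user terry:admin http://localhost:8983/solr/admin/authentication \
  -H 'Content-type:application/json' -d '{"delete-user": ["lanny"]}'

curl --user terry:admin http://localhost:8983/solr/admin/authorization \
  -H 'Content-type:application/json' \
  -d '{"set-permission": {"name":"collection-admin-edit", "role":"admin"}}'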

What also puzzles me is that I can't find any "security.json" file. 
Clearly, solr is persistently keeping track of the
authentication/authorization information, but I don't see where.  I
suppose it might be kept in zookeeper (which perhaps survives solr
restarts - but I don't know).  Any insights on that?

Terry

On 03/13/2018 01:01 PM, Chris Ulicny wrote:
>> *failed to delete a user:*
> "delete-user" is expecting an array of users in the json, so the data
> should be: {"delete-user": ["lanny"]}
>
>
>> *failed to set a permission: *
> There are separate endpoints for authorization and authentication. You
> should use ".../solr/admin/authorization" for the permissions instead of
> "../solr/admin/authentication"
> https://lucene.apache.org/solr/guide/7_2/rule-based-authorization-plugin.html#manage-permissions
>
> Disclaimer: I've never worked with 6.6, but I've not noticed any big
> differences between the security for our 6.3 deployments and the 7.X ones.
>
> Best,
> Chris
>
> On Tue, Mar 13, 2018 at 12:47 PM Terry Steichen <te...@net-frame.com> wrote:
>
>> I switched solr from standalone to cloud and created the two collections
>> (emails1 and emails2).
>>
>> I was able to create a basic set of credentials via the curl-based
>> API's.  I could create users, and toggle the blockUnknown property
>> status. However, the system refused to allow me to delete a user, or to
>> set a permission.
>>
>> Here are the curl commands (with *terry:admin* as admin credentials) and
>> results:
>>
>> *succeeded in setting blockUnknown property (verified by
>> admin/authentication dump):*
>>
>> curl --user terry:admin http://localhost:8983/solr/admin/authentication
>> -H 'Content-type:application/json' -d '{
>>   "set-property": {"blockUnknown" : true}}'
>>
>> *succeeded in adding a user (verified by admin/authentication dump):*
>>
>> curl --user terry:admin http://localhost:8983/solr/admin/authentication
>> -H 'Content-type:application/json' -d '{
>>   "set-user": {"lanny" : "hawaii"}}'
>> *succeeded in changing lanny's password (verified by
>> admin/authentication dump):*
>>
>> curl --user terry:admin http://localhost:8983/solr/admin/authentication
>> -H 'Content-type:application/json' -d '{
>>  "set-user": {"lanny" : "hawaii_five_o"}}'
>>
>> *failed to delete a user:*
>>
>>  curl --user terry:admin http://localhost:8983/solr/admin/authentication
>> -H 'Content-type:application/json' -d '{
>>  "delete-user": {"lanny"}}'
>> {
>>   "responseHeader":{
>> "status":500,
>> "QTime":1},
>>
>>   "error":{ "msg":"Expected key,value separator ':': char=},position=26
>> BEFORE='{ \"delete-user\": {\"lanny\"}' AFTER='}'",
>> [terry here: plus a very long stack trace]
>>
>> *failed to set a permission: *
>>
>> curl --user terry:admin http://localhost:8983/solr/admin/authentication
>> -H 'Content-type:application/json' -d '{"set-permission" :
>> {"name":"collection-admin-edit", "role":"admin"}}'
>> {
>>   "responseHeader":{
>> "status":0,
>> "QTime":2},
>>   "errorMessages":[{
>>   "set-permission":{
>> "name":"collection-admin-edit",
>> "role":"admin"},
>>   "errorMessages":["Unknown operation 'set-permission' "]}]}
>>
>>
>> This really makes no sense at all (or, I'm really losing it - always a
>> distinct possibility).  It's almost as if half of the documented
>> parameters must have been changed, though I can't find any references to
>> any such changes.
>>
>> I confess I'm about to just give up and find some other route to go.
>>
>> Terry
>>
>>

Continuing Saga of Authorization on 6.6.0

2018-03-13 Thread Terry Steichen
I switched solr from standalone to cloud and created the two collections
(emails1 and emails2). 

I was able to create a basic set of credentials via the curl-based
API's.  I could create users, and toggle the blockUnknown property
status. However, the system refused to allow me to delete a user, or to
set a permission. 

Here are the curl commands (with *terry:admin* as admin credentials) and
results:

*succeeded in setting blockUnknown property (verified by
admin/authentication dump):*

curl --user terry:admin http://localhost:8983/solr/admin/authentication
-H 'Content-type:application/json' -d '{
  "set-property": {"blockUnknown" : true}}'

*succeeded in adding a user (verified by admin/authentication dump):*

curl --user terry:admin http://localhost:8983/solr/admin/authentication
-H 'Content-type:application/json' -d '{
>   "set-user": {"lanny" : "hawaii"}}'

*succeeded in changing lanny's password (verified by
admin/authentication dump):*

curl --user terry:admin http://localhost:8983/solr/admin/authentication
-H 'Content-type:application/json' -d '{
 "set-user": {"lanny" : "hawaii_five_o"}}'

*failed to delete a user:*

 curl --user terry:admin http://localhost:8983/solr/admin/authentication
-H 'Content-type:application/json' -d '{
 "delete-user": {"lanny"}}'
{
  "responseHeader":{
    "status":500,
    "QTime":1},

  "error":{ "msg":"Expected key,value separator ':': char=},position=26
BEFORE='{ \"delete-user\": {\"lanny\"}' AFTER='}'",
[terry here: plus a very long stack trace]

*failed to set a permission: *

curl --user terry:admin http://localhost:8983/solr/admin/authentication
-H 'Content-type:application/json' -d '{"set-permission" :
{"name":"collection-admin-edit", "role":"admin"}}'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "errorMessages":[{
  "set-permission":{
    "name":"collection-admin-edit",
    "role":"admin"},
  "errorMessages":["Unknown operation 'set-permission' "]}]}


This really makes no sense at all (or, I'm really losing it - always a
distinct possibility).  It's almost as if half of the documented
parameters must have been changed, though I can't find any references to
any such changes. 

I confess I'm about to just give up and find some other route to go. 

Terry


On 03/12/2018 11:15 PM, Shawn Heisey wrote:
> On 3/12/2018 8:39 PM, Terry Steichen wrote:
>> I'm increasingly of the view that Solr's authentication/authorization
>> mechanism doesn't work correctly in a _standalone_ mode.  It was present
>> in the cloud mode for quite a few versions back, but as of 6.0.0 (or so)
>> it was supposed to be available in standalone mode too.  It seems to
>> partly work (when using the built-in permissions), but does not seem to
>> work with customized, core-specific permissions.
>
> I suspected based on your last message that the authorization feature
> might only work correctly in SolrCloud.  The entire authentication
> feature was designed for SolrCloud.  Version 6.5 brought the
> security.json file to standalone mode.  This was LONG after the
> feature was introduced in 5.2 and had a LOT of bugs fixed in the three
> 5.3.x releases.
>
> I just found the section in the documentation confirming what I
> suspected.
>
> https://lucene.apache.org/solr/guide/7_2/authentication-and-authorization-plugins.html#authorization
>
>
> There is a note here that says "The authorization plugin is only
> supported in SolrCloud mode. Also, reloading the plugin isn’t yet
> supported and requires a restart of the Solr installation (meaning,
> the JVM should be restarted, not simply a core reload)."  The 6.6
> documentation contains the same note that you can see here in the
> latest docs.
>
> I have no idea how hard it would be to extend the authorization plugin
> to support standalone cores as well as collections.  I imagine that if
> it were easy, it would have been done already.
>
> Thanks,
> Shawn
>
>



Resend: Authorization on 6.6.0

2018-03-12 Thread Terry Steichen
I'm resending the information below because the original message got the
security.json stuff garbled.


I'm using 6.6.0 with security.json active, having the content shown
below.  I am running in standalone mode and have two solr cores defined:
emails1 and emails2.  Since 'blockUnknown' is set to false, everyone
should have access to any unprotected resource.  As you can see, I have
three users defined: joe, solr and terry (the latter two having an admin
role).

What I expect to happen is for user joe (who is not an admin) to be able
to access core emails2 without being challenged for his credentials. 
But user joe should be challenged and refused when he tries to access emails1. 

But solr appears to ignore the "collection" portion of the permission -
it denies joe access to both cores. 

Is this a bug (in that auth doesn't work properly in 6.6.0 standalone),
or am I (once again) missing something?

Terry


{
    "authentication": {
    "class": "solr.BasicAuthPlugin",
    "blockUnknown": true,
    "credentials": {
    "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",
    "joe": "iGx0BaTgmjmCxrRmaD3IsCb2MJ21x1vqhfdzbwyu9MY=
P+aA0Bx811jzRwR97bOn/x/jyvpoKiHpWIRRXGAc8tg=",
    "terry": "q71fVfo/DIeCSfc1zw6YMyXVjU24Jr2oLniEkXFdPe0=
oSaEbu/0TCg8UehLQ9zfoH3AvrJBqCaIoJkt547WIrc="
    },
    "": {
    "v": 0
    }
    },
    "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": {
    "solr": "admin",
    "terry": "admin"
    },
    "permissions": [
    {
    "path": "/select",
    "role": "admin"
    }
    ]
    }
}


Authorization in Solr 6.6.0 Not Working Properly

2018-03-12 Thread Terry Steichen
I'm using 6.6.0 with security.json active, having the content shown
below.  I am running in standalone mode and have two solr cores defined:
emails1 and emails2.  Since 'blockUnknown' is set to false, everyone
should have access to any unprotected resource.  As you can see, I have
three users defined: joe, solr and terry (the latter two having an admin
role).

What I expect to happen is for user joe (who is not an admin) to be able
to access core emails2 without being challenged for his credentials. 
But user joe should be challenged and refused when he tries to access emails1. 

But solr appears to ignore the "collection" portion of the permission -
it denies joe access to both cores. 

Is this a bug (in that auth doesn't work properly in 6.6.0 standalone),
or am I (once again) missing something?

Terry

{     "authentication": {     "class": "solr.BasicAuthPlugin",
    "blockUnknown": false,     "credentials": {    
"solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c=",     "joe":
"iGx0BaTgmjmCxrRmaD3IsCb2MJ21x1vqhfdzbwyu9MY=
P+aA0Bx811jzRwR97bOn/x/jyvpoKiHpWIRRXGAc8tg=",     "terry":
"q71fVfo/DIeCSfc1zw6YMyXVjU24Jr2oLniEkXFdPe0=
oSaEbu/0TCg8UehLQ9zfoH3AvrJBqCaIoJkt547WIrc="     },     "": {
    "v": 0     }     },     "authorization": {    
"class": "solr.RuleBasedAuthorizationPlugin",     "user-role": {
    "solr": "admin",     "terry": "admin"     },
    "permissions": [     {    
"collection":"emails1",     "path": "/select",
    "role": "admin"     }     ]     } }



Setting Up Solr Authentication/Authorization

2018-03-09 Thread Terry Steichen
I'm trying to set up basic authentication/authorization with solr 6.6.0.

The documentation says to create a security.json file and describes the
content as:

{
"authentication":{
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= 
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[{"name":"security-edit",
  "role":"admin"}]
   "user-role":{"solr":"admin"}
}}

Does that mean to literally use exactly the above as the security.json content, 
or customize it (in some fashion)?

The documentation also mentions that the initial admin user is named
"solr" with a password of "SolrRocks".  What's unclear is whether that's the
password from which the hash (in security.json) was created, or what.

What I can't figure out is whether the password hash is fixed, or whether it 
should be generated, and if so, how?
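
If the stock hash does correspond to solr:SolrRocks, I'm guessing the
practical route is to bootstrap with those credentials and then change the
password through the API; a sketch:

curl --user solr:SolrRocks http://localhost:8983/solr/admin/authentication \
  -H 'Content-type:application/json' \
  -d '{"set-user": {"solr" : "MyNewPassword"}}'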

Also, some people on the web recommend altering the jetty xml files to do
this - is that necessary as well?

I'm certain this is fairly simple once I can get started - but I'm having 
trouble getting past step 1, and any help would be appreciated.

Terry



Re: Solr Read-Only?

2018-03-06 Thread Terry Steichen
Chris,

Thanks for your suggestion.  Restarting solr after an in-memory
corruption is, of course, trivial (compared to rebuilding the indexes).

Are there any solr directories that MUST be read/write (even with a
pre-built index)?  Would it suffice (for my purposes) to make only the
data/index directory R-O?
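
Concretely, something like this is what I have in mind, assuming the
default core layout (paths illustrative):

chmod -R a-w /opt/solr/server/solr/emails1/data/index
# and later, to allow indexing again:
chmod -R u+w /opt/solr/server/solr/emails1/data/index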

Terry


On 03/06/2018 04:20 PM, Christopher Schultz wrote:
> Terry,
>
> On 3/6/18 4:08 PM, Terry Steichen wrote:
> > Is it possible to run solr in a read-only directory?
>
> > I'm running it just fine on a ubuntu server which is accessible
> > only through SSH tunneling.  At the platform level, this is fine:
> > only authorized users can access it (via a browser on their machine
> > accessing a forwarded port).
>
> > The problem is that it's an all-or-nothing situation so everyone
> > who's authorized access to the platform has, in effect,
> > administrator privileges on solr.  I understand that authentication
> > is coming, but that it isn't here yet.  (Or, to add complexity, I
> > had to downgrade from 7.2.1 to 6.4.2 to overcome a new bug
> > concerning indexing of eml files, and 6.4.2 definitely doesn't have
> > authentication.)
>
> > Anyway, what I was wondering is if it might be possible to run solr
> > not as me (the administrator), but as a user with lesser privileges
> > so that no one who came through the SSH tunnel could (inadvertently
> > or otherwise) screw up the indexes.
>
> With shell access, the only protection you could provide would be
> through file-permissions. But of course Solr will need to be
> read-write in order to build the index in the first place. So you'd
> probably have to run read-write at first, build the index (perhaps
> that's already been done in the past), then (possibly) restart in
> read-only mode.
>
> Read-only can be achieved by simply revoking write-access to the data
> directories from the euid of the Solr process. Theoretically, you
> could switch from being read-write to read-only merely by changing
> file-permissions... no Solr restarts required.
>
> I'm not sure if it matters to you very much, but a user can still do
> some damage to the index even if the "server" is read-only (through
> file-permissions): they can issue a batch of DELETE or ADD requests
> that will effect the in-memory copies of the index. It might be
> temporary, but it might require that you restart the Solr instance to
> get back to a sane state.
>
> Hope that helps,
> -chris
>



Solr Read-Only?

2018-03-06 Thread Terry Steichen
Is it possible to run solr in a read-only directory?

I'm running it just fine on a ubuntu server which is accessible only
through SSH tunneling.  At the platform level, this is fine: only
authorized users can access it (via a browser on their machine accessing
a forwarded port). 

The problem is that it's an all-or-nothing situation so everyone who's
authorized access to the platform has, in effect, administrator
privileges on solr.  I understand that authentication is coming, but
that it isn't here yet.  (Or, to add complexity, I had to downgrade from
7.2.1 to 6.4.2 to overcome a new bug concerning indexing of eml files,
and 6.4.2 definitely doesn't have authentication.)

Anyway, what I was wondering is if it might be possible to run solr not
as me (the administrator), but as a user with lesser privileges so that
no one who came through the SSH tunnel could (inadvertently or
otherwise) screw up the indexes.

Terry



Re: Challenges of Indexing Email

2018-02-26 Thread Terry Steichen
Thanks Karthik.

(1) I thought the fix would be in 7.2.1, but it is not.  Any idea when
it will be available?

(2) Is there any way to force Solr indexing to treat an email message
(or thread) as plain text?

Terry


On 02/26/2018 10:37 AM, Karthik Ramachandran wrote:
> There is bug report for this
> https://issues.apache.org/jira/browse/SOLR-11622 which is fixed for future
> release.
>
> Before running into this issue we were running 6.4.2 which did not have
> this bug.
>
> On Mon, Feb 26, 2018 at 9:59 AM, Terry Steichen <te...@net-frame.com> wrote:
>
>> I am using Solr 7.2.1 and trying to index (among other documents)
>> individual emails and collected email threats.  Ideally, the indexing
>> would parse the email messages into their constituent fields.  But, for
>> my purposes, an acceptable alternative is to merely index the messages a
>> unstructured text.
>>
>> But Solr won't let me do either.  Whenever I try I get this message:
>> java.lang.NoClassDefFoundError:
>> org/apache/james/mime4j/stream/MimeConfig$Builder
>>
>> I noticed that others have encountered this error, and that there is (or
>> was) a bug report on it.
>>
>> I'm unsure what to do, other than forgo indexing emails.  Not sure if a
>> patch is available, but even if it is, I don't know what to do with it.
>>
>> Alternatively, it would seem that there should be some way to instruct
>> Solr to stop parsing and just treat the files as pure text - but I don't
>> know how to do that.
>>
>> Maybe I'll just have to install an earlier Solr version that doesn't
>> have this bug - could someone tell me what version that might be?
>>
>> Regards,
>>
>> Terry
>>
>>
>>
>



Challenges of Indexing Email

2018-02-26 Thread Terry Steichen
I am using Solr 7.2.1 and trying to index (among other documents)
individual emails and collected email threads.  Ideally, the indexing
would parse the email messages into their constituent fields.  But, for
my purposes, an acceptable alternative is to merely index the messages as
unstructured text.

But Solr won't let me do either.  Whenever I try I get this message: 
java.lang.NoClassDefFoundError:
org/apache/james/mime4j/stream/MimeConfig$Builder

I noticed that others have encountered this error, and that there is (or
was) a bug report on it.

I'm unsure what to do, other than forgo indexing emails.  Not sure if a
patch is available, but even if it is, I don't know what to do with it. 

Alternatively, it would seem that there should be some way to instruct
Solr to stop parsing and just treat the files as pure text - but I don't
know how to do that. 

Maybe I'll just have to install an earlier Solr version that doesn't
have this bug - could someone tell me what version that might be?

Regards,

Terry