Re: Three questions about huge tlog problem and CDCR

2019-12-18 Thread alwaysbluesky
Found a typo; correcting: "updateLogSynchronizer" is set to 60000 ms (1 min),
not 1 hour.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Three questions about huge tlog problem and CDCR

2019-12-18 Thread Louis
* Environment: Solr Cloud 7.7.0, 3 nodes / CDCR bidirectional / CDCR buffer
disabled

Hello All,

I have a problem with tlogs: they are getting bigger and bigger.

They don't seem to be deleted at all, even after a hard commit, so the
total size of the tlog files is now more than 21 GB.

Actually I see multiple tlog folders like,

 2.5GB tlog/
 6.7GB tlog.20190815170021077/ 
 6.7GB tlog.20190316225613751/
 ...

Are they all necessary for recovery? What are the tlog.2019* folders?


Based on my understanding, tlog files are for recovery when a graceful
shutdown has failed.

1) As long as I stop all nodes gracefully, is it safe to delete the tlog
files manually with rm -rf ./tlogs?

2) I think the reason the tlog files are not deleted is that CDCR is not
working properly, so the tlogs stay forever until they are synchronized;
synchronization never happens, and the tlogs keep growing. Does my
theory make sense?

3) Actually, we set our replicator element's schedule to 1 hour and the
updateLogSynchronizer element to 1 hour as well. Could CDCR be failing
because these intervals are too long?
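For reference, both of those intervals live in the /cdcr request handler in
solrconfig.xml and are given in milliseconds. A sketch of the relevant layout
(the zkHost and collection names are placeholders; the values shown follow the
Solr 7 CDCR reference guide examples, not a recommendation for this setup):

```xml
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">target-zk:2181</str>      <!-- placeholder -->
    <str name="source">sourceCollection</str>    <!-- placeholder -->
    <str name="target">targetCollection</str>    <!-- placeholder -->
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">1000</str>   <!-- ms between replication runs -->
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">60000</str>  <!-- ms; 60000 = 1 minute -->
  </lst>
</requestHandler>
```

With a 1-hour replicator schedule (3600000 ms), forwarding to the target only
runs once an hour, so tlogs can pile up between runs.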





Re: how to exclude path from being queried

2019-12-18 Thread Shawn Heisey

On 12/18/2019 1:21 PM, Nan Yu wrote:

     I am trying to find all files containing a keyword in a directory (and 
many sub-directories).

     I did a quick indexing using


bin/post -c myCore /RootDir

     When I query the index using "keyword", all files whose path containing 
the keyword will be included in the search result. For example: 
/RootDir/KeywordReports/FileDoesNotContainKeyword.txt will be shown in the query result.
      The query is: http://localhost:8983/solr/myCore/select?q=keyword
   
     Is there a way to exclude files whose content does not contain the keyword but the path contains the keyword?

     Should I re-index the directory using some extra parameter? Or use extra 
condition in the query


It sounds like your default field is probably a catchall which has the 
contents of multiple source fields copied to it, including the content 
and the filename.


If you do not want the filename searched, then query a different field 
which does not contain that information.  You may need to adjust your 
schema and reindex for this to be possible.
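A sketch of the catch-all setup Shawn is describing, as it might look in the
schema (the field names content, resourcename, and _text_ are assumptions
borrowed from the default configset conventions, not from the poster's actual
schema):

```xml
<!-- Both the extracted body text and the file name/path are copied into
     the same catch-all field that q=keyword searches by default. -->
<copyField source="content" dest="_text_"/>
<copyField source="resourcename" dest="_text_"/>
```

If a content-only field exists, querying it directly (e.g.
q=content:keyword) would avoid matching on the path.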


You haven't shared the configs for this index, so it is not possible for 
us to confirm that guess.


Thanks,
Shawn


Re: how to exclude path from being queried

2019-12-18 Thread Paras Lehana
Hi Nan,

Are you using PathHierarchyTokenizer?

On Thu, 19 Dec 2019 at 01:51, Nan Yu  wrote:

> Hi,
> I am trying to find all files containing a keyword in a directory (and
> many sub-directories).
>
> I did a quick indexing using
>
> bin/post -c myCore /RootDir
>
> When I query the index using "keyword", all files whose path
> containing the keyword will be included in the search result. For example:
> /RootDir/KeywordReports/FileDoesNotContainKeyword.txt will be shown in the
> query result.
>  The query is: http://localhost:8983/solr/myCore/select?q=keyword
>
> Is there a way to exclude files whose content does not contain the
> keyword but the path contains the keyword?
> Should I re-index the directory using some extra parameter? Or use
> extra condition in the query
>
>
> Thanks!
> Nan
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-18 Thread Nick D
Michael,

Thank you so much, that was extremely helpful. My googlefu wasn't good
enough I guess.

Option 1 was my initial fix, just to stop it from exploding.

Option 2 will be the permanent solution for now, until we can get some
things squared away for 8.0.

Sounds like even in 8, any graph query expansion can still grow rather
large; it just won't consume all available memory. Is that correct?

One final question: why didn't the maxBooleanClauses value in solrconfig
still apply? Reading through all the Jiras, I thought that was supposed to
be a failsafe. Did I miss something?

Thanks again for your help,

Nick

On Wed, Dec 18, 2019, 8:10 AM Michael Gibney 
wrote:

> This is related to this issue:
> https://issues.apache.org/jira/browse/SOLR-13336
>
> Also tangentially relevant:
> https://issues.apache.org/jira/browse/LUCENE-8531
> https://issues.apache.org/jira/browse/SOLR-12243
>
> I think your options include:
> 1. setting slop=0, which restores SpanNearQuery as the graph phrase
> query implementation (see LUCENE-8531)
> 2. downgrading to 7.5 would avoid the OOM, but would cause graph
> phrase queries to be effectively ignored (see SOLR-12243)
> 3. upgrade to 8.0, which will restore the failsafe maxBooleanClauses,
> avoiding OOM but returning an error code for affected queries (which
> in your case sounds like most queries?) (see SOLR-13336)
>
> Michael
>
> On Tue, Dec 17, 2019 at 4:16 PM Nick D  wrote:
> >
> > Hello All,
> >
> > We recently upgraded from Solr 6.6 to Solr 7.7.2 and recently had spikes
> in
> > memory that eventually caused either an OOM or almost 100% utilization of
> > the available memory. After trying a few things, increasing the JVM heap,
> > making sure docValues were set for all Sort, facet fields (thought maybe
> > the fieldCache was blowing up), I was able to isolate a single query that
> > would cause the used memory to become fully exhausted and effectively
> > render the instance dead. After applying a timeAllowed value to the
> > query and reducing the query phrase (the system would crash without
> > throwing the warning on longer queries containing synonyms), I was
> > able to identify the following warning in the logs:
> >
> > o.a.s.s.SolrIndexSearcher Query: 
> >
> > the request took too long to iterate over terms. Timeout: timeoutAt:
> > 812182664173653 (System.nanoTime(): 812182715745553),
> > TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441
> >
> > I have narrowed the problem down to the way synonyms are being
> > expanded along with phrase slop.
> >
> > With ps=5 I get 4096 possible permutations of the phrase being
> > searched, because of synonyms, looking similar to:
> > ngs_title:"bereavement leave type build bereavement leave type data p"~5
> >  ngs_title:"bereavement leave type build bereavement bereavement type
> data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement jury duty type data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement maternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build bereavement paternity type data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement paternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build bereavement adoption leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty maternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty paternity type data
> p"~5
> >  ngs_title:"bereavement leave type build jury duty paternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty adoption leave type
> data
> > p"~5
> >  ngs_title:"bereavement leave type build jury duty absence type data p"~5
> >  ngs_title:"bereavement leave type build maternity leave leave type data
> > p"~5
> >  ngs_title:"bereavement leave type build maternity leave bereavement type
> > data p"~5
> >  ngs_title:"bereavement leave type build maternity leave jury duty type
> > data p"~5
> >
> > 
> >
> > Previously in Solr 6 that same query, with the same synonyms (and query
> > analysis chain), would produce a parsedQuery like the following when
> > using ps=5:
> > DisjunctionMaxQuery(((ngs_field_description:\"leave leave type build
> leave
> > leave type data ? p leave leave type type.enabled\"~5)^3.0 |
> > (ngs_title:\"leave leave type build leave leave type data ? p leave leave
> > type type.enabled\"~5)^10.0)
> >
> > The expansion wasn't being applied to the added DisjunctionMaxQuery
> > when adjusting rankings with phrase slop.
> >
> > In general the parsed queries between 6 and 7 are different, with some
> > new `spanNears` showing, but they don't create the memory consumption
> > issues that I have seen when a large synonym expansion happens along
> > with using a ps parameter.
> >
> > I didn't see much in terms of release-note changes for synonyms
> > (outside of SOW=false 

Re: Starting Solr automatically

2019-12-18 Thread Shawn Heisey

On 12/16/2019 9:48 PM, Anuj Bhargava wrote:

Often solr stops working. We have to then go to the root directory and give
the command *'service solr start*'

Is there a way to automatically start Solr when it stops?


If Solr is stopping, then something went wrong.  Something that will 
probably continue to go wrong if you simply restart Solr.  An OOM (out 
of memory) like Paras mentioned in his reply is the most likely cause.


You haven't mentioned which version of Solr you've got.  Most recent 
versions, when started on a non-Windows operating system, will 
self-terminate if Java throws an OutOfMemoryError (OOME) exception.  If 
this happens, a separate logfile with "oom" in its filename should be 
created.  The reason that this happens is that operation of any Java 
program is completely unpredictable after an OOME.


If OOME is happening, then you have exactly two ways to deal with it. 
They are outlined here:


https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems#SolrPerformanceProblems-JavaHeap
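If the answer turns out to be "increase the heap", that is set in the include
script rather than in Solr's configs. A minimal sketch (the 2g value is only
an example; size it for your own index, and the file path varies by install
type):

```shell
# In solr.in.sh (e.g. /etc/default/solr.in.sh on a service install):
SOLR_HEAP="2g"   # sets both -Xms and -Xmx; replaces the small default heap
```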

Thanks,
Shawn


how to exclude path from being queried

2019-12-18 Thread Nan Yu
Hi, 
    I am trying to find all files containing a keyword in a directory (and
many sub-directories).

    I did a quick indexing using

bin/post -c myCore /RootDir

    When I query the index using "keyword", all files whose path contains
the keyword are included in the search result. For example,
/RootDir/KeywordReports/FileDoesNotContainKeyword.txt will be shown in the
query result.
    The query is: http://localhost:8983/solr/myCore/select?q=keyword

    Is there a way to exclude files whose content does not contain the
keyword but whose path does?
    Should I re-index the directory using some extra parameter? Or use an
extra condition in the query?


Thanks!
Nan 



Re: CVE-2017-7525 fix for Solr 7.7.x

2019-12-18 Thread Kevin Risden
There are no specific plans for any 7.x branch releases that I'm aware of.
Specifically for SOLR-13110, that fix required upgrading Hadoop from 2.x to
3.x to get rid of jackson-mapper-asl, and there are no plans to backport it
to 7.x even if there were a future 7.x release.

Kevin Risden


On Wed, Dec 18, 2019 at 8:44 AM Mehai, Lotfi 
wrote:

> Hello;
>
> We are using Solr 7.7.0. The CVE-2017-7525 have been fixed for Solr 8.x.
> https://issues.apache.org/jira/browse/SOLR-13110
>
> When the fix will be available for Solr 7.7.x
>
> Lotfi
>


Re: number of files indexed (re-formatted)

2019-12-18 Thread Erick Erickson
I’d urge you to consider moving the process from using ExtractingRequestHandler 
(i.e. just sending the data to Solr) to doing the Tika parser externally. 
ExtractingRequestHandler is a great way to get started, but I’ve often found 
that I need much finer control over the process.

Here’s the full treatment:
https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick

> On Dec 18, 2019, at 11:15 AM, Jörn Franke  wrote:
> 
> This depends on your ingestion process. Usually the unique ids that are not 
> filenames may come not from a file or your ingestion process does not tel the 
> file name. In this case the Collection seems to be configured to generate a 
> unique identifier.
> 
> Maybe you can describe more in detail on how you process the files.
> 
> A wild speculation could be that they come from inside a zip file. In this 
> case Metadata from Tika could be used as an Id were you concatenation zip 
> file + file inside zip file .
> However we don’t know what you have defined how your ingestion process looks 
> like so this is pure speculation from my side.
> 
>> Am 18.12.2019 um 16:40 schrieb Nan Yu :
>> 
>> Sorry that I just found out that the mailing list takes plain text and my 
>> previous post looks really messy. So I reformatted it.
>> 
>> 
>> Hi,
>>I did a simple indexing of a directory that contains a lot of pdf, text, 
>> doc, zip etc. There are no structures for the content of the files and I 
>> would like to index them and later on search "key words" within the files.
>> 
>> 
>>After creating the core, I indexed the files in the directory using the 
>> following command: 
>> 
>> 
>> bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log
>> 
>> 
>>The log file shows something like below (the first and last few lines in 
>> the log file):
>> 
>> 
>> java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes 
>> -Dport=8983 -Dm=15g -Dc=myCore -Ddata=files -Drecursive=yes 
>> org.apache.solr.util.SimplePostTool /DATA_FOLDER
>> SimplePostTool version 5.0.0
>> Posting files to [base] url http://localhost:8983/solr/myCore/update...
>> Entering auto mode. File endings considered are 
>> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> ...
>> ...
>> ...
>> POSTing file Report.pdf (application/pdf) to [base]/extract
>> 47256 files indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
>> Time spent: 1:03:59.587
>> 
>> 
>> 
>> 
>> But when using browser to try to look at the result, the "overview" 
>> (http://localhost:8983/solr/#/myCore/core-overview) shows:
>> Num Docs: 47648
>> 
>> 
>> Most of the files indexed has an metadata id has the value of the full path 
>> of the file indexed, such as /DATA_FOLDER/20180321/Report.pdf 
>> 
>> 
>> But there are about 400 of them, the id looks like: 
>> 232d7bd6-c586-4726-8d2b-bc9b1febcff4.
>> 
>> 
>> So my questions are:
>> (1)why the two numbers are different (in log file vs. in the overview).
>> (2)for those ids that are not a full path of a file, how do I know where 
>> they comes from (the original file)?
>> 
>> 
>> 
>> 
>> Thanks for your help!
>> Nan
>> 
>> 
>> 
>> 
>> PS: a few examples of query result for those strange ids:
>> 
>> 
>> {
>>"bolt-small-online":["Test strip-north"],
>>"3696714.008":[3702848.584],
>>"380614.564":[376900.143],
>>"100.038":[111.074],
>>"gpo-bolt":["teststrip"],
>>"id":"232d7bd6-c586-4726-8d2b-bc9b1febcff4",
>>"_version_":1652839231413813252
>> }
>> 
>> 
>> 
>> 
>> {
>>"Date":["8/24/2001"],
>>"EXT31":[0],
>>"EXT32":[0.12],
>>"Aggregate":[0.12],
>>"Pounds_Vap":[37],
>>"Gallons_Vap":[5.8],
>>"Gallons_Liq":[0],
>>"Gallons_Tot":[5.8],
>>"Avg_Rate":[1.8],
>>"Gallons_Rec":[577],
>>"Water":[577],
>>"id":"840c05af-caf0-4407-8753-dcc6957abcc5",
>>"Well_s_":["EXT31;EXT32"],
>>"Time__hrs_":[3.25],
>>"_version_":1652898731969740800}]
>>  }
>> 
>> 
>> {
>>"2":[4],
>>"SFS1":["PLM1"],
>>"1.00":[1.0],
>>"69":[79],
>>"id":"e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
>>"_version_":1652825435791163395
>> }



Re: number of files indexed (re-formatted)

2019-12-18 Thread Jörn Franke
This depends on your ingestion process. Usually, unique IDs that are not
filenames belong to documents that did not come from a file, or to an
ingestion process that does not pass along the file name. In this case the
collection seems to be configured to generate a unique identifier.

Maybe you can describe in more detail how you process the files.

A wild speculation could be that they come from inside a zip file. In that
case, metadata from Tika could be used as an ID, where you concatenate the
zip file name and the file name inside the zip.
However, we don't know how your ingestion process is defined, so this is
pure speculation from my side.

> Am 18.12.2019 um 16:40 schrieb Nan Yu :
> 
> Sorry that I just found out that the mailing list takes plain text and my 
> previous post looks really messy. So I reformatted it.
> 
> 
> Hi,
> I did a simple indexing of a directory that contains a lot of pdf, text, 
> doc, zip etc. There are no structures for the content of the files and I 
> would like to index them and later on search "key words" within the files.
> 
> 
> After creating the core, I indexed the files in the directory using the 
> following command: 
> 
> 
> bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log
> 
> 
> The log file shows something like below (the first and last few lines in 
> the log file):
> 
> 
> java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes 
> -Dport=8983 -Dm=15g -Dc=myCore -Ddata=files -Drecursive=yes 
> org.apache.solr.util.SimplePostTool /DATA_FOLDER
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:8983/solr/myCore/update...
> Entering auto mode. File endings considered are 
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> ...
> ...
> ...
> POSTing file Report.pdf (application/pdf) to [base]/extract
> 47256 files indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
> Time spent: 1:03:59.587
> 
> 
> 
> 
> But when using browser to try to look at the result, the "overview" 
> (http://localhost:8983/solr/#/myCore/core-overview) shows:
> Num Docs: 47648
> 
> 
> Most of the files indexed has an metadata id has the value of the full path 
> of the file indexed, such as /DATA_FOLDER/20180321/Report.pdf 
> 
> 
> But there are about 400 of them, the id looks like: 
> 232d7bd6-c586-4726-8d2b-bc9b1febcff4.
> 
> 
> So my questions are:
> (1)why the two numbers are different (in log file vs. in the overview).
> (2)for those ids that are not a full path of a file, how do I know where they 
> comes from (the original file)?
> 
> 
> 
> 
> Thanks for your help!
> Nan
> 
> 
> 
> 
> PS: a few examples of query result for those strange ids:
> 
> 
> {
> "bolt-small-online":["Test strip-north"],
> "3696714.008":[3702848.584],
> "380614.564":[376900.143],
> "100.038":[111.074],
> "gpo-bolt":["teststrip"],
> "id":"232d7bd6-c586-4726-8d2b-bc9b1febcff4",
> "_version_":1652839231413813252
> }
> 
> 
> 
> 
> {
> "Date":["8/24/2001"],
> "EXT31":[0],
> "EXT32":[0.12],
> "Aggregate":[0.12],
> "Pounds_Vap":[37],
> "Gallons_Vap":[5.8],
> "Gallons_Liq":[0],
> "Gallons_Tot":[5.8],
> "Avg_Rate":[1.8],
> "Gallons_Rec":[577],
> "Water":[577],
> "id":"840c05af-caf0-4407-8753-dcc6957abcc5",
> "Well_s_":["EXT31;EXT32"],
> "Time__hrs_":[3.25],
> "_version_":1652898731969740800}]
>   }
> 
> 
> {
> "2":[4],
> "SFS1":["PLM1"],
> "1.00":[1.0],
> "69":[79],
> "id":"e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
> "_version_":1652825435791163395
> }


Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-18 Thread Michael Gibney
This is related to this issue:
https://issues.apache.org/jira/browse/SOLR-13336

Also tangentially relevant:
https://issues.apache.org/jira/browse/LUCENE-8531
https://issues.apache.org/jira/browse/SOLR-12243

I think your options include:
1. setting slop=0, which restores SpanNearQuery as the graph phrase
query implementation (see LUCENE-8531)
2. downgrading to 7.5 would avoid the OOM, but would cause graph
phrase queries to be effectively ignored (see SOLR-12243)
3. upgrade to 8.0, which will restore the failsafe maxBooleanClauses,
avoiding OOM but returning an error code for affected queries (which
in your case sounds like most queries?) (see SOLR-13336)
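The failsafe option 3 refers to is the existing solrconfig.xml element; a
minimal sketch (1024 is the long-standing default limit):

```xml
<!-- solrconfig.xml: global cap on the number of clauses in a
     BooleanQuery; oversized expansions fail fast with an error
     instead of exhausting the heap. -->
<maxBooleanClauses>1024</maxBooleanClauses>
```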

Michael

On Tue, Dec 17, 2019 at 4:16 PM Nick D  wrote:
>
> Hello All,
>
> We recently upgraded from Solr 6.6 to Solr 7.7.2 and recently had spikes in
> memory that eventually caused either an OOM or almost 100% utilization of
> the available memory. After trying a few things, increasing the JVM heap,
> making sure docValues were set for all Sort, facet fields (thought maybe
> the fieldCache was blowing up), I was able to isolate a single query that
> would cause the used memory to become fully exhausted and effectively
> render the instance dead. After applying a timeAllowed value to the query
> and reducing the query phrase (the system would crash without throwing the
> warning on longer queries containing synonyms), I was able to identify the
> following warning in the logs:
>
> o.a.s.s.SolrIndexSearcher Query: 
>
> the request took too long to iterate over terms. Timeout: timeoutAt:
> 812182664173653 (System.nanoTime(): 812182715745553),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441
>
> I have narrowed the problem down to the way synonyms are being expanded
> along with phrase slop.
>
> With ps=5 I get 4096 possible permutations of the phrase being searched,
> because of synonyms, looking similar to:
> ngs_title:"bereavement leave type build bereavement leave type data p"~5
>  ngs_title:"bereavement leave type build bereavement bereavement type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement jury duty type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement maternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build bereavement paternity type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement paternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build bereavement adoption leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty maternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty paternity type data p"~5
>  ngs_title:"bereavement leave type build jury duty paternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty adoption leave type data
> p"~5
>  ngs_title:"bereavement leave type build jury duty absence type data p"~5
>  ngs_title:"bereavement leave type build maternity leave leave type data
> p"~5
>  ngs_title:"bereavement leave type build maternity leave bereavement type
> data p"~5
>  ngs_title:"bereavement leave type build maternity leave jury duty type
> data p"~5
>
> 
>
> Previously in Solr 6 that same query, with the same synonyms (and query
> analysis chain), would produce a parsedQuery like the following when using
> ps=5:
> DisjunctionMaxQuery(((ngs_field_description:\"leave leave type build leave
> leave type data ? p leave leave type type.enabled\"~5)^3.0 |
> (ngs_title:\"leave leave type build leave leave type data ? p leave leave
> type type.enabled\"~5)^10.0)
>
> The expansion wasn't being applied to the added DisjunctionMaxQuery when
> adjusting rankings with phrase slop.
>
> In general the parsed queries between 6 and 7 are different, with some new
> `spanNears` showing, but they don't create the memory consumption issues
> that I have seen when a large synonym expansion happens along with using
> a ps parameter.
>
> I didn't see much in terms of release-note changes for synonyms
> (outside of SOW=false being the default since version 7).
>
> The field being operated on has the following query analysis chain:
>
> <analyzer type="query">
>   <tokenizer .../>
>   <filter ... words="stopwords.txt"/>
>   <filter ... synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> </analyzer>
>
> Not sure if there is a change in phrase slop that now takes synonyms into
> account, and if there is a way to disable that kind of expansion. I am not
> sure if it is related to SOLR-10980 or not; it does seem related, but that
> issue referenced Solr 6, which does not do the expansion.
>
> Any help would be greatly appreciated.
>
> Nick
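To make the combinatorial growth described above concrete, here is a small
self-contained sketch (the synonym groups are hypothetical stand-ins, not the
poster's actual synonyms.txt): the number of expanded phrase queries is the
product of the number of alternatives at each position, so four 8-way
positions alone would give 8**4 = 4096 variants.

```python
from itertools import product

# Hypothetical synonym group (illustration only): the alternatives the
# synonym filter can emit at a given position of the query phrase.
leave_types = ["bereavement", "jury duty", "maternity leave",
               "paternity", "paternity leave", "adoption leave",
               "absence", "leave"]          # 8 alternatives

# One list per token position in the phrase; single-element lists are
# positions with no synonyms.
phrase_positions = [
    leave_types,      # 8 ways
    ["leave"],
    ["type"],
    ["build"],
    leave_types,      # 8 ways
    ["type"],
    ["data"],
    ["p"],
]

# Every phrase query emitted with ps>0 is one element of the cross
# product; two 8-way slots already yield 8 * 8 = 64 phrase variants.
expansions = [" ".join(words) for words in product(*phrase_positions)]
print(len(expansions))  # -> 64
```

Adding just two more 8-way positions multiplies this to 4096, matching the
explosion Nick observed.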


number of files indexed (re-formatted)

2019-12-18 Thread Nan Yu
Sorry that I just found out that the mailing list takes plain text and my 
previous post looks really messy. So I reformatted it.


Hi,
    I did a simple indexing of a directory that contains a lot of pdf, text, 
doc, zip etc. There are no structures for the content of the files and I would 
like to index them and later on search "key words" within the files.


    After creating the core, I indexed the files in the directory using the 
following command: 


bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log


    The log file shows something like below (the first and last few lines in 
the log file):


java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes 
-Dport=8983 -Dm=15g -Dc=myCore -Ddata=files -Drecursive=yes 
org.apache.solr.util.SimplePostTool /DATA_FOLDER
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/myCore/update...
Entering auto mode. File endings considered are 
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
...
...
...
POSTing file Report.pdf (application/pdf) to [base]/extract
47256 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
Time spent: 1:03:59.587




But when using a browser to look at the result, the "overview" page
(http://localhost:8983/solr/#/myCore/core-overview) shows:
Num Docs: 47648


Most of the files indexed have a metadata id whose value is the full path of
the file indexed, such as /DATA_FOLDER/20180321/Report.pdf


But for about 400 of them, the id looks like:
232d7bd6-c586-4726-8d2b-bc9b1febcff4.


So my questions are:
(1) Why are the two numbers different (in the log file vs. in the overview)?
(2) For the ids that are not the full path of a file, how do I know where
they come from (i.e., the original file)?




Thanks for your help!
Nan




PS: a few examples of query result for those strange ids:


{
"bolt-small-online":["Test strip-north"],
"3696714.008":[3702848.584],
"380614.564":[376900.143],
"100.038":[111.074],
"gpo-bolt":["teststrip"],
"id":"232d7bd6-c586-4726-8d2b-bc9b1febcff4",
"_version_":1652839231413813252
}




{
"Date":["8/24/2001"],
"EXT31":[0],
"EXT32":[0.12],
"Aggregate":[0.12],
"Pounds_Vap":[37],
"Gallons_Vap":[5.8],
"Gallons_Liq":[0],
"Gallons_Tot":[5.8],
"Avg_Rate":[1.8],
"Gallons_Rec":[577],
"Water":[577],
"id":"840c05af-caf0-4407-8753-dcc6957abcc5",
"Well_s_":["EXT31;EXT32"],
"Time__hrs_":[3.25],
"_version_":1652898731969740800}]
  }


{
"2":[4],
"SFS1":["PLM1"],
"1.00":[1.0],
"69":[79],
"id":"e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
"_version_":1652825435791163395
}




Move SOLR from cloudera HDFS to SOLR on Docker

2019-12-18 Thread Wael Kader
Hello,

I want to move data from my Solr setup on Cloudera Hadoop to a Docker Solr
container.
I don't need to run all the Hadoop services in my setup, as I am currently
only using Solr from the Cloudera HDP.

My concern now is the best way to move the data and schema to the Docker
container.
I don't mind moving the data to an older Solr container version to match
the Solr 4.10.3 I have on Cloudera.

Much help is appreciated.

-- 
Regards,
Wael


CVE-2017-7525 fix for Solr 7.7.x

2019-12-18 Thread Mehai, Lotfi
Hello;

We are using Solr 7.7.0. CVE-2017-7525 has been fixed for Solr 8.x:
https://issues.apache.org/jira/browse/SOLR-13110

When will the fix be available for Solr 7.7.x?

Lotfi