Solr 7.7 Indexing issue

2020-09-30 Thread Manisha Rahatadkar
Hello all

We are using Apache Solr 7.7 on the Windows platform. The data is synced to Solr 
using Solr.Net commits, in batches. The documents are very large (~0.5 GB on 
average) and Solr indexing is taking a long time. The total document size is 
~200 GB. Because the Solr commit is done as part of an API call, the API calls 
are failing when document indexing has not completed.

  1.  What is your advice on syncing such a large volume of data to Solr?
  2.  Because of the search requirements, almost 8 fields are defined as 
text fields.
  3.  Currently SOLR_JAVA_MEM is set to 2 GB. Is that enough for such a large 
volume of data? ( IF "%SOLR_JAVA_MEM%"=="" set SOLR_JAVA_MEM=-Xms2g -Xmx2g ) 
(See the sketch after this list.)
  4.  How should Solr be set up in production on Windows? Currently it is set 
up as a standalone engine and the client has been asked to back up the drive. 
Is there a better way to do this? How should we set up for disaster recovery?
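For reference, the heap on Windows is normally raised in bin\solr.in.cmd rather 
than on the command line. A minimal sketch, assuming the stock solr.in.cmd that 
ships with Solr 7.7 (the 8 GB figure is purely illustrative, not a sizing 
recommendation; 2 GB is almost certainly too small for ~200 GB of documents):

    REM In <solr install>\bin\solr.in.cmd
    set SOLR_JAVA_MEM=-Xms8g -Xmx8g

    REM Equivalent one-off form when starting Solr by hand:
    REM bin\solr.cmd start -m 8g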

Thanks in advance.

Regards
Manisha Rahatadkar




Re: Solr 8.5.2 indexing issue

2020-07-02 Thread gnandre
It seems that the issue is not with the reference_url field itself. There is
a copyField which has the reference_url field as its source and another
field called url_path as its destination.
This destination field url_path has the following field type definition.

[fieldType definition stripped by the mailing list archive; a sketch follows
the next paragraph]
If I remove SynonymGraphFilterFactory and FlattenGraphFilterFactory from the
above field type definition then it works; otherwise it throws the same
error (IndexOutOfBoundsException).
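The stripped definition is not recoverable from the archive, but a minimal
sketch of a field type that pairs the two filters named above at index time
looks roughly like this (the type name, tokenizer, and synonyms.txt file are
illustrative assumptions, not the poster's actual schema):

    <fieldType name="text_url_path" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- hypothetical synonyms file; expand="true" can emit multi-token synonyms as a graph -->
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <!-- needed after a graph filter at index time; the indexer cannot consume a token graph -->
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>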

On Sun, Jun 28, 2020 at 9:06 AM Erick Erickson 
wrote:

> How are you sending this to Solr? I just tried 8.5, submitting that doc
> through the admin UI and it works fine.
> I defined “asset_id” as the same type as your reference_url field.
>
> And does the log on the Solr node that tries to index this give any more
> info?
>
> Best,
> Erick
>
> > On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> >
> > {
> >"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> >
> >
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}
>
>


Re: Solr 8.5.2 indexing issue

2020-06-28 Thread Erick Erickson
How are you sending this to Solr? I just tried 8.5, submitting that doc through 
the admin UI and it works fine. 
I defined “asset_id” as the same type as your reference_url field.

And does the log on the Solr node that tries to index this give any more info?

Best,
Erick

> On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> 
> {
>"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> 
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}



Solr 8.5.2 indexing issue

2020-06-27 Thread gnandre
Hi,

I have the following document which fails to get indexed.

{
"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",

"reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}

I am not sure what is so special about the content in the reference_url
field.

The reference_url field is defined as follows in the schema:

[field definition stripped by the mailing list archive]

It throws the following error.

Status:

{"data":{"responseHeader":{"status":400,"QTime":18},
  "error":{"metadata":["error-class","org.apache.solr.common.SolrException",
                       "root-error-class","java.lang.IndexOutOfBoundsException"],
           "msg":"Exception writing document id add-ons:576deefef7453a9189aa039b66500eb2 to the index; possible analysis error.",
           "code":400}},
 "status":400,
 "config":{"method":"POST","transformRequest":[null],"transformResponse":[null],
   "jsonpCallbackParam":"callback",
   "headers":{"Content-type":"application/json",
              "Accept":"application/json, text/plain, */*",
              "X-Requested-With":"XMLHttpRequest"},
   "data":"[{\n  \"asset_id\":\"add-ons:576deefef7453a9189aa039b66500eb2\",\n  \"reference_url\":\"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html\"}]",
   "url":"add-ons/update",
   "params":{"wt":"json","_":1593304427428,"commitWithin":1000,"overwrite":true},
   "timeout":1},
 "statusText":"Bad Request","xhrStatus":"complete"}

[The "resource" field of the response, which echoed the POSTed payload
character by character, is omitted here; it carries no additional information.]


Re: Migration: SOLR8-Java8 -> SOLR8-JAVA11 indexing issue.

2019-10-24 Thread anup.junagade
Thanks Shawn for checking.

As advised we will execute the indexing with the new settings as mentioned
and will update the results.

Here are the links to the missing attachments:

Attachment 1: OpenJDK 11 vs OpenJDK 8 key metrics
Attachment 2: OpenJDK 11 vs OpenJDK 8 waiting QTP Threads
Attachment 3: OpenJDK 11 Thread dump

[The attachment links themselves were stripped by the mailing list archive.]





Re: Migration: SOLR8-Java8 -> SOLR8-JAVA11 indexing issue.

2019-10-24 Thread Shawn Heisey

On 10/24/2019 11:50 AM, Junagade, Anup wrote:

   *   Attachment 1: OpenJDK 11 vs OpenJDK 8 key metrics
   *   Attachment 2:  OpenJDK 11 vs OpenJDK 8 waiting QTP Threads
   *   Attachment 3: OpenJDK 11 Thread dump


There are no attachments.  Apache mailing lists swallow almost all 
attachments.  You will need to use a file sharing website to 
successfully get files to us.



Heap allocated: 32 GB


If you set your heap to 31GB, you'll actually have more memory available 
to Java than with a heap size of 32GB.  This is because at 32GB the JVM can 
no longer use compressed object pointers, so full 64-bit pointers are 
required.  Solr has a tendency to create a very large number of small 
objects, so the pointer size increase ends up using a lot of memory.
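A quick way to check whether a given heap size still gets compressed object
pointers is to ask the JVM directly (standard JDK tooling, no Solr involved):

    java -Xms31g -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
    # UseCompressedOops is reported true at 31g, false at 32g and above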




   [quoted solrconfig.xml merge settings; the element names were stripped by
   the archive, leaving only the values]
   100
   150



These numbers are huge.  Without setting maxMergeAtOnceExplicit, you're 
not getting the full benefit of increasing these settings beyond their 
defaults of 10.  Set maxMergeAtOnce and segmentsPerTier to the same 
number and then use three times that number for maxMergeAtOnceExplicit. 
The Explicit setting is not mentioned in the Solr documentation.  Numbers as 
big as you have chosen will result in Solr keeping a LOT of files open, 
because the index will end up with a large number of segments.  The OS 
will definitely need to have its "max open files" limit increased.
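A sketch of that advice in solrconfig.xml, using the TieredMergePolicyFactory
syntax from Solr 8 (the values are illustrative; the advice above prescribes a
ratio, not these exact numbers):

    <indexConfig>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <!-- keep these two equal -->
        <int name="maxMergeAtOnce">30</int>
        <int name="segmentsPerTier">30</int>
        <!-- and set the explicit (forced-merge) width to three times that -->
        <int name="maxMergeAtOnceExplicit">90</int>
      </mergePolicyFactory>
    </indexConfig>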



GC and ZK Settings

-DzkClientTimeout=30
-DzkHost= ,,
-XX:+PrintGCDetails
-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:ConcGCThreads=8
-XX:InitiatingHeapOccupancyPercent=70
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:PermSize=512m
-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xms32g -Xmx32g
-Xss256k
-verbose:gc


It looks like you have used your own GC settings instead of those that 
Solr comes with.  Your settings are missing one of the most important 
parameters for good GC performance.  You should let Solr's start script 
handle GC tuning and GC logging without interference.
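For reference, the stock start script takes its GC flags from the GC_TUNE
variable in solr.in.sh, so "no interference" amounts to leaving that variable
alone and setting only the heap (a sketch assuming the standard solr.in.sh;
the 31g follows the compressed-oops note above):

    # in solr.in.sh -- leave GC_TUNE unset/commented to keep Solr's own GC defaults
    #GC_TUNE="..."
    SOLR_HEAP="31g"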


Thanks,
Shawn


Solr 8.1.1 Indexing issue while migrating Java8 -> Java11

2019-10-24 Thread anup.junagade
 
We are trying to migrate our SOLR 8.1.1 cluster from OpenJDK Java 8 to
OpenJDK Java 11 and are facing issues with indexing. While our indexing
happens flawlessly on Java 8, it crawls, or maybe I should say stalls,
with Java 11.
Any pointers/help is appreciated.
 
*Symptoms*
 
With OpenJDK 11 and SOLR 8.1.1 we see that for the first 30 minutes, response
times for updates are similar to our current implementation (OpenJDK 8 and SOLR
8.1.1). It should be noted that no read queries are being executed at
the time of indexing.
On the OpenJDK 11 implementation, the qtp active threads continuously
increase into the thousands, while on the OpenJDK 8 implementation they stop
after going up to approximately 150.
On the OpenJDK 11 implementation, the number of classes loaded starts at a
very high number and stays there, as opposed to the OpenJDK 8 implementation,
where the number of classes loaded is small to begin with and remains under
control. I believe the qtp threads in the wait state mentioned above are
causing this symptom.
Attachment 1: OpenJDK 11 vs OpenJDK 8 key metrics
  
Attachment 2: OpenJDK 11 vs OpenJDK 8 waiting QTP Threads
  
Attachment 3: OpenJDK 11 Thread dump
  
 
 
*Following are the key configuration of our application.*
 
Index Size: 8 GB/shard
Total no of Documents in Solr cluster: 70 Million
Average Document size: 15 KB
JSON Payload for each update contains: 50 docs
Average Time Taken to post 50 Docs: 300 milliseconds
Average Rate at which documents are being posted to SOLR: 7500 requests per
second
No of shards in the Cluster: 10 (No Replicas)
CPUs: 32
Memory: 128 GB
Heap allocated: 32 GB
SOLR Client: 8.1.1
ZK Ensemble: 3
 

[solrconfig.xml excerpts; element names stripped by the archive, leaving only
the values -- the 100/150 pair is the merge-policy settings identified as
maxMergeAtOnce and segmentsPerTier in Shawn Heisey's reply]
  100
  150

48

18
false

 
GC and ZK Settings
 
-DzkClientTimeout=30
-DzkHost= ,,
-XX:+PrintGCDetails
-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:ConcGCThreads=8
-XX:InitiatingHeapOccupancyPercent=70
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:PermSize=512m
-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xms32g -Xmx32g
-Xss256k
-verbose:gc
 

 





Re: Migration: SOLR8-Java8 -> SOLR8-JAVA11 indexing issue.

2019-10-24 Thread Junagade, Anup
We are trying to migrate our SOLR 8.1.1 cluster from OpenJDK Java 8 to OpenJDK 
Java 11 and are facing issues with indexing. While our indexing happens 
flawlessly on Java 8, it crawls, or maybe I should say stalls, with Java 11.
Any pointers/help is appreciated.

Symptoms


  *   With OpenJDK 11 and SOLR 8.1.1 we see that for the first 30 minutes, 
response times for updates are similar to our current implementation (OpenJDK 8 
and SOLR 8.1.1). It should be noted that no read queries are being executed 
at the time of indexing.
  *   On the OpenJDK 11 implementation, the qtp active threads continuously 
increase into the thousands, while on the OpenJDK 8 implementation they stop 
after going up to approximately 150.
  *   On the OpenJDK 11 implementation, the number of classes loaded starts at a 
very high number and stays there, as opposed to the OpenJDK 8 implementation, 
where the number of classes loaded is small to begin with and remains under 
control. I believe the qtp threads in the wait state mentioned above are causing 
this symptom.
  *   Attachment 1: OpenJDK 11 vs OpenJDK 8 key metrics
  *   Attachment 2:  OpenJDK 11 vs OpenJDK 8 waiting QTP Threads
  *   Attachment 3: OpenJDK 11 Thread dump


Following are the key metrics/configuration of our application.

Index Size: 8 GB/shard
Total no of Documents in Solr cluster: 70 Million
Average Document size: 15 KB
JSON Payload for each update contains: 50 docs
Average Time Taken to post 50 Docs: 300 milliseconds
Average Rate at which documents are being posted to SOLR: 7500 requests per 
second
No of shards in the Cluster: 10 (No Replicas)
CPUs: 32
Memory: 128 GB
Heap allocated: 32 GB
SOLR Client: 8.1.1
ZK Ensemble: 3


[solrconfig.xml excerpts; element names stripped by the archive, leaving only
the values -- the 100/150 pair is the merge-policy settings identified as
maxMergeAtOnce and segmentsPerTier in Shawn Heisey's reply]
  100
  150

48

18
false


GC and ZK Settings

-DzkClientTimeout=30
-DzkHost= ,,
-XX:+PrintGCDetails
-XX:+UseG1GC
-XX:+UseStringDeduplication
-XX:ConcGCThreads=8
-XX:InitiatingHeapOccupancyPercent=70
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:PermSize=512m
-Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
-Xms32g -Xmx32g
-Xss256k
-verbose:gc

Thanks,
Anup



Re: Regarding pdf indexing issue

2018-07-11 Thread Terry Steichen
Walter,

Well said.  (And I love the hamburger conversion analogy - very apt.)

The only thing I will add is that when you have a collection of similar
rich-text documents, you may be able to construct queries that respect
internal structures within the documents.  If all or most of your documents
have a unique line like "subject:", you may be able to be selective.

Also, if your documents are organized on disk in some categorical way,
you can include a reference to that categorical information in your query
(via a pattern on the id field, e.g. id:*pattern*).
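A hypothetical example of that kind of query, assuming ids that embed the
on-disk path and a collection named docs (both names are made up here):

    curl "http://localhost:8983/solr/docs/select?q=subject:budget&fq=id:*reports*"

The fq clause restricts matches to documents whose id contains the category
string, without affecting relevance scoring.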

Finally, there *might* be useful information in the metadata that you
can use in refining your searches.

Terry


On 07/11/2018 11:42 AM, Walter Underwood wrote:
> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a 
> little bit. Extracting paragraphs from that is a difficult pattern recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the 
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between 
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 11, 2018, at 8:07 AM, Erick Erickson  wrote:
>>
>> Solr will not do this automatically, the Extracting Request Handler
>> simply indexes the entire contents of the doc without regard to things
>> like paragraphs etc. Ditto with HTML. This is actually a task that
>> requires getting into Tika and using all the bells and whistles there.
>>
>> I'd recommend two things:
>>
>> 1> Take the PDF parsing offline, i.e. in a separate client. There are
>> many reasons for this, in particular you can attempt to do what you're
>> asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> 2> Talk to the Tika folks about the best ways to make Tika return the
>> information such that you can index them and get what you'd like.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
>>  wrote:
>>> Hello Team,
>>>
>>> I am using Solr for indexing and searching PDF documents.
>>>
>>> I have gone through the documentation on your website and installed Solr,
>>> but I am unable to index and search the documents.
>>>
>>> For example: Suppose we have a PDF file which has a number of paragraphs,
>>> each with a separate heading.
>>>
>>> So if I search for a heading in the indexed PDF, the result should contain
>>> the paragraph to which that heading belongs.
>>>
>>> I am unable to perform this task.
>>>
>>> I have run the command below to upload the PDF:
>>>
>>> *bin/post -c gettingstarted pdf-sample.pdf*
>>>
>>> and for searching I am running the command
>>>
>>> *curl http://localhost:8983/solr/gettingstarted/select?q='*
>>>
>>> Please suggest anything and let me know if I am missing anything
>>>
>>> Thanks,
>>>
>>> Rahul
>



Re: Regarding pdf indexing issue

2018-07-11 Thread Shamik Sinha
You may try the tesseract tool to check data extraction from the PDF or
images, and then go forward accordingly. As far as I understand, the PDF here
is an image and not data. A searchable PDF actually overlays the selectable
text as hidden text over the PDF image. These PDFs can be indexed and their
text extracted. This mostly works for English and other Latin-derived scripts;
you may face problems extracting/indexing text in any other language.
Handwritten text converted to PDF is next to impossible to index/extract.
Apache Tika may be the solution you are looking for.
On Wed 11 Jul, 2018, 9:12 PM Walter Underwood, 
wrote:

> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a
> little bit. Extracting paragraphs from that is a difficult pattern
> recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jul 11, 2018, at 8:07 AM, Erick Erickson 
> wrote:
> >
> > Solr will not do this automatically, the Extracting Request Handler
> > simply indexes the entire contents of the doc without regard to things
> > like paragraphs etc. Ditto with HTML. This is actually a task that
> > requires getting into Tika and using all the bells and whistles there.
> >
> > I'd recommend two things:
> >
> > 1> Take the PDF parsing offline, i.e. in a separate client. There are
> > many reasons for this, in particular you can attempt to do what you're
> > asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > 2> Talk to the Tika folks about the best ways to make Tika return the
> > information such that you can index them and get what you'd like.
> >
> > Best,
> > Erick
> >
> > On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
> >  wrote:
> >> Hello Team,
> >>
> >> I am using Solr for indexing and searching PDF documents.
> >>
> >> I have gone through the documentation on your website and installed Solr,
> >> but I am unable to index and search the documents.
> >>
> >> For example: Suppose we have a PDF file which has a number of paragraphs,
> >> each with a separate heading.
> >>
> >> So if I search for a heading in the indexed PDF, the result should contain
> >> the paragraph to which that heading belongs.
> >>
> >> I am unable to perform this task.
> >>
> >> I have run the command below to upload the PDF:
> >>
> >> *bin/post -c gettingstarted pdf-sample.pdf*
> >>
> >> and for searching I am running the command
> >>
> >> *curl http://localhost:8983/solr/gettingstarted/select?q='*
> >>
> >> Please suggest anything and let me know if I am missing anything
> >>
> >> Thanks,
> >>
> >> Rahul
>
>


Re: Regarding pdf indexing issue

2018-07-11 Thread Walter Underwood
PDF is not a structured document format. It is a printer control format.

PDF does not have a paragraph marker. Instead, it says to move
to this spot on the page, choose this font, and print this letter. For a
paragraph, it moves farther. For the next letter in a word, it moves a 
little bit. Extracting paragraphs from that is a difficult pattern recognition
problem.

I worked with a PDF of a two-column magazine article that printed
the first line of column 1, then the first line of column 2, then the 
second line of column 1, and so on. If a line ended with a hyphenated
word, too bad.

Extracting structure from a PDF document is somewhere between 
very hard and impossible. Someone I worked with said that getting
structured text from PDF was like turning hamburger back into a cow.

Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
is used. It appears to be an accessibility feature, so it still might not
be useful for search.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 11, 2018, at 8:07 AM, Erick Erickson  wrote:
> 
> Solr will not do this automatically, the Extracting Request Handler
> simply indexes the entire contents of the doc without regard to things
> like paragraphs etc. Ditto with HTML. This is actually a task that
> requires getting into Tika and using all the bells and whistles there.
> 
> I'd recommend two things:
> 
> 1> Take the PDF parsing offline, i.e. in a separate client. There are
> many reasons for this, in particular you can attempt to do what you're
> asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> 2> Talk to the Tika folks about the best ways to make Tika return the
> information such that you can index them and get what you'd like.
> 
> Best,
> Erick
> 
> On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
>  wrote:
>> Hello Team,
>> 
>> I am using Solr for indexing and searching PDF documents.
>> 
>> I have gone through the documentation on your website and installed Solr,
>> but I am unable to index and search the documents.
>> 
>> For example: Suppose we have a PDF file which has a number of paragraphs,
>> each with a separate heading.
>> 
>> So if I search for a heading in the indexed PDF, the result should contain
>> the paragraph to which that heading belongs.
>> 
>> I am unable to perform this task.
>> 
>> I have run the command below to upload the PDF:
>> 
>> *bin/post -c gettingstarted pdf-sample.pdf*
>> 
>> and for searching I am running the command
>> 
>> *curl http://localhost:8983/solr/gettingstarted/select?q='*
>> 
>> Please suggest anything and let me know if I am missing anything
>> 
>> Thanks,
>> 
>> Rahul



Re: Regarding pdf indexing issue

2018-07-11 Thread Erick Erickson
Solr will not do this automatically, the Extracting Request Handler
simply indexes the entire contents of the doc without regard to things
like paragraphs etc. Ditto with HTML. This is actually a task that
requires getting into Tika and using all the bells and whistles there.

I'd recommend two things:

1> Take the PDF parsing offline, i.e. in a separate client. There are
many reasons for this, in particular you can attempt to do what you're
asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> Talk to the Tika folks about the best ways to make Tika return the
information such that you can index them and get what you'd like.
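A bare-bones sketch of that offline approach, combining Tika and SolrJ (the
collection and field names are illustrative and must match your schema; error
handling is omitted):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/gettingstarted").build();
             InputStream in = Files.newInputStream(Paths.get("pdf-sample.pdf"))) {
          // Parse the PDF in the client, not inside Solr
          BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
          Metadata metadata = new Metadata();
          new AutoDetectParser().parse(in, handler, metadata);

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "pdf-sample.pdf");
          doc.addField("title", metadata.get("title")); // whatever Tika extracted, may be null
          doc.addField("content", handler.toString());  // the full body text
          solr.add(doc);
          solr.commit();
        }
      }
    }

Splitting the extracted body into per-paragraph documents would happen between
the Tika parse and the SolrJ add, which is exactly the flexibility point 1>
is about.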

Best,
Erick

On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
 wrote:
> Hello Team,
>
> I am using Solr for indexing and searching PDF documents.
>
> I have gone through the documentation on your website and installed Solr,
> but I am unable to index and search the documents.
>
> For example: Suppose we have a PDF file which has a number of paragraphs,
> each with a separate heading.
>
> So if I search for a heading in the indexed PDF, the result should contain
> the paragraph to which that heading belongs.
>
> I am unable to perform this task.
>
> I have run the command below to upload the PDF:
>
> *bin/post -c gettingstarted pdf-sample.pdf*
>
> and for searching I am running the command
>
> *curl http://localhost:8983/solr/gettingstarted/select?q='*
> 
> Please suggest anything and let me know if I am missing anything
>
> Thanks,
>
> Rahul


Regarding pdf indexing issue

2018-07-11 Thread Rahul Prasad Dwivedi
Hello Team,

I am using Solr for indexing and searching PDF documents.

I have gone through the documentation on your website and installed Solr, but
I am unable to index and search the documents.

For example: Suppose we have a PDF file which has a number of paragraphs, each
with a separate heading.

So if I search for a heading in the indexed PDF, the result should contain the
paragraph to which that heading belongs.

I am unable to perform this task.

I have run the command below to upload the PDF:

*bin/post -c gettingstarted pdf-sample.pdf*

and for searching I am running the command

*curl http://localhost:8983/solr/gettingstarted/select?q='*


Re: Indexing issue - index get deleted

2015-06-11 Thread Alessandro Benedetti
Hi Chris,
Amazing analysis!
I actually did not investigate the log, because I was first trying to get
more information from the user:

  We are running full import and delta import crons.

  Full index: once a day

  Delta index: every 10 mins

  Last night my index automatically got deleted (numdocs=0).

  Attaching logs for review.

Reading the user's initial mail more carefully, he does a full import as well
(and at this point, cleans the index).
Not sure there is any practical reason to do that; the user will clarify
that for us.

So after the clean happened, something prevented the full import from
proceeding, and we got the weird behaviour seen in the logs.

Really curious to understand this better :)


2015-06-11 1:36 GMT+01:00 Chris Hostetter hossman_luc...@fucit.org:


 : The guy was using delta import anyway, so maybe the problem is
 : different and not related to the clean.

 that's not what the logs say.

 Here's what i see...

 Log begins with server startup @ Jun 10, 2015 11:14:56 AM

 The DeletionPolicy for the shopclue_prod core is initialized at Jun
 10, 2015 11:15:04 AM and we see a few interesting things here we note
 for the future as we keep reading...

 1) There is currently commits:num=1 commits on disk
 2) the current index dir in use is index.20150311161021822
 3) the current segment / generation are segFN=segments_1a,generation=46

 Immediately after this, we see some searcher warming using a searcher with
 this same segments file, and then this searcher is registered (Jun 10,
 2015 11:15:05 AM) and the core is registered.

 Next we see some replication polling, and we see what look like some
 simple monitoring requests for q=* which return hits=85898 being
 repeated over and over.

 At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that
 look like they are coming from the UI, and then at Jun 10, 2015 11:17:01
 AM we see a request for a full import started.

 We have no idea what the data import configuration file looks like, so we
 have no idea if clean=false is being used or not.  It's certainly not
 specified in the URL.

 We see some more monitoring URLs returning hits=85898 and some more
 /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the
 first commit executed since the server started up.

 There's no indication that this commit came from an external request (eg
 /update) so it probably was made by some internal request.  One
 possibility is that it came from DIH finishing -- but I doubt it, I'm
 fairly sure that would have involved more logging than this.  A more
 probable scenario is that it came from an autoCommit setting -- the fact
 that it is almost exactly 60 seconds after DIH started -- and almost
 exactly 60 seconds after DIH may have done a deleteAll query due to
 clean=true -- makes it seem very likely that this was a 1 minute
 autoCommit.

 (But since we don't have either the data import config or the
 solrconfig.xml, we have no way of knowing -- it's all just guesswork.)

 Very importantly, note that this commit is not opening a new searcher...

 Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start
 commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

 Here are some other interesting things to note from the logging
 that comes from the DeletionPolicy when this commit happens...

 1) it now notes that there are commits:num=2 on disk
 2) the current index dir hasn't changed (index.20150311161021822) so
 some weird replication command didn't swap the world out from under us
 3) the newest segment/generation are segFN=segments_1b,generation=47
 4) the newest commit has no other files in it besides the segments file.

 This means, without a doubt, there are no documents in this commit's view
 of the index.  They have all been deleted by something.


 At this point the *old* searcher (for commit generation 46) is still in
 use however -- nothing has done an openSearcher=true.

 We see more /dataimport status requests, and other requests that appear to
 come from the Solr UI, and more monitoring queries that still return
 hits=85898 because the same searcher is in use.

 At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no
 indication that this came from an outside /update request, so it might be
 from DIH, or it might be from an autoCommit setting.  The fact that it is
 nearly exactly 10 minutes after DIH started (and probably did a clean=true
 deleteAll query) makes it seem extremely likely this is an autoSoftCommit
 setting kicking in.

 Very importantly, note that this softCommit *does* open a new searcher...

 Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start

 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}


 In less than a second, this new searcher is warmed up and the next time we
 see a q=* monitoring query get 

Re: Indexing issue - index get deleted

2015-06-11 Thread Midas A
Thanks for replying.

Please find the data-config below.

[data-config.xml stripped by the mailing list archive]

On Thu, Jun 11, 2015 at 6:06 AM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : The guy was using delta import anyway, so maybe the problem is
 : different and not related to the clean.

 that's not what the logs say.

 Here's what i see...

 Log begins with server startup @ Jun 10, 2015 11:14:56 AM

 The DeletionPolicy for the shopclue_prod core is initialized at Jun
 10, 2015 11:15:04 AM and we see a few interesting things here we note
 for the future as we keep reading...

 1) There is currently commits:num=1 commits on disk
 2) the current index dir in use is index.20150311161021822
 3) the current segment / generation are segFN=segments_1a,generation=46

 Immediately after this, we see some searcher warming using a searcher with
 this same segments file, and then this searcher is registered (Jun 10,
 2015 11:15:05 AM) and the core is registered.

 Next we see some replication polling, and we see what look like some
 simple monitoring requests for q=* which return hits=85898 being
 repeated over and over.

 At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that
 look like they are coming from the UI, and then at Jun 10, 2015 11:17:01
 AM we see a request for a full import started.

 We have no idea what the data import configuration file looks like, so we
 have no idea if clean=false is being used or not.  It's certainly not
 specified in the URL.

 We see some more monitoring URLs returning hits=85898 and some more
 /replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the
 first commit executed since the server started up.

 There's no indication that this commit came from an external request (eg
 /update) so it probably was made by some internal request.  One
 possibility is that it came from DIH finishing -- but I doubt it, I'm
 fairly sure that would have involved more logging than this.  A more
 probable scenario is that it came from an autoCommit setting -- the fact
 that it is almost exactly 60 seconds after DIH started -- and almost
 exactly 60 seconds after DIH may have done a deleteAll query due to
 clean=true -- makes it seem very likely that this was a 1 minute
 autoCommit.

 (But since we don't have either the data import config or the
 solrconfig.xml, we have no way of knowing -- it's all just guesswork.)

 Very importantly, note that this commit is not opening a new searcher...

 Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start
 commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

 Here are some other interesting things to note from the logging
 that comes from the DeletionPolicy when this commit happens...

 1) it now notes that there are commits:num=2 on disk
 2) the current index dir hasn't changed (index.20150311161021822) so
 some weird replication command didn't swap the world out from under us
 3) the newest segment/generation are segFN=segments_1b,generation=47
 4) the newest commit has no other files in it besides the segments file.

 This means, without a doubt, there are no documents in this commit's view
 of the index.  They have all been deleted by something.


 At this point the *old* searcher (for commit generation 46) is still in
 use however -- nothing has done an openSearcher=true.

 We see more /dataimport status requests, and other requests that appear to
 come from the Solr UI, and more monitoring queries that still return
 hits=85898 because the same searcher is in use.

 At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no
 indication that this came from an outside /update request, so it might be
 from DIH, or it might be from an autoCommit setting.  The fact that it is
 nearly exactly 10 minutes after DIH started (and probably did a clean=true
 deleteAll query) makes it seem extremely likely this is an autoSoftCommit
 setting kicking in.

 Very importantly, note that this softCommit *does* open a new searcher...

 Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start

 commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}


 In less than a second, this new searcher is warmed up and the next time we
 see a q=* monitoring query get logged, it returns hits=0.

 Note that at no point in the logs, after the DataImporter is started, do
 we see it log anything other than that it has initiated the request to
 MySQL -- we do see some logs starting ~ Jun 10, 2015 11:41:19 AM
 indicating that someone was using the Web UI to look at the dataimport
 handler's status report.  It would be really nice to know what that person
 saw at that point -- because my guess is DIH was still running and was
 stalled waiting for MySQL, and hadn't even started adding docs to Solr (if
 it had, I'm certain there would have been some log of it).

 So instead, the combination of a 

Re: Indexing issue - index get deleted

2015-06-10 Thread Alessandro Benedetti
Let me answer inline, to get more info:

2015-06-10 10:59 GMT+01:00 Midas A test.mi...@gmail.com:

 Hi Alessandro,

 Please find the answers inline and help me out to figure out this problem.

 1) Solr version : *4.2.1*
 2) Solr architecture :* Master -slave/ Replication with requestHandler*



Where did the issue happen?
Have you read this?
The SQL Entity Processor

The SqlEntityProcessor is the default processor. The associated data source
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-JdbcDataSource
should be a JDBC URL.

The entity attributes specific to this processor are shown in the table
below.

Attribute         | Use
------------------+-------------------------------------------------------------
query             | Required. The SQL query used to select rows.
deltaQuery        | SQL query used if the operation is delta-import. This query
                  | selects the primary keys of the rows which will be part of
                  | the delta-update. The pks will be available to the
                  | deltaImportQuery through the variable
                  | ${dataimporter.delta.column-name}.
parentDeltaQuery  | SQL query used if the operation is delta-import.
deletedPkQuery    | SQL query used if the operation is delta-import.
deltaImportQuery  | SQL query used if the operation is delta-import. If this is
                  | not present, DIH tries to construct the import query (after
                  | identifying the delta) by modifying the 'query' (this is
                  | error prone). There is a namespace
                  | ${dataimporter.delta.column-name} which can be used in this
                  | query. For example:
                  | select * from tbl where id=${dataimporter.delta.id}

This is from the official Solr wiki.
You should make sure you adhere to the proper configuration; a sketch follows.
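For illustration, a minimal data-config.xml entity wired for delta imports
using the attributes from the table above (the table name, columns, and
connection details are hypothetical):

    <dataConfig>
      <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/shop" user="solr" password="***"/>
      <document>
        <entity name="item" pk="id"
                query="select id, name from item"
                deltaQuery="select id from item
                            where last_modified &gt; '${dataimporter.last_index_time}'"
                deltaImportQuery="select id, name from item
                                  where id='${dataimporter.delta.id}'"/>
      </document>
    </dataConfig>

The deltaQuery finds the changed primary keys; the deltaImportQuery then
fetches each changed row by key.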

 3) Kind of data source indexed : *Mysql *

What about your delta query? That one is responsible for the delta
indexing.

 4) What happened to the datasource ? any change in there ? : *No change *

Nothing relevant happened there? Any deletion or weird update to the
database?

 5) Was the index actually deleted ? All docs deleted ? Index file segments
 deleted ? Index corrupted ? : *all docs deleted , segment files  are there.
 index file is also there .*

So a deletion + commit happened, but still no merge purging the deleted
content from the index?


 6) What about system resources ?
 * JVM: 30 GB*
 * RAM: 48 GB*

 *CPU : 8 core*


Heh, I am not interested in your current resources as such; I have no
indication of the size of your data. My question was more about checking
whether the system was healthy from the system-resource point of view.

Cheers


 On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
 benedetti.ale...@gmail.com wrote:

  Let me try to help you. First of all, I would like to encourage people to
  post more information about their scenario than "this is my log, index
  deleted, help me" :)
 
  This kind of Info can be really useful :
 
  1) Solr version
  2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
  Sharding ? Manual Replication ? where the problem happened ? )
  3) Kind of data source indexed
  4) What happened to the datasource ? any change in there ?
  5) Was the index actually deleted ? All docs deleted ? Index file
 segments
  deleted ? Index corrupted ?
  6) What about system resources ?
 
  These questions are only a few examples of what everyone should always post
  along with their mysterious problem!
 
  Hope this helps,
 
  Cheers
 
 
  2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:
 
  
   We are running full import and delta import crons .
  
   Full index once a day
  
   delta index : every 10 mins
  
  
   last night my index automatically deleted(numdocs=0).
  
   attaching logs for review .
  
   please suggest to resolve the issue.
  
  
 
 
  --
  --
 
  Benedetti Alessandro
  Visiting card : http://about.me/alessandro_benedetti
 
  Tyger, tyger burning bright
  In the forests of the night,
  What immortal hand or eye
  Could frame thy fearful symmetry?
 
  William Blake - Songs of Experience -1794 England
 




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Indexing issue - index get deleted

2015-06-10 Thread Alessandro Benedetti
Wow, Upaya, I didn't know that clean defaulted to true for delta import
as well!
I did know it was the default for full import, but I agree with you that
a default of true for delta import is very dangerous!

But assuming the user was using delta import so far, if it cleaned every
time, how was it possible to have a coherent index?

Using a delta import with clean=true should produce an inconsistent index
with only a subset (the latest modified documents) of the entire data set!

Cheers

2015-06-10 11:46 GMT+01:00 Upayavira u...@odoko.co.uk:

 Note the clean= parameter to the DIH. It defaults to true, and it will wipe
 your index before the import runs. Perhaps it succeeded at wiping, but then
 failed to connect to your database. Hence an empty index?

 clean=true is, IMO, a very dangerous default option.

 Upayavira

 On Wed, Jun 10, 2015, at 10:59 AM, Midas A wrote:
  Hi Alessandro,
 
  Please find the answers inline and help me out to figure out this
  problem.
 
  1) Solr version : *4.2.1*
  2) Solr architecture :* Master -slave/ Replication with requestHandler*
 
  3) Kind of data source indexed : *Mysql *
  4) What happened to the datasource ? any change in there ? : *No change *
  5) Was the index actually deleted ? All docs deleted ? Index file
  segments
  deleted ? Index corrupted ? : *all docs deleted , segment files  are
  there.
  index file is also there .*
  6) What about system resources ?
  * JVM: 30 GB*
  * RAM: 48 GB*
 
  *CPU : 8 core*
 
 
  On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
  benedetti.ale...@gmail.com wrote:
 
   Let me try to help you. First of all, I would like to encourage people to
   post more information about their scenario than "this is my log, index
   deleted, help me" :)
  
   This kind of Info can be really useful :
  
   1) Solr version
   2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
   Sharding ? Manual Replication ? where the problem happened ? )
   3) Kind of data source indexed
   4) What happened to the datasource ? any change in there ?
   5) Was the index actually deleted ? All docs deleted ? Index file
 segments
   deleted ? Index corrupted ?
   6) What about system resources ?
  
   These questions are only a few examples of what everyone should always post
   along with their mysterious problem!
  
   Hope this helps,
  
   Cheers
  
  
   2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:
  
   
We are running full import and delta import crons .
   
Full index once a day
   
delta index : every 10 mins
   
   
last night my index automatically deleted(numdocs=0).
   
attaching logs for review .
   
please suggest to resolve the issue.
   
   
  
  
   --
   --
  
   Benedetti Alessandro
   Visiting card : http://about.me/alessandro_benedetti
  
   Tyger, tyger burning bright
   In the forests of the night,
   What immortal hand or eye
   Could frame thy fearful symmetry?
  
   William Blake - Songs of Experience -1794 England
  




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Indexing issue - index get deleted

2015-06-10 Thread Midas A
Hi Alessandro,

Please find the answers inline, and help me figure out this problem.

1) Solr version : *4.2.1*
2) Solr architecture :* Master -slave/ Replication with requestHandler*

3) Kind of data source indexed : *Mysql *
4) What happened to the datasource ? any change in there ? : *No change *
5) Was the index actually deleted ? All docs deleted ? Index file segments
deleted ? Index corrupted ? : *all docs deleted , segment files  are there.
index file is also there .*
6) What about system resources ?
* JVM: 30 GB*
* RAM: 48 GB*

*CPU : 8 core*


On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:

 Let me try to help you. First of all, I would like to encourage people to
 post more information about their scenario than "this is my log, index
 deleted, help me" :)

 This kind of Info can be really useful :

 1) Solr version
 2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
 Sharding ? Manual Replication ? where the problem happened ? )
 3) Kind of data source indexed
 4) What happened to the datasource ? any change in there ?
 5) Was the index actually deleted ? All docs deleted ? Index file segments
 deleted ? Index corrupted ?
 6) What about system resources ?

 These questions are only a few examples of what everyone should always post
 along with their mysterious problem!

 Hope this helps,

 Cheers


 2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:

 
  We are running full import and delta import crons .
 
  Full index once a day
 
  delta index : every 10 mins
 
 
  last night my index automatically deleted(numdocs=0).
 
  attaching logs for review .
 
  please suggest to resolve the issue.
 
 


 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England



Re: Indexing issue - index get deleted

2015-06-10 Thread Upayavira
Note the clean= parameter to the DIH. It defaults to true, and it will wipe
your index before the import runs. Perhaps it succeeded at wiping, but then
failed to connect to your database. Hence an empty index?

clean=true is, IMO, a very dangerous default option.
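For reference, the parameter can be passed explicitly on the import URL; the
core name here is the one that appears in the logs discussed in this thread:

    curl "http://localhost:8983/solr/shopclue_prod/dataimport?command=full-import&clean=false"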

Upayavira

On Wed, Jun 10, 2015, at 10:59 AM, Midas A wrote:
 Hi Alessandro,
 
 Please find the answers inline and help me out to figure out this
 problem.
 
 1) Solr version : *4.2.1*
 2) Solr architecture :* Master -slave/ Replication with requestHandler*
 
 3) Kind of data source indexed : *Mysql *
 4) What happened to the datasource ? any change in there ? : *No change *
 5) Was the index actually deleted ? All docs deleted ? Index file
 segments
 deleted ? Index corrupted ? : *all docs deleted , segment files  are
 there.
 index file is also there .*
 6) What about system resources ?
 * JVM: 30 GB*
 * RAM: 48 GB*
 
 *CPU : 8 core*
 
 
 On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
 benedetti.ale...@gmail.com wrote:
 
  Let me try to help you. First of all, I would like to encourage people to
  post more information about their scenario than "this is my log, index
  deleted, help me" :)
 
  This kind of Info can be really useful :
 
  1) Solr version
  2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
  Sharding ? Manual Replication ? where the problem happened ? )
  3) Kind of data source indexed
  4) What happened to the datasource ? any change in there ?
  5) Was the index actually deleted ? All docs deleted ? Index file segments
  deleted ? Index corrupted ?
  6) What about system resources ?
 
  These questions are only a few examples of what everyone should always post
  along with their mysterious problem!
 
  Hope this helps,
 
  Cheers
 
 
  2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:
 
  
   We are running full import and delta import crons .
  
    Full index once a day
  
   delta index : every 10 mins
  
  
   last night my index automatically deleted(numdocs=0).
  
   attaching logs for review .
  
   please suggest to resolve the issue.
  
  
 
 
  --
  --
 
  Benedetti Alessandro
  Visiting card : http://about.me/alessandro_benedetti
 
  Tyger, tyger burning bright
  In the forests of the night,
  What immortal hand or eye
  Could frame thy fearful symmetry?
 
  William Blake - Songs of Experience -1794 England
 


Re: Indexing issue - index get deleted

2015-06-10 Thread Upayavira
I was only speaking about full import regarding the default of
clean=true. However, looking at the source code, it doesn't seem to
differentiate between a full and a delta import in relation to the
default of clean=true, which would be pretty crappy. However, I'd need
to try it.

Upayavira

On Wed, Jun 10, 2015, at 11:57 AM, Alessandro Benedetti wrote:
 Wow, Upaya, I didn't know that clean defaulted to true for delta import
 as well!
 I did know it was the default for full import, but I agree with you that
 a default of true for delta import is very dangerous!
 
 But assuming the user was using delta import so far, if it cleaned every
 time, how was it possible to have a coherent index?
 
 Using a delta import with clean=true should produce an inconsistent index
 with only a subset (the latest modified documents) of the entire data set!
 
 Cheers
 
 2015-06-10 11:46 GMT+01:00 Upayavira u...@odoko.co.uk:
 
  Note the clean= parameter to the DIH. It defaults to true, and it will wipe
  your index before the import runs. Perhaps it succeeded at wiping, but then
  failed to connect to your database. Hence an empty index?
 
  clean=true is, IMO, a very dangerous default option.
 
  Upayavira
 
  On Wed, Jun 10, 2015, at 10:59 AM, Midas A wrote:
   Hi Alessandro,
  
   Please find the answers inline and help me out to figure out this
   problem.
  
   1) Solr version : *4.2.1*
   2) Solr architecture :* Master -slave/ Replication with requestHandler*
  
   3) Kind of data source indexed : *Mysql *
   4) What happened to the datasource ? any change in there ? : *No change *
   5) Was the index actually deleted ? All docs deleted ? Index file
   segments
   deleted ? Index corrupted ? : *all docs deleted , segment files  are
   there.
   index file is also there .*
   6) What about system resources ?
   * JVM: 30 GB*
   * RAM: 48 GB*
  
   *CPU : 8 core*
  
  
   On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
   benedetti.ale...@gmail.com wrote:
  
Let me try to help you. First of all, I would like to encourage people to
post more information about their scenario than "this is my log, index
deleted, help me" :)
   
This kind of Info can be really useful :
   
1) Solr version
2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
Sharding ? Manual Replication ? where the problem happened ? )
3) Kind of data source indexed
4) What happened to the datasource ? any change in there ?
5) Was the index actually deleted ? All docs deleted ? Index file
  segments
deleted ? Index corrupted ?
6) What about system resources ?
   
These questions are only a few examples of what everyone should always post
along with their mysterious problem!
   
Hope this helps,
   
Cheers
   
   
2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:
   

 We are running full import and delta import crons .

 Full index once a day

 delta index : every 10 mins


 last night my index automatically deleted(numdocs=0).

 attaching logs for review .

 please suggest to resolve the issue.


   
   
--
--
   
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti
   
Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?
   
William Blake - Songs of Experience -1794 England
   
 
 
 
 
 -- 
 --
 
 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti
 
 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?
 
 William Blake - Songs of Experience -1794 England


Re: Indexing issue - index get deleted

2015-06-10 Thread Alessandro Benedetti
Let me try to help you. First of all, I would like to encourage people to
post more information about their scenario than "this is my log, index
deleted, help me" :)

This kind of Info can be really useful :

1) Solr version
2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ? Manual
Sharding ? Manual Replication ? where the problem happened ? )
3) Kind of data source indexed
4) What happened to the datasource ? any change in there ?
5) Was the index actually deleted ? All docs deleted ? Index file segments
deleted ? Index corrupted ?
6) What about system resources ?

These questions are only a few examples of what everyone should always post
along with their mysterious problem!

Hope this helps,

Cheers


2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:


 We are running full import and delta import crons .

 Full index once a day

 delta index : every 10 mins


 last night my index automatically deleted(numdocs=0).

 attaching logs for review .

 please suggest to resolve the issue.




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Indexing issue - index get deleted

2015-06-10 Thread Alessandro Benedetti
Just taking a look at the code:


if (requestParams.containsKey("clean")) {
  clean = StrUtils.parseBool((String) requestParams.get("clean"), true);
} else if (DataImporter.DELTA_IMPORT_CMD.equals(command) ||
    DataImporter.IMPORT_CMD.equals(command)) {
  clean = false;
} else {
  clean = debug ? false : true;
}


Which makes sense, as I would be surprised to see a delta import with
cleaning by default.

The guy was using delta import anyway, so maybe the problem is
different and not related to the clean.

But he definitely needs to give us more information.

Cheers


2015-06-10 12:11 GMT+01:00 Upayavira u...@odoko.co.uk:

 I was only speaking about full import regarding the default of
 clean=true. However, looking at the source code, it doesn't seem to
 differentiate between a full and a delta import in relation to the
 default of clean=true, which would be pretty crappy. However, I'd need
 to try it.

 Upayavira

 On Wed, Jun 10, 2015, at 11:57 AM, Alessandro Benedetti wrote:
  Wow, Upaya, I didn't know that clean defaulted to true for delta import
  as well!
  I did know it was the default for full import, but I agree with you that
  a default of true for delta import is very dangerous!
 
  But assuming the user was using delta import so far, if it cleaned every
  time, how was it possible to have a coherent index?
 
  Using a delta import with clean=true should produce an inconsistent index
  with only a subset (the latest modified documents) of the entire data set!
 
  Cheers
 
  2015-06-10 11:46 GMT+01:00 Upayavira u...@odoko.co.uk:
 
   Note the clean= parameter to the DIH. It defaults to true, and it will wipe
   your index before the import runs. Perhaps it succeeded at wiping, but
   failed to connect to your database. Hence an empty index?
  
   clean=true is, IMO, a very dangerous default option.
  
   Upayavira
  
   On Wed, Jun 10, 2015, at 10:59 AM, Midas A wrote:
Hi Alessandro,
   
Please find the answers inline and help me out to figure out this
problem.
   
1) Solr version : *4.2.1*
2) Solr architecture :* Master -slave/ Replication with
 requestHandler*
   
3) Kind of data source indexed : *Mysql *
4) What happened to the datasource ? any change in there ? : *No
 change *
5) Was the index actually deleted ? All docs deleted ? Index file
segments
deleted ? Index corrupted ? : *all docs deleted , segment files  are
there.
index file is also there .*
6) What about system resources ?
* JVM: 30 GB*
* RAM: 48 GB*
   
*CPU : 8 core*
   
   
On Wed, Jun 10, 2015 at 2:13 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:
   
 Let me try to help you. First of all, I would like to encourage people to
 post more information about their scenario than "this is my log, index
 deleted, help me" :)

 This kind of Info can be really useful :

 1) Solr version
 2) Solr architecture ( Solr Cloud ? Solr Cloud configuration ?
 Manual
 Sharding ? Manual Replication ? where the problem happened ? )
 3) Kind of data source indexed
 4) What happened to the datasource ? any change in there ?
 5) Was the index actually deleted ? All docs deleted ? Index file
   segments
 deleted ? Index corrupted ?
 6) What about system resources ?

 These questions are only a few examples of what everyone should always post
 along with their mysterious problem!

 Hope this helps,

 Cheers


 2015-06-10 9:15 GMT+01:00 Midas A test.mi...@gmail.com:

 
  We are running full import and delta import crons .
 
  Full index once a day
 
  delta index : every 10 mins
 
 
  last night my index automatically deleted(numdocs=0).
 
  attaching logs for review .
 
  please suggest to resolve the issue.
 
 


 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England

  
 
 
 
  --
  --
 
  Benedetti Alessandro
  Visiting card : http://about.me/alessandro_benedetti
 
  Tyger, tyger burning bright
  In the forests of the night,
  What immortal hand or eye
  Could frame thy fearful symmetry?
 
  William Blake - Songs of Experience -1794 England




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Indexing issue - index get deleted

2015-06-10 Thread Chris Hostetter

: The guy was using delta import anyway, so maybe the problem is
: different and not related to the clean.

that's not what the logs say.

Here's what i see...

Log begins with server startup @ Jun 10, 2015 11:14:56 AM

The DeletionPolicy for the shopclue_prod core is initialized at Jun 
10, 2015 11:15:04 AM and we see a few interesting things here we note 
for the future as we keep reading...

1) There is currently commits:num=1 commits on disk
2) the current index dir in use is index.20150311161021822
3) the current segment  generation are segFN=segments_1a,generation=46

Immediately after this, we see some searcher warming using a searcher with 
this same segments file, and then this searcher is registered (Jun 10, 
2015 11:15:05 AM) and the core is registered.

Next we see some replication polling, and we see what look like some 
simple monitoring requests for q=* which return hits=85898 being 
repeated over and over.

At Jun 10, 2015 11:16:30 AM we see some requests for /dataimport that 
look like they are coming from the UI. and then at Jun 10, 2015 11:17:01 
AM we see a request for a full import started.

We have no idea what the data import configuration file looks like, so we
have no idea if clean=false is being used or not.  it's certainly not
specified in the URL.

We see some more monitoring URLs returning hits=85898 and some more
/replication status calls, and then @ Jun 10, 2015 11:18:02 AM we see the
first commit executed since the server started up.

there's no indication that this commit came from an external request (eg
/update) so it was probably made by some internal request.  One
possibility is that it came from DIH finishing -- but i doubt it, i'm
fairly sure that would have involved more logging than this.  A more
probable scenario is that it came from an autoCommit setting -- the fact
that it is almost exactly 60 seconds after DIH started -- and almost
exactly 60 seconds after DIH may have done a deleteAll query due to
clean=true -- makes it seem very likely that this was a 1 minute
autoCommit.

(but since we don't have either the data import config, or the
solrconfig.xml, we have no way of knowing -- it's all just guess work.)

Very importantly, note that this commit is not opening a new searcher...

Jun 10, 2015 11:18:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start 
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

Here are some other interesting things to note from the logging 
that comes from the DeletionPolicy when this commit happens...

1) it now notes that there are commits:num=2 on disk
2) the current index dir hasn't changed (index.20150311161021822) so 
some weird replication command didn't swap the world out from under us
3) the newest segment/generation are segFN=segments_1b,generation=47
4) the newest commit has no other files in it besides the segments file.

this means, without a doubt, there are no documents in this commit's view
of the index.  they have all been deleted by something.


At this point the *old* searcher (for commit generation 46) is still in 
use however -- nothing has done an openSearcher=true.

we see more /dataimport status requests, and other requests that appear to 
come from the Solr UI, and more monitoring queries that still return 
hits=85898 because the same searcher is in use.

At Jun 10, 2015 11:27:04 AM we see another commit happen -- again, no 
indication that this came from an outside /update request, so it might be 
from DIH, or it might be from an autoCommit setting.  the fact that it is 
nearly exactly 10 minutes after DIH started (and probably did a clean=true 
deleteAll query) makes it seem extremely likely this is an autoSoftCommit 
setting kicking in.
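
If those guesses are right, the relevant solrconfig.xml block would look
something like this sketch (the values are inferred from the log timings;
the actual config was never posted):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60 sec -->
    <openSearcher>false</openSearcher>  <!-- flush, but keep the old searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- soft commit (new searcher) every 10 min -->
  </autoSoftCommit>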

Very importantly, note that this softCommit *does* open a new searcher...

Jun 10, 2015 11:27:04 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start 
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}


In less than a second, this new searcher is warmed up and the next time we
see a q=* monitoring query get logged, it returns hits=0.

Note that at no point in the logs, after the DataImporter is started, do
we see it log anything other than that it has initiated the request to
MySQL -- we do see some logs starting ~ Jun 10, 2015 11:41:19 AM
indicating that someone was using the Web UI to look at the dataimport
handler's status report.  it would be really nice to know what that person
saw at that point -- because my guess is DIH was still running and was
stalled waiting for MySQL, and hadn't even started adding docs to Solr (if
it had, i'm certain there would have been some log of it).

So instead, the combination of a (probable) DIH clean=true option and a
(near certainty) autoCommit=60sec and autoSoftCommit=10min meant that a new
commit was created after the clean, and that commit was

Re: indexing issue

2015-06-04 Thread Midas A
Sorry Shawn,

a) Total docs solr is handling is 3 million.
b) index size is only 5 GB



On Thu, Jun 4, 2015 at 9:35 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 6/4/2015 7:38 AM, Midas A wrote:
  On Thu, Jun 4, 2015 at 6:48 PM, Shawn Heisey apa...@elyograg.org
 wrote:
 
  On 6/4/2015 5:15 AM, Midas A wrote:
  I have an indexing issue. While indexing, IOwait is high on the solr
  server and the load is also high.
  My first suspect here is that you don't have enough RAM for your index
  size.
 
  * How many total docs is Solr handling (all cores)?
 
    --30,00000 docs
 
  * What is the total size on disk of all your cores?
 
   --  600 GB
 
  * How much RAM does the machine have?
 
   --48 GB
 
  * What is the java max heap?
  --30 GB(jvm)

 Is that 3 million docs or 30 million docs?  The actual numbers are 3
 million, but you put a single comma in the number after the 30, so I am
 not sure which you meant.  Either way, those documents must be quite
 large, to make a 600GB index.  30 million docs in my index would only be
 about 30GB.

 With 48 GB of RAM, 30 GB allocated to Solr, and a 600GB index, you don't
 have anywhere even close to enough RAM to cache your index effectively.
 There's only 18GB of RAM left over for the OS disk cache.  That's only 3
 percent of the index data that can fit in the OS disk cache.  I would
 imagine that you're going to need to be able to fit somewhere between 25
 and 50 percent of the index into RAM, which would mean that you're going
 to want around 256GB of RAM for that index. 128GB *might* be enough.
 Alternatively, you could work on making your index smaller -- but be
 aware that to improve performance with low memory, you need to reduce
 the *indexed* part, the *stored* part makes little difference.

 Another potential problem with a 30GB heap is related to garbage
 collection tuning.  If you haven't tuned your GC at all, then
 performance will be terrible on a heap that large, especially when you
 are indexing.  The wiki page I linked on my previous reply contains a
 link to my personal page, which covers GC tuning:

 https://wiki.apache.org/solr/ShawnHeisey

 Thanks,
 Shawn




Re: indexing issue

2015-06-04 Thread Shawn Heisey
On 6/4/2015 11:12 AM, Midas A wrote:
 sorry Shawn ,

 a) Total docs solr is handling is 3 million .
 b) index size is only 5 GB

If your total index size is only 5GB, then there should be no need for a
30GB heap.  For that much index, I'd start with 4GB, and implement GC
tuning.
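
For example, a starting point might look like this (illustrative flags
only, not a tested recommendation -- tune for your own hardware):

  java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC \
       -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
       -jar start.jar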

A high iowait doesn't make any sense for that situation, but it WOULD
make sense with 600 GB of total index.

Thanks,
Shawn



Re: indexing issue

2015-06-04 Thread Midas A
Shawn,

Please find the log; give me some sense of what is happening.

On Thu, Jun 4, 2015 at 10:56 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 6/4/2015 11:12 AM, Midas A wrote:
  sorry Shawn ,
 
  a) Total docs solr is handling is 3 million .
  b) index size is only 5 GB

 If your total index size is only 5GB, then there should be no need for a
 30GB heap.  For that much index, I'd start with 4GB, and implement GC
 tuning.

 A high iowait doesn't make any sense for that situation, but it WOULD
 make sense with 600 GB of total index.

 Thanks,
 Shawn


2015-06-04 18:44:56
Full thread dump OpenJDK 64-Bit Server VM (24.45-b08 mixed mode):

qtp1122335225-81 prio=10 tid=0x2ab280f92800 nid=0x44e4 waiting on condition [0x40293000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  0x2aaab8aa0c00 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
	at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
	at java.lang.Thread.run(Thread.java:744)

qtp1122335225-80 prio=10 tid=0x2ab280f8e800 nid=0x44e3 runnable [0x43151000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:375)
	at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
	at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1035)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:744)

Attach Listener daemon prio=10 tid=0x139c7800 nid=0x44e2 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

qtp1122335225-77 prio=10 tid=0x2ab280224000 nid=0x3196 runnable [0x41eac000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:375)
	at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
	at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1035)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:744)

qtp1122335225-76 prio=10 tid=0x2ab280f7f000 nid=0x3195 runnable [0x40691000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:375)
	at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
	at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1035)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at 

Re: indexing issue

2015-06-04 Thread Midas A
we are indexing around 50000 docs per 10 min.

On Thu, Jun 4, 2015 at 11:02 PM, Midas A test.mi...@gmail.com wrote:

 Shawn,

 Please find the log; give me some sense of what is happening.

 On Thu, Jun 4, 2015 at 10:56 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 6/4/2015 11:12 AM, Midas A wrote:
  sorry Shawn ,
 
  a) Total docs solr is handling is 3 million .
  b) index size is only 5 GB

 If your total index size is only 5GB, then there should be no need for a
 30GB heap.  For that much index, I'd start with 4GB, and implement GC
 tuning.

 A high iowait doesn't make any sense for that situation, but it WOULD
 make sense with 600 GB of total index.

 Thanks,
 Shawn





Re: indexing issue

2015-06-04 Thread Shawn Heisey
On 6/4/2015 7:38 AM, Midas A wrote:
 On Thu, Jun 4, 2015 at 6:48 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 6/4/2015 5:15 AM, Midas A wrote:
 I have an indexing issue. While indexing, IOwait is high on the solr server
 and the load is also high.
 My first suspect here is that you don't have enough RAM for your index
 size.

 * How many total docs is Solr handling (all cores)?

   --30,00000 docs

 * What is the total size on disk of all your cores?

  --  600 GB

 * How much RAM does the machine have?

  --48 GB

 * What is the java max heap?
 --30 GB(jvm)

Is that 3 million docs or 30 million docs?  The actual numbers are 3
million, but you put a single comma in the number after the 30, so I am
not sure which you meant.  Either way, those documents must be quite
large, to make a 600GB index.  30 million docs in my index would only be
about 30GB.

With 48 GB of RAM, 30 GB allocated to Solr, and a 600GB index, you don't
have anywhere even close to enough RAM to cache your index effectively. 
There's only 18GB of RAM left over for the OS disk cache.  That's only 3
percent of the index data that can fit in the OS disk cache.  I would
imagine that you're going to need to be able to fit somewhere between 25
and 50 percent of the index into RAM, which would mean that you're going
to want around 256GB of RAM for that index. 128GB *might* be enough. 
Alternatively, you could work on making your index smaller -- but be
aware that to improve performance with low memory, you need to reduce
the *indexed* part, the *stored* part makes little difference.
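
As a concrete example (a hypothetical field, not from your schema), a big
body field that is only ever displayed and never searched can stop
contributing to the searchable index like this:

  <field name="body" type="text_general" indexed="false" stored="true"/>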

Another potential problem with a 30GB heap is related to garbage
collection tuning.  If you haven't tuned your GC at all, then
performance will be terrible on a heap that large, especially when you
are indexing.  The wiki page I linked on my previous reply contains a
link to my personal page, which covers GC tuning:

https://wiki.apache.org/solr/ShawnHeisey

Thanks,
Shawn



indexing issue

2015-06-04 Thread Midas A
I have an indexing issue. While indexing, IOwait is high on the solr server
and the load is also high.


Re: indexing issue

2015-06-04 Thread Toke Eskildsen
On Thu, 2015-06-04 at 16:45 +0530, Midas A wrote:
 I have an indexing issue. While indexing, IOwait is high on the solr server
 and the load is also high.

Might be because you commit too frequently. How often do you do that?

- Toke Eskildsen, State and University Library, Denmark




Re: indexing issue

2015-06-04 Thread Alessandro Benedetti
I think this mail is really poor in terms of details.
Which version of Solr are you using?
Architecture?
Load expected?
Indexing approach?
When does your problem happen?

The more detail we give, the easier it will be to provide help.

Cheers

2015-06-04 12:19 GMT+01:00 Toke Eskildsen t...@statsbiblioteket.dk:

 On Thu, 2015-06-04 at 16:45 +0530, Midas A wrote:
   I have an indexing issue. While indexing, IOwait is high on the solr server
   and the load is also high.

 Might be because you commit too frequently. How often do you do that?

 - Toke Eskildsen, State and University Library, Denmark





-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: indexing issue

2015-06-04 Thread Midas A
Thanks for replying. Below is the commit frequency:

<autoCommit>
  <maxTime>60000</maxTime> <!-- currently 1 min, old value is 15000 -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime> <!-- currently 10 min, old value is 500 -->
</autoSoftCommit>


On Thu, Jun 4, 2015 at 4:49 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 On Thu, 2015-06-04 at 16:45 +0530, Midas A wrote:
   I have an indexing issue. While indexing, IOwait is high on the solr server
   and the load is also high.

 Might be because you commit too frequently. How often do you do that?

 - Toke Eskildsen, State and University Library, Denmark





Re: indexing issue

2015-06-04 Thread Midas A
Thanks Alessandro,

Please find the info inline.

Which version of Solr are you using: 4.2.1

   - Architecture: Master-slave

Load expected: currently it is 7-15, should be below 1
Indexing approach: Using DIH
When does your problem happen: we run a delta import every 10 mins and a full
index once a day .. sometimes the load goes to 7-15


On Thu, Jun 4, 2015 at 4:52 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:

  I think this mail is really poor in terms of details.
  Which version of Solr are you using?
  Architecture?
  Load expected?
  Indexing approach?
  When does your problem happen?

  The more detail we give, the easier it will be to provide help.

 Cheers

 2015-06-04 12:19 GMT+01:00 Toke Eskildsen t...@statsbiblioteket.dk:

  On Thu, 2015-06-04 at 16:45 +0530, Midas A wrote:
    I have an indexing issue. While indexing, IOwait is high on the solr
    server and the load is also high.
 
  Might be because you commit too frequently. How often do you do that?
 
  - Toke Eskildsen, State and University Library, Denmark
 
 
 


 --
 --

 Benedetti Alessandro
 Visiting card : http://about.me/alessandro_benedetti

 Tyger, tyger burning bright
 In the forests of the night,
 What immortal hand or eye
 Could frame thy fearful symmetry?

 William Blake - Songs of Experience -1794 England



Re: indexing issue

2015-06-04 Thread Alessandro Benedetti
Honestly, your auto-commit configuration seems not alarming at all!
Can you give me more details regarding:

Load expected: currently it is 7-15, should be below 1

What does this mean? Without a unit of measure I find it hard to understand
plain numbers :)
I was expecting the number of documents per unit of time you index, and an
average size of these docs.
Which kind of DIH processor? Where is your data coming from? A database?

Let's try to improve the understanding of the situation and then evaluate
an approach.

Cheers


Re: indexing issue

2015-06-04 Thread Midas A
Hi Alessandro,



On Thu, Jun 4, 2015 at 5:19 PM, Alessandro Benedetti 
benedetti.ale...@gmail.com wrote:

 Honestly your auto-commit configuration seems not alarming at all!
 Can you give me more details regarding :

  Load expected: currently it is 7-15, should be below 1
  *[Abhishek] :  solr server load average.*
  What does this mean? Without a unit of measure I find it hard to understand
  plain numbers :)



  I was expecting the number of documents per unit of time you index, and an
  average size of these docs.

*   [Abhishek] :  avg size of doc : 250 kb *
<autoCommit>
  <maxTime>60000</maxTime> <!-- currently 1 min, old value is 15000 -->
  <openSearcher>false</openSearcher>
</autoCommit>
We have not specified a max docs limit.

Which kind of DIH processor ? Where is your data coming from ? A database ?
 *  [Abhishek] :  Using a MySQL database and the built-in Solr DIH (Data
 Import Handler)*
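
For reference, a minimal data-config.xml of that shape would look like the
sketch below -- the table and column names here are made up, since the real
config was never posted:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"
              user="..." password="..."/>
  <document>
    <entity name="item"
            query="SELECT id, title FROM items"
            deltaQuery="SELECT id FROM items WHERE last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title FROM items WHERE id='${dih.delta.id}'"/>
  </document>
</dataConfig>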



 Let's try to improve the understanding of the situation and then evaluate
 an approach.

 Cheers




Re: indexing issue

2015-06-04 Thread Shawn Heisey
On 6/4/2015 5:15 AM, Midas A wrote:
 I have an indexing issue. While indexing, IOwait is high on the solr server
 and the load is also high.

My first suspect here is that you don't have enough RAM for your index size.

* How many total docs is Solr handling (all cores)?
* What is the total size on disk of all your cores?
* How much RAM does the machine have?
* What is the java max heap?

Here is some additional information on memory requirements for Solr:

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

When Alessandro asked about the load on Solr, the hope was to find out
your *rate* of indexing and querying, not the load average from the
operating system.  Indexing requires a fair amount of heap memory and
CPU resources.  If your heap is too small, then Java might have to work
extremely hard to free up memory for normal operation.

Thanks,
Shawn



Re: indexing issue

2015-06-04 Thread Midas A
Hi Shawn,

Please find comments inline.

On Thu, Jun 4, 2015 at 6:48 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 6/4/2015 5:15 AM, Midas A wrote:
   I have an indexing issue. While indexing, IOwait is high on the solr server
   and the load is also high.

 My first suspect here is that you don't have enough RAM for your index
 size.

 * How many total docs is Solr handling (all cores)?

  --30,00000 docs

 * What is the total size on disk of all your cores?

 --  600 GB

 * How much RAM does the machine have?

 --48 GB

 * What is the java max heap?
 --30 GB(jvm)
 Here is some additional information on memory requirements for Solr:

 https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

 When Alessandro asked about the load on Solr, the hope was to find out
 your *rate* of indexing and querying, not the load average from the
 operating system.  Indexing requires a fair amount of heap memory and
 CPU resources.  If your heap is too small, then Java might have to work
 extremely hard to free up memory for normal operation.

 Thanks,
 Shawn




solr parallel update and total indexing Issue

2014-04-23 Thread ~$alpha`
There is a big issue in Solr parallel update and total indexing.

Total import syntax (working):
dataimport?command=full-import&commit=true&optimize=true

Update syntax (working):
curl 'solr/update?softCommit=true' -H 'Content-type:application/json' -d
'[{"id":1870719,"column":{"set":11}}]'


Issue: If both are run in parallel, then a commit in between takes place.

Example: I have 10k docs in total in the index. I fire a Solr update to
change 1000 records, and in between I fire a total import (full indexer).
What's happening is that a commit takes place in between, i.e. until the
total indexer finishes I get only limited records (1000).

How to solve this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4132652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Full Indexing issue (solr-user@lucene.apache.org)

2014-04-21 Thread Candygram Mongo (Google Drive)

I've shared an item with you:

Solr Full Indexing issue
https://drive.google.com/folderview?id=0B7UpFqsS5lSjWEhxRE1NN2tMNTQ&usp=sharing&invite=CJXE8q4O

It's not an attachment -- it's stored online. To open this item, just click  
the link above.




solr parallel update and total indexing Issue

2014-04-18 Thread ~$alpha`
There is a big issue in Solr parallel update and total indexing.

Total import syntax (working):
dataimport?command=full-import&commit=true&optimize=true

Update syntax (working):
curl 'solr/update?softCommit=true' -H 'Content-type:application/json' -d
'[{"id":1870719,"column":{"set":11}}]'


Issue: If both are run in parallel, then a commit in between takes place.

Example: I have 10k docs in total in the index. I fire a Solr update to
change 1000 records, and in between I fire a total import (full indexer).
What's happening is that a commit takes place in between, i.e. until the
total indexer finishes I get only limited records (1000).

How to solve this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr parallel update and total indexing Issue

2014-04-18 Thread Erick Erickson
try not setting softCommit=true, that's going to take the current
state of your index and make it visible. If your DIH process has
deleted all your records, then that's the current state.

Personally I wouldn't try to mix-n-match like this, the results will
take forever to get right. If you absolutely must do something like
this, I'd use collection aliasing to rebuild my index in a different
collection then switch from the old to new one in a controlled
fashion.
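
A sketch of that flow with the Collections API (the collection and alias
names here are invented for illustration):

  # build the new index into a fresh collection, then atomically repoint the alias
  http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=live&collections=products_v2

Queries against the "live" alias then switch to products_v2 with no window
where the index looks empty.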

Best,
Erick

On Thu, Apr 17, 2014 at 11:37 PM, ~$alpha` lavesh.ra...@gmail.com wrote:
 There is a big issue in Solr parallel update and total indexing.

 Total import syntax (working):
 dataimport?command=full-import&commit=true&optimize=true

 Update syntax (working):
 curl 'solr/update?softCommit=true' -H 'Content-type:application/json' -d
 '[{"id":1870719,"column":{"set":11}}]'


 Issue: If both are run in parallel, then a commit in between takes place.

 Example: I have 10k docs in total in the index. I fire a Solr update to
 change 1000 records, and in between I fire a total import (full indexer).
 What's happening is that a commit takes place in between, i.e. until the
 total indexer finishes I get only limited records (1000).

 How to solve this?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-parallel-update-and-total-indexing-Issue-tp4131935.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing issue

2012-09-23 Thread Erick Erickson
That's exactly how I would expect WordDelimiterFilterFactory to
split up that input.

You really need to look at the analysis chain to understand what
happens here; simply saying "the field is text" isn't enough. What I'm
looking for is the fieldType... definition.

In Solr 3.6, for example, there's no fieldType name="text" defined,
so I have no idea what analysis chain your field is using. Unfortunately
the screenshot you provided doesn't show that info; it looks like it was cropped.
The nearest I can get is that there is a _field_ named text, but its type
is text_general, and that doesn't split up the token as you've shown, so you
must be using some other version than 3.6 or you've customized it.

BTW, clicking the verbose checkbox on the analysis page will give you the
name of the filters along with each transformation...

Best
Erick

On Fri, Sep 21, 2012 at 4:36 AM, zainu zainu...@gmail.com wrote:
 Thank you very much guys for your help.
 @Erick
 The fieldType is Text, and from analysis the following is the result.
 http://lucene.472066.n3.nabble.com/file/n4009372/Unbenannt.png

 From the image, you can see it's not tokenizing every possible segment of
 '8E0061123-8E1' but just some of them.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/indexing-issue-tp4009122p4009372.html
 Sent from the Solr - User mailing list archive at Nabble.com.


indexing issue

2012-09-20 Thread zainu
Dear fellows,
I have a field in solr with the value '8E0061123-8E1'. Now when I search '8E*',
it does return all values starting with '8E', which is totally right, but it
returns nothing when I search '8E0*'. I guess it is not indexing '8E0' or so.
I want to search with all combinations like '8E', '8E0', '8E00', '8E006',
etc. But currently it returns results only when I type '8E' or the complete
'8E0061123-8E1'... any idea??



--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-issue-tp4009122.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing issue

2012-09-20 Thread Erick Erickson
Not enough info to go on here, what is your fieldType?

But the first place to look is admin/analysis to see how the
text is tokenized.

Best
Erick

On Thu, Sep 20, 2012 at 5:49 AM, zainu zainu...@gmail.com wrote:
 Dear fellows,
 I have a field in solr with the value '8E0061123-8E1'. Now when I search '8E*',
 it does return all values starting with '8E', which is totally right, but it
 returns nothing when I search '8E0*'. I guess it is not indexing '8E0' or so.
 I want to search with all combinations like '8E', '8E0', '8E00', '8E006',
 etc. But currently it returns results only when I type '8E' or the complete
 '8E0061123-8E1'... any idea??



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/indexing-issue-tp4009122.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: indexing issue

2012-09-20 Thread Jack Krupansky
You probably are using a text field which is tokenizing the input when 
this data should probably be a string (or text with the 
KeywordAnalyzer.)
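
A sketch of the second option (the type name here is invented): keep the
whole part number as a single lowercased token, so wildcard queries such
as 8E0* then match:

<fieldType name="part_number" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>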


-- Jack Krupansky

-Original Message- 
From: zainu

Sent: Thursday, September 20, 2012 5:49 AM
To: solr-user@lucene.apache.org
Subject: indexing issue

Dear fellows,
I have a field in solr with the value '8E0061123-8E1'. Now when I search '8E*',
it does return all values starting with '8E', which is totally right, but it
returns nothing when I search '8E0*'. I guess it is not indexing '8E0' or so.
I want to search with all combinations like '8E', '8E0', '8E00', '8E006',
etc. But currently it returns results only when I type '8E' or the complete
'8E0061123-8E1'... any idea??



--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-issue-tp4009122.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-19 Thread Mark Miller
we really need to resolve that issue soon...

On Jul 19, 2012, at 12:08 AM, Briggs Thompson wrote:

 Yury,
 
 Thank you so much! That was it. Man, I spent a good long while trouble
 shooting this. Probably would have spent quite a bit more time. I
 appreciate your help!!
 
 -Briggs
 
 On Wed, Jul 18, 2012 at 9:35 PM, Yury Kats yuryk...@yahoo.com wrote:
 
 On 7/18/2012 7:11 PM, Briggs Thompson wrote:
 I have realized this is not specific to SolrJ but to my instance of
 Solr. Using curl to delete by query is not working either.
 
 Can be this: https://issues.apache.org/jira/browse/SOLR-3432
 

- Mark Miller
lucidimagination.com













Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-19 Thread Briggs Thompson
This is unrelated for the most part, but the javabin update request handler
does not seem to be working properly when calling the SolrJ method
*HttpSolrServer.deleteById(List<String> ids)*. A single id gets deleted from
the index as opposed to the full list. It appears properly in the logs -
shows deletes of all ids sent, although all but one remain in the index.

I confirmed that the default update request handler deletes the list
properly, so this appears to be a problem with
the BinaryUpdateRequestHandler.

Not an issue for me, just spreading the word.
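
If anyone does hit it, a trivial workaround sketch (untested -- it just
leans on the single-id call, which works) is to delete one id at a time:

for (String id : ids) {
    solrServer.deleteById(id); // the single-id variant deletes correctly
}
solrServer.commit();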

Thanks,
Briggs

On Thu, Jul 19, 2012 at 9:00 AM, Mark Miller markrmil...@gmail.com wrote:

 we really need to resolve that issue soon...

 On Jul 19, 2012, at 12:08 AM, Briggs Thompson wrote:

  Yury,
 
  Thank you so much! That was it. Man, I spent a good long while trouble
  shooting this. Probably would have spent quite a bit more time. I
  appreciate your help!!
 
  -Briggs
 
  On Wed, Jul 18, 2012 at 9:35 PM, Yury Kats yuryk...@yahoo.com wrote:
 
  On 7/18/2012 7:11 PM, Briggs Thompson wrote:
  I have realized this is not specific to SolrJ but to my instance of
  Solr. Using curl to delete by query is not working either.
 
  Can be this: https://issues.apache.org/jira/browse/SOLR-3432
 

 - Mark Miller
 lucidimagination.com














Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-19 Thread Mark Miller
https://issues.apache.org/jira/browse/SOLR-3649

On Thu, Jul 19, 2012 at 3:34 PM, Briggs Thompson 
w.briggs.thomp...@gmail.com wrote:

  This is unrelated for the most part, but the javabin update request handler
  does not seem to be working properly when calling the SolrJ method
  *HttpSolrServer.deleteById(List<String> ids)*. A single id gets deleted from
  the index as opposed to the full list. It appears properly in the logs -
  shows deletes of all ids sent, although all but one remain in the index.

 I confirmed that the default update request handler deletes the list
 properly, so this appears to be a problem with
 the BinaryUpdateRequestHandler.

 Not an issue for me, just spreading the word.

 Thanks,
 Briggs

 On Thu, Jul 19, 2012 at 9:00 AM, Mark Miller markrmil...@gmail.com
 wrote:

  we really need to resolve that issue soon...
 
  On Jul 19, 2012, at 12:08 AM, Briggs Thompson wrote:
 
   Yury,
  
   Thank you so much! That was it. Man, I spent a good long while trouble
   shooting this. Probably would have spent quite a bit more time. I
   appreciate your help!!
  
   -Briggs
  
   On Wed, Jul 18, 2012 at 9:35 PM, Yury Kats yuryk...@yahoo.com wrote:
  
   On 7/18/2012 7:11 PM, Briggs Thompson wrote:
   I have realized this is not specific to SolrJ but to my instance of
   Solr. Using curl to delete by query is not working either.
  
   Can be this: https://issues.apache.org/jira/browse/SOLR-3432
  
 
  - Mark Miller
  lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 




-- 
- Mark

http://www.lucidimagination.com


Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-19 Thread Briggs Thompson
Thanks Mark!

On Thu, Jul 19, 2012 at 4:07 PM, Mark Miller markrmil...@gmail.com wrote:

 https://issues.apache.org/jira/browse/SOLR-3649

 On Thu, Jul 19, 2012 at 3:34 PM, Briggs Thompson 
 w.briggs.thomp...@gmail.com wrote:

   This is unrelated for the most part, but the javabin update request handler
   does not seem to be working properly when calling the SolrJ method
   *HttpSolrServer.deleteById(List<String> ids)*. A single id gets deleted
   from the index as opposed to the full list. It appears properly in the
   logs - shows deletes of all ids sent, although all but one remain in the
   index.
 
  I confirmed that the default update request handler deletes the list
  properly, so this appears to be a problem with
  the BinaryUpdateRequestHandler.
 
  Not an issue for me, just spreading the word.
 
  Thanks,
  Briggs
 
  On Thu, Jul 19, 2012 at 9:00 AM, Mark Miller markrmil...@gmail.com
  wrote:
 
   we really need to resolve that issue soon...
  
   On Jul 19, 2012, at 12:08 AM, Briggs Thompson wrote:
  
Yury,
   
Thank you so much! That was it. Man, I spent a good long while
 trouble
shooting this. Probably would have spent quite a bit more time. I
appreciate your help!!
   
-Briggs
   
On Wed, Jul 18, 2012 at 9:35 PM, Yury Kats yuryk...@yahoo.com
 wrote:
   
On 7/18/2012 7:11 PM, Briggs Thompson wrote:
I have realized this is not specific to SolrJ but to my instance of
Solr. Using curl to delete by query is not working either.
   
Can be this: https://issues.apache.org/jira/browse/SOLR-3432
   
  
   - Mark Miller
   lucidimagination.com
  
  
  
  
  
  
  
  
  
  
  
  
 



 --
 - Mark

 http://www.lucidimagination.com



Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-18 Thread Briggs Thompson
I have realized this is not specific to SolrJ but to my instance of Solr.
Using curl to delete by query is not working either.

Running
curl http://localhost:8983/solr/coupon/update -H 'Content-Type: text/xml'
--data-binary '<delete><query>*:*</query></delete>'

Yields this in the logs:
INFO: [coupon] webapp=/solr path=/update
params={stream.body=<delete><query>*:*</query></delete>}
{deleteByQuery=*:*} 0 0

But the corpus of documents in the core do not change.

My solrconfig is pretty barebones at this point, but I attached it in case
anyone sees something strange. Anyone have any idea why documents aren't
getting deleted?

Thanks in advance,
Briggs Thompson

On Wed, Jul 18, 2012 at 12:54 PM, Briggs Thompson 
w.briggs.thomp...@gmail.com wrote:

 Hello All,

 I am using 4.0 Alpha and running into an issue with indexing using
 HttpSolrServer (SolrJ).

 Relevant java code:
 HttpSolrServer solrServer = new HttpSolrServer(MY_SERVER);
 solrServer.setRequestWriter(new BinaryRequestWriter());

 Relevant Solrconfig.xml content:

   <requestHandler name="/update" class="solr.UpdateRequestHandler" />

   <requestHandler name="/update/javabin"
     class="solr.BinaryUpdateRequestHandler" />

 Indexing documents works perfectly fine (using addBeans()); however, when
 trying to do deletes I am seeing issues. I tried to do
 a solrServer.deleteByQuery("*:*") followed by a commit and optimize, and
 nothing is deleted.

 The response from delete request is a success, and even in the solr logs
 I see the following:

 INFO: [coupon] webapp=/solr path=/update/javabin
 params={wt=javabin&version=2} {deleteByQuery=*:*} 0 1
 Jul 18, 2012 11:15:34 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start
 commit{flags=0,version=0,optimize=true,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false}



  I tried removing the BinaryRequestWriter and having the request sent out in
  the default format, and I get the following error.

 SEVERE: org.apache.solr.common.SolrException: Unsupported ContentType:
 application/octet-stream  Not in: [application/xml, text/csv, text/json,
 application/csv, application/javabin, text/xml, application/json]

 at
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:86)
 at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
  at
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
  at
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
 at
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
  at
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
  at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
 at
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
  at
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
 at
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)


 I thought that an optimize does the same thing as expungeDeletes, but in
 the log I see expungeDeletes=false. Is there a way to force that using
 SolrJ?

 Thanks in advance,
 Briggs


<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--
 This is a stripped down 

Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-18 Thread Brendan Grainger
Hi Briggs,

I'm not sure about Solr 4.0, but do you need to commit?

 curl 'http://localhost:8983/solr/coupon/update?commit=true' -H 'Content-Type:
 text/xml' --data-binary '<delete><query>*:*</query></delete>'


Brendan


www.kuripai.com

On Jul 18, 2012, at 7:11 PM, Briggs Thompson wrote:

 I have realized this is not specific to SolrJ but to my instance of Solr. 
 Using curl to delete by query is not working either. 
 
 Running 
 curl http://localhost:8983/solr/coupon/update -H 'Content-Type: text/xml'
 --data-binary '<delete><query>*:*</query></delete>'
 
 Yields this in the logs:
 INFO: [coupon] webapp=/solr path=/update 
 params={stream.body=<delete><query>*:*</query></delete>} {deleteByQuery=*:*} 
 0 0
 
 But the corpus of documents in the core do not change. 
 
 My solrconfig is pretty barebones at this point, but I attached it in case 
 anyone sees something strange. Anyone have any idea why documents aren't 
 getting deleted?
 
 Thanks in advance,
 Briggs Thompson
 
 On Wed, Jul 18, 2012 at 12:54 PM, Briggs Thompson 
 w.briggs.thomp...@gmail.com wrote:
 Hello All,
 
 I am using 4.0 Alpha and running into an issue with indexing using 
 HttpSolrServer (SolrJ). 
 
 Relevant java code:
 HttpSolrServer solrServer = new HttpSolrServer(MY_SERVER);
 solrServer.setRequestWriter(new BinaryRequestWriter());
 
 Relevant Solrconfig.xml content:
   <requestHandler name="/update" class="solr.UpdateRequestHandler" />
   <requestHandler name="/update/javabin"
 class="solr.BinaryUpdateRequestHandler" />
 
 Indexing documents works perfectly fine (using addBeans()), however, when 
 trying to do deletes I am seeing issues. I tried to do a 
  solrServer.deleteByQuery("*:*") followed by a commit and optimize, and 
 nothing is deleted. 
 
 The response from delete request is a success, and even in the solr logs I 
 see the following:
 INFO: [coupon] webapp=/solr path=/update/javabin 
  params={wt=javabin&version=2} {deleteByQuery=*:*} 0 1
 Jul 18, 2012 11:15:34 AM org.apache.solr.update.DirectUpdateHandler2 commit
 INFO: start 
 commit{flags=0,version=0,optimize=true,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false}
 
 
 I tried removing the binaryRequestWriter and have the request send out in 
 default format, and I get the following error. 
 SEVERE: org.apache.solr.common.SolrException: Unsupported ContentType: 
 application/octet-stream  Not in: [application/xml, text/csv, text/json, 
 application/csv, application/javabin, text/xml, application/json]
   at 
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:86)
   at 
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
   at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
   at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
   at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
   at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
   at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
   at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
   at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
   at 
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
   at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
   at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
   at 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
   at 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
   at 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:636)
 
 
 I thought that an optimize does the same thing as expungeDeletes, but in the 
 log I see expungeDeletes=false. Is there a way to force that using SolrJ?
 
 Thanks in advance,
 Briggs
 
 
 solrconfig.xml



Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-18 Thread Yury Kats
On 7/18/2012 7:11 PM, Briggs Thompson wrote:
 I have realized this is not specific to SolrJ but to my instance of Solr. 
 Using curl to delete by query is not working either. 

Can be this: https://issues.apache.org/jira/browse/SOLR-3432


Re: Solr 4 Alpha SolrJ Indexing Issue

2012-07-18 Thread Briggs Thompson
Yury,

Thank you so much! That was it. Man, I spent a good long while trouble
shooting this. Probably would have spent quite a bit more time. I
appreciate your help!!

-Briggs

On Wed, Jul 18, 2012 at 9:35 PM, Yury Kats yuryk...@yahoo.com wrote:

 On 7/18/2012 7:11 PM, Briggs Thompson wrote:
  I have realized this is not specific to SolrJ but to my instance of
 Solr. Using curl to delete by query is not working either.

 Can be this: https://issues.apache.org/jira/browse/SOLR-3432



Indexing Issue between Mac OS X 10.5 and 10.6

2011-01-07 Thread Kevin Murdoff
Greetings Everyone -

I am hoping someone can help me with this unusual issue I have here.

Issue
Indexing information in a database (i.e.  /dataimport [full-import]) succeeds 
when I perform this function on a Mac OS X 10.6 with Java 1.6, but fails when I 
attempt the same indexing task on a 10.5 / Java 1.5 server.  When the indexing 
succeeds, I end up with 211,095 documents.  When the indexing fails (on the 
10.5 machine), I end up with 58,286 documents.  The error I receive in the 
Tomcat 'catalina.out' log file is:

SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.StackOverflowError
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:424)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: java.lang.StackOverflowError
at com.frontbase.jdbc.FBJRowHandler.close(Unknown Source)
at com.frontbase.jdbc.FBJRowHandler.close(Unknown Source)
...

Background
I want to index the database information as a single document in Solr 1.4.1.  
The document, as defined in the 'data-config.xml' file, has 10 entities, each 
with 5 primitive fields and 2 entity fields.  Most of these 10 entities do not 
represent very large datasets except one, which could represent over 95% of the 
result set.

I have tried tweaking the configuration values in the mainIndex section of 
the 'solrconfig.xml' file.  I lowered the maxFieldLength from 10,000 to 100, 
and lowered the mergeFactor from 10 to 5.  Making these changes, 
independently and together, did not exhibit any change in the indexing failures 
I have been experiencing.

I expanded the JVM min/max memory settings using -Xms and -Xmx set as high as 
1024/2048 respectively.

I also obtained the Solr-1.4.1 release source code, built it on the 10.5 /1.5 
server machine, and performed the same indexing task.  This resulted in the 
same stack overflow error.

Inquiry
Can someone tell me if they have experienced something similar?  If so, did you 
find a solution?  Or, does anyone know what may be causing these stack overflow 
errors?

Please let me know what other information I can provide that would be useful.

Thank you for your help!

- KFM



Fwd: indexing: issue with default values

2010-02-12 Thread nabil rabhi
In the schema.xml I have fields with int type and a default value,
e.g.: <field name="postal_code" type="int" indexed="true" stored="true"
default="0"/>
but when a document has no value for the field postal_code
at indexing, I get the following error:

Posting file Immo.xml to http://localhost:8983/solr/update/
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>For input string: ""

java.lang.NumberFormatException: For input string: ""
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:470)
at java.lang.Integer.parseInt(Integer.java:499)
at org.apache.solr.schema.TrieField.createField(TrieField.java:416)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre>

</body>
</html>

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">4</int></lst>
</response>

any help? thx


Re: indexing: issue with default values

2010-02-12 Thread Erik Hatcher
When a document has no value, are you still sending a postal_code  
field in your post to Solr?  Seems like you are.


Erik

On Feb 12, 2010, at 8:12 AM, nabil rabhi wrote:


in the schema.xml I have fields with int type and a default value
e.g.: <field name="postal_code" type="int" indexed="true" stored="true"
default="0"/>
but when a document has no value for the field postal_code
at indexing, I get the following error:

Posting file Immo.xml to http://localhost:8983/solr/update/
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>For input string: ""

java.lang.NumberFormatException: For input string: ""
   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
   at java.lang.Integer.parseInt(Integer.java:470)
   at java.lang.Integer.parseInt(Integer.java:499)
   at org.apache.solr.schema.TrieField.createField(TrieField.java:416)
   at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
   at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
</pre>

</body>
</html>

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">4</int></lst>
</response>

any help? thx




Re: indexing: issue with default values

2010-02-12 Thread nabil rabhi
Yes, sometimes the document has postal_code with no value; I still post it
to solr.
2010/2/12 Erik Hatcher erik.hatc...@gmail.com

 When a document has no value, are you still sending a postal_code field in
 your post to Solr?  Seems like you are.

Erik


 On Feb 12, 2010, at 8:12 AM, nabil rabhi wrote:

  in the schema.xml I have fields with int type and a default value
  e.g.:  <field name="postal_code" type="int" indexed="true" stored="true"
  default="0"/>
  but when a document has no value for the field postal_code
  at indexing, I get the following error:

  Posting file Immo.xml to http://localhost:8983/solr/update/
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
  <title>Error 500 </title>
  </head>
  <body><h2>HTTP ERROR: 500</h2><pre>For input string: ""

  java.lang.NumberFormatException: For input string: ""
   at

 java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
   at java.lang.Integer.parseInt(Integer.java:470)
   at java.lang.Integer.parseInt(Integer.java:499)
   at org.apache.solr.schema.TrieField.createField(TrieField.java:416)
   at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
   at

 org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
   at

 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
   at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
   at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
   at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
   at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
   at

 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
   at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
   at

 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
   at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
   at

 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
   at

 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
   at org.mortbay.jetty.Server.handle(Server.java:285)
   at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
   at

 org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
   at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
   at

 org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
 /pre

 /body
 /html

 ?xml version=1.0 encoding=UTF-8?
 response
 lst name=responseHeaderint name=status0/intint
 name=QTime4/int/lst
 /response

 any help? thx





Re: indexing: issue with default values

2010-02-12 Thread nabil rabhi
Thanks Erik, that was very helpful.

2010/2/12 Erik Hatcher erik.hatc...@gmail.com

 That would be the problem then, I believe.  Simply don't post a value to
 get the default value to work.

Erik


 On Feb 12, 2010, at 10:18 AM, nabil rabhi wrote:

 Yes, sometimes the document has postal_code with no values; I still post it
 to Solr.
 2010/2/12 Erik Hatcher erik.hatc...@gmail.com

  When a document has no value, are you still sending a postal_code field
 in
 your post to Solr?  Seems like you are.

  Erik


 On Feb 12, 2010, at 8:12 AM, nabil rabhi wrote:

  in the schema.xml I have fields with int type and default value

 exp:  field name=postal_code type=int indexed=true stored=true
 default=0/
 but when a document has no value for the field postal_code
 at indexing, I get the following error:

 Posting file Immo.xml to http://localhost:8983/solr/update/

 java.lang.NumberFormatException: For input string: 
 [...]

 any help? thx
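
For anyone finding this thread later: a minimal SolrJ-style sketch of the fix
Erik describes, i.e. skip the field entirely when there is no value so that the
schema default (0) applies. The core URL and id value here are illustrative, and
the original poster was posting XML files, where the equivalent fix is simply to
omit the postal_code field element when it is empty.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddWithSchemaDefault {
    public static void main(String[] args) throws Exception {
        // Assumed core name; adjust to the real core/collection.
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/immo").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "immo-1");     // illustrative uniqueKey value
        String postalCode = null;         // stand-in for a row with no value
        if (postalCode != null && !postalCode.isEmpty()) {
            // Only send the field when a real value exists; an empty string
            // reaches TrieField.createField and fails in Integer.parseInt.
            doc.addField("postal_code", postalCode);
        }
        // With the field omitted, Solr applies the schema default (0) itself.
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}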







Indexing issue with XML control characters

2009-07-20 Thread Rupert Fiasco
During indexing I will often get this error:

SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal
character ((CTRL-CHAR, code 3))
 at [row,col {unknown-source}]: [2,1]
        at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)


By looking at this list and elsewhere I know that I need to filter out
most control characters so I have been employing this regex:

/[\x00-\x08\x0B\x0C\x0E-\x1F]/

But I still get the error. What is strange is that if I re-run my
indexing process after a failure it will work on the previously failed
node and then error out on another node some time later. That is, it
is not deterministic. If I look at the text being indexed, it is as
pure as you can get (a bunch of medical keywords like leg bones and nose).

Any ideas would be greatly appreciated.

The platform is:

Solr implementation version: 1.3.0 694707
Lucene implementation version: 2.4-dev 691741
Mac OS X 10.5.7
JVM 1.5.0_19-b02-304


Thanks
/Rupert
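
A note for the archive: the regex above does already cover CTRL-CHAR code 3
(\x03), so if the failure keeps moving around, the filter is most likely not
being applied to every field on every code path. Running a whitelist
immediately before serialization removes that doubt. Below is a minimal Java
sketch that keeps exactly the characters XML 1.0 allows (#x9, #xA, #xD,
#x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF); class and method names are
illustrative.

public final class XmlSanitizer {
    // Drop every character that is not legal in an XML 1.0 document.
    // Assumed usage: call this on each field value right before the add
    // request is serialized, so nothing can slip in afterwards.
    public static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}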


Re: Indexing issue in DIH - not all records are Indexed

2009-05-19 Thread jayakeerthi s
I changed the uniqueKey and it worked fine. Thank you very much, Noble.

2009/5/18 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 the problem is that your uniquekey may not be unique

 just remove the entry altogether

 On Mon, May 18, 2009 at 10:53 PM, jayakeerthi s mail2keer...@gmail.com
 wrote:
  Hi Noble,
  Many thanks for the reply
 
  Yes, there is a uniqueKey in the schema, which is the ProductID.
 
  I also tried uniqueKey required=falsePROD_ID/uniqueKey, but no luck;
  the same single document is seen after querying *:*
 
  I have attached the Schema.xml used for your reference,please advise.
 
  Thanks and regards,
  Jay
 
  2009/5/16 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com
 
  check out if you have a uniqueKey in your schema. If there are
  duplicates they are overwritten
 
  On Sat, May 16, 2009 at 1:38 AM, jayakeerthi s mail2keer...@gmail.com
  wrote:
   I am using Solr for our application with JBoss Integration.
   [...]
   Regards,
   Jay
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com
 
 



 --
  -
 Noble Paul | Principal Engineer| AOL | http://aol.com
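
The symptom in this thread (15 rows fetched, one document visible) is the
classic uniqueKey collision Noble points at: the join produces many rows per
product, every row carries the same PROD_ID, and each add overwrites the
previous one. A small SolrJ-flavored sketch of the effect, with illustrative
field names and core URL; the DIH-side fix is what Jay ended up doing, namely
selecting a genuinely unique value (for instance a composite of PROD_ID and
STYL_CD built in the SQL) as the uniqueKey field.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UniqueKeyOverwrite {
    public static void main(String[] args) throws Exception {
        SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrInputDocument red = new SolrInputDocument();
        red.addField("prod_id", "123");   // uniqueKey
        red.addField("styl_cd", "RED");
        SolrInputDocument blue = new SolrInputDocument();
        blue.addField("prod_id", "123");  // same key: this add replaces "red"
        blue.addField("styl_cd", "BLUE");
        solr.add(red);
        solr.add(blue);
        solr.commit();
        // q=*:* now returns numFound=1 -- the second add silently replaced
        // the first, which is how "Added/Updated: 15 documents" can collapse
        // to a single visible document.
        solr.close();
    }
}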



Re: Indexing issue in DIH - not all records are Indexed

2009-05-18 Thread jayakeerthi s
Hi Noble,
Many thanks for the reply

Yes, there is a uniqueKey in the schema, which is the ProductID.

I also tried uniqueKey required=falsePROD_ID/uniqueKey, but no luck;
the same single document is seen after querying *:*

I have attached the Schema.xml used for your reference,please advise.

Thanks and regards,
Jay

2009/5/16 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

 check out if you have a uniqueKey in your schema. I there are
 duplicates they are overwritten

 On Sat, May 16, 2009 at 1:38 AM, jayakeerthi s mail2keer...@gmail.com
 wrote:
  I am using Solr for our application with JBoss Integration.
  [...]
  Regards,
  Jay



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Indexing issue in DIH - not all records are Indexed

2009-05-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
check out if you have a uniqueKey in your schema. If there are
duplicates they are overwritten

On Sat, May 16, 2009 at 1:38 AM, jayakeerthi s mail2keer...@gmail.com wrote:
 I am using Solr for our application with JBoss Integration.

 I have managed to configure the indexing from an Oracle db for 22 fields. Here
 is the db-data-config.xml

 dataConfig
   dataSource type=JdbcDataSource driver=oracle.jdbc.driver.OracleDriver
 url=jdbc:oracle:thin:@camatld6.***.com:1521:atlasint
 user=service_product_lgd password=/

 document name=products

  entity name=PROD transformer=RegexTransformer query=SELECT
 A.PROD_ID,A.PROD_CD,C.REG_CMRC_STYL_NM,C.SAP_LANG_ID,A.DIV_ID
 ,c.SIZE_RUN_DESC, c.INSM_DESC, c.OTSM_DESC, c.DIM_DESC,
  c.PRFL_DESC,c.UPR_DESC,c.MDSL_DESC,c.OUTSL_DESC,c.CTNT_DESC,
 D.SPORT_ACTY_DESC, E.GNDR_AGE_DESC,
  A.PO_GRID_DESC,A.COLR_DISP_CD, B.STYL_CD , A.SILO_ID, A.SILH_ID,
 F.SILH_DESC, g.SILO_DESC , h.FRST_PROD_OFFR_DT,
  h.END_FTR_OFFR_DT,
 h.RETL_PR_AMT,h.RETL_CRCY_ID,h.WHSLE_PR_AMT,h.WHSLE_CRCY_ID,I.ORG_LGCY_DIV_CD
  from
  PROD A ,PROD_STYL B ,PROD_REG_CMRC_STYL C , PROD_SPORT_ACTY D ,
 PROD_GNDR_AGE E , PROD_SILH F, PROD_SILO G, PROD_REG H, ORG_DIV I
  WHERE
  A.PROD_STYL_ID=B.PROD_STYL_ID
  AND A.PROD_STYL_ID = c.PROD_STYL_ID
  AND B.PROD_STYL_ID = C.PROD_STYL_ID
  AND A.SPORT_ACTY_ID = d.SPORT_ACTY_ID
  AND A.GNDR_AGE_ID = E.GNDR_AGE_ID
  and A.SILH_ID = F.SILH_ID
  AND A.SILO_ID = G.SILO_ID
  AND A.PROD_ID = H.PROD_ID
  AND A.DIV_ID = I.DIV_ID 
  /entity
/document
   /dataConfig

 And I have attached the Schema.xml used, and done a full-import:
 http://localhost:8983/solr/dataimport?command=full-import


 response
 lst name=responseHeader
 int name=status0/int
 int name=QTime0/int
 /lst
 lst name=initArgs
 lst name=defaults
 str name=config
 C:\apache-solr-nightly\example\example-DIH\solr\db\conf\db-data-config.xml
 /str
 /lst
 /lst
 str name=commandfull-import/str
 str name=statusidle/str
 str name=importResponse/
 lst name=statusMessages
 str name=Total Requests made to DataSource1/str
 str name=Total Rows Fetched15/str
 str name=Total Documents Skipped0/str
 str name=Full Dump Started2009-05-11 11:27:02/str
 str name=
 Indexing completed. Added/Updated: 15 documents. Deleted 0 documents.
 /str
 str name=Committed2009-05-11 11:27:05/str
 str name=Optimized2009-05-11 11:27:05/str
 str name=Time taken 0:0:2.625/str
 /lst
 str name=WARNING
 This response format is experimental.  It is likely to change in the future.
 /str
 /response

 The issue I am facing is: though the response says Indexing completed.
 Added/Updated: 15 documents. Deleted 0 documents,
 I am able to see only one document when I query *:*, so all the other 14
 documents are missing.
 Similarly I tried indexing 1 million records and found only 2500 docs by
 using *:* query

 So could anyone please help resolving this.


 Regards,
 Jay



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Indexing issue

2009-03-17 Thread Chris Hostetter

: I have two cores on different machines which refer to the same data 
: directory.

this isn't really considered a supported configuration ... both solr 
instances are going to try and own the directory for updating, and 
unless you do something special to ensure only one has control you are
going to have problems...

: below error.   HTTP Status 500 - java.io.FileNotFoundException: 
: \\SolrShare\CollectionSet2\English\Auction\Auction0\index\_c.fdt (The 
: system cannot find the file specified) java.lang.RuntimeException: 
: java.io.FileNotFoundException: 
: \\SolrShare\CollectionSet2\English\Auction\Auction0\index\_c.fdt (The 

...like this.  one core is mucking with the files in a way the other core 
doesn't know about.

: I have changed lockType to simple and none, but still no luck…
: Could you please correct me if I am doing wrong?

none isn't going to help you -- it's just going to make the problem 
worse (two misconfigured instances of Solr in the same JVM could corrupt 
each other with lockType=none).

simple is only going to help you on some filesystems -- since you said 
these two solr instances are running on different machines, that implies 
NFS (or something like it) and SimpleFSLockFactory doesn't work reliably 
in those cases.

If you want to get something like this working, you'll probably need 
to set up your own network-based lockType (instead of relying on the 
filesystem).


-Hoss


Indexing issue

2009-03-03 Thread mahendra mahendra
Hi,
 
I have two cores on different machines which refer to the same data 
directory.
I have implemented this mechanism to have fault tolerance in place: if either 
machine goes down, the other one takes over indexing the data.
 
Since the two cores refer to the same data directory, reindexing sometimes 
fails and shows the error below.
 
HTTP Status 500 - java.io.FileNotFoundException: 
\\SolrShare\CollectionSet2\English\Auction\Auction0\index\_c.fdt (The system 
cannot find the file specified)

java.lang.RuntimeException: java.io.FileNotFoundException: 
\\SolrShare\CollectionSet2\English\Auction\Auction0\index\_c.fdt (The system 
cannot find the file specified)
   at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960)
   at org.apache.solr.core.SolrCore.init(SolrCore.java:470)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:323)
   at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:107)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)

I have changed lockType to simple and none, but still no luck…
Could you please correct me if I am doing wrong?
 
Thanks in advance!!
 
Regards,
Mahendra