Re: Errors on master after upgrading to 4.10.3
Ahh, makes sense. I did have a feeling I was barking up the wrong tree since it's an Extraction issue, but I thought I'd throw it out there, anyway. Thanks so much for the information!

On Wed, Feb 17, 2016 at 4:49 PM, Rachel Lynn Underwood <r.lynn.underw...@gmail.com> wrote:

> This is an error being thrown by Apache PDFBox/Tika. You're seeing it now
> because Solr 4.x uses a different Tika version than Solr 3.x.
>
> It looks like this error is thrown when you parse a PDF with Tika, and a
> font in that PDF doesn't have a ToUnicode mapping.
> https://issues.apache.org/jira/browse/PDFBOX-1408
>
> Another user reported that this might be related to special characters, but
> PDFBox developers haven't been able to reproduce the bug.
> https://issues.apache.org/jira/browse/PDFBOX-1706
>
> Since this isn't an issue in the Solr code, if you're concerned about it,
> you'll probably have better luck asking the PDFBox developers directly, via
> Jira or their mailing list.
>
> On Tue, Feb 16, 2016 at 12:08 PM, Joseph Hagerty <joa...@gmail.com> wrote:
>
> > Does literally nobody else see this error in their logs? I see this error
> > hundreds of times per day, in occasional bursts. Should I file this as a
> > bug?
> >
> > On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com> wrote:
> >
> > > After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> > > alarming regularity in the master's error log:
> > >
> > > 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> > > space character using 250 as default
> > >
> > > I can't seem to glean much information about this one from the web. Has
> > > anyone else fought this error?
> > >
> > > In case this helps, here's some technical/miscellaneous info:
> > >
> > > - I'm running a master-slave set-up.
> > >
> > > - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> > > from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> > > this, but I don't know the first thing about it.
> > >
> > > - I have the clients specifying 'autocommit=6s' in their requests, which I
> > > realize is a pretty aggressive commit interval, but so far that hasn't
> > > caused any problems I couldn't surmount.
> > >
> > > - There are north of 11 million docs in my index, which is 36 gigs thick.
> > > The storage volume is only 10% full.
> > >
> > > - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> > > to incompatibility between versions.
> > >
> > > - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> > > 30 gigs of RAM).
> > >
> > > So far, I have been unable to reproduce this error on my own: I can only
> > > observe it in the logs. I haven't been able to tie it to any specific
> > > document.
> > >
> > > Let me know if further information would be helpful.
> >
> > --
> > - Joe

--
- Joe
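(Aside: since the message is PDFBox/Tika noise rather than a Solr bug, one way to quiet the logs is to mute the logger that emits it. A sketch only, assuming Solr 4.10's stock log4j setup and that the message comes from PDFBox's PDSimpleFont class; adjust the logger name to whatever your logs actually report.)

```properties
# Hypothetical addition to example/resources/log4j.properties:
# let only FATAL through for the PDFBox logger that emits the
# "Can't determine the width of the space character" errors.
log4j.logger.org.apache.pdfbox.pdmodel.font.PDSimpleFont=FATAL
```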
Re: Errors on master after upgrading to 4.10.3
Does literally nobody else see this error in their logs? I see this error hundreds of times per day, in occasional bursts. Should I file this as a bug?

On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com> wrote:

> After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> alarming regularity in the master's error log:
>
> 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> space character using 250 as default
>
> I can't seem to glean much information about this one from the web. Has
> anyone else fought this error?
>
> In case this helps, here's some technical/miscellaneous info:
>
> - I'm running a master-slave set-up.
>
> - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> this, but I don't know the first thing about it.
>
> - I have the clients specifying 'autocommit=6s' in their requests, which I
> realize is a pretty aggressive commit interval, but so far that hasn't
> caused any problems I couldn't surmount.
>
> - There are north of 11 million docs in my index, which is 36 gigs thick.
> The storage volume is only 10% full.
>
> - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> to incompatibility between versions.
>
> - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> 30 gigs of RAM).
>
> So far, I have been unable to reproduce this error on my own: I can only
> observe it in the logs. I haven't been able to tie it to any specific
> document.
>
> Let me know if further information would be helpful.

--
- Joe
Errors on master after upgrading to 4.10.3
After migrating from 3.5 to 4.10.3, I'm seeing the following error with alarming regularity in the master's error log:

2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the space character using 250 as default

I can't seem to glean much information about this one from the web. Has anyone else fought this error?

In case this helps, here's some technical/miscellaneous info:

- I'm running a master-slave set-up.

- I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of this, but I don't know the first thing about it.

- I have the clients specifying 'autocommit=6s' in their requests, which I realize is a pretty aggressive commit interval, but so far that hasn't caused any problems I couldn't surmount.

- There are north of 11 million docs in my index, which is 36 gigs thick. The storage volume is only 10% full.

- When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due to incompatibility between versions.

- Both master and slave are running on AWS instances, C4.4XL's (16 cores, 30 gigs of RAM).

So far, I have been unable to reproduce this error on my own: I can only observe it in the logs. I haven't been able to tie it to any specific document.

Let me know if further information would be helpful.
Re: JVM heap constraints and garbage collection
Thanks, Shawn. This information is actually not all that shocking to me. It's always been in the back of my mind that I was getting away with something in serving from the m1.large. Remarkably, however, it has served me well for nearly two years; also, although the index has not always been 30GB, it has always been much larger than the RAM on the box. As you suggested, I can only suppose that usage patterns and the index schema have in some way facilitated minimal heap usage, up to this point.

For now, we're going to increase the heap size on the instance and see where that gets us; if that still doesn't suffice, then we'll upgrade to a more powerful instance.

Michael, thanks for weighing in. Those i2 instances look delicious indeed. Just curious -- have you struggled with garbage collection pausing at all?

On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
> > I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
> > <snip>
> > - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
>
> One detail that you did not provide was how much of your 7.5GB RAM you are
> allocating to the Java heap for Solr, but I actually don't think I need
> that information, because for your index size, you simply don't have
> enough. If you're sticking with Amazon, you'll want one of the instances
> with at least 30GB of RAM, and you might want to consider more memory than
> that.
>
> An ideal RAM size for Solr is equal to the size of on-disk data plus the
> heap space used by Solr and other programs. This means that if your Java
> heap for Solr is 4GB and there are no other significant programs running
> on the same server, you'd want a minimum of 34GB of RAM for an ideal setup
> with your index. 4GB of that would be for Solr itself; the remainder would
> be for the operating system to fully cache your index in the OS disk cache.
>
> Depending on your query patterns and how your schema is arranged, you
> *might* be able to get away with as little as half of your index size just
> for the OS disk cache, but it's better to make it big enough for the whole
> index, plus room for growth.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Many people are *shocked* when they are told this information, but if you
> think about the relative speeds of getting a chunk of data from a hard
> disk vs. getting the same information from memory, it's not all that
> shocking.
>
> Thanks,
> Shawn

--
- Joe
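(Aside: Shawn's sizing rule is easy to make concrete. A quick sketch using this thread's 30GB index and an assumed 4GB Solr heap; your own heap size may differ.)

```shell
# Ideal RAM = on-disk index size + Solr's Java heap (plus headroom for
# the OS, other processes, and index growth). Figures use this thread's
# 30GB index and an assumed 4GB heap.
index_gb=30
solr_heap_gb=4

ideal_ram_gb=$((index_gb + solr_heap_gb))
echo "ideal: ${ideal_ram_gb}GB"

# Looser floor: some query/schema patterns get by caching only about
# half the index in the OS disk cache.
minimum_ram_gb=$((index_gb / 2 + solr_heap_gb))
echo "floor: ${minimum_ram_gb}GB"
```

That arithmetic is why an m1.large (7.5 GiB) falls so far short of a 30GB index.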
JVM heap constraints and garbage collection
Greetings esteemed Solr-ites,

I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.

Since my average load during peak hours is becoming quite high, and since I'm finally starting to notice a little bit of performance degradation and intermittent errors (e.g. Solr returned response 0 on perfectly valid reads during load spikes), I think it's time to tune my slave box before things get out of control. In particular, *I am curious how others are tuning their JVM heap constraints (-Xms, -Xmx, etc.) and garbage collection (parallel or concurrent) to meet the needs of Solr*. I am using the Sun JVM version 6, not the fancy third-party offerings.

Some more info, FWIW:

- Average document size in my index is probably around 6k
- Using CentOS
- Master-slave setup. Master gets all the writes, slave gets all the read requests. It is the *slave* that is suffering -- the master seems fine.
- The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
- Daemon threads skyrocket during the aforementioned load spikes

Thanks for reading, and to the devs: thanks for an excellent product.

--
- Joe
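(Aside: a common starting point on Sun JVM 6 under Tomcat is a fixed-size heap plus the CMS collector. The snippet below is a sketch only, for a hypothetical bin/setenv.sh; the sizes and thresholds are illustrative placeholders, not tuned recommendations, and the log path is made up.)

```shell
# Illustrative GC settings for Tomcat's bin/setenv.sh (Sun JVM 6 era).
# Equal -Xms/-Xmx avoids heap-resize pauses; CMS + ParNew trades some
# throughput for shorter stop-the-world pauses on a read-heavy slave.
# GC logging makes actual pause behavior visible before tuning further.
export CATALINA_OPTS="$CATALINA_OPTS \
  -Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/tomcat6/gc.log"
```

Whatever values you pick, watch the GC log during a load spike before adjusting anything else.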
ExtractRH: How to strip metadata
Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers etc. when it constructs the content of the document I send to it?

For example, I created an MS Word document containing just the word SEARCHWORD and nothing else. However, when I ship this doc to my Solr server, here's what's thrown in the index:

<str name="meta">Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD</str>

All I want is the body of the document, in this case the word SEARCHWORD.

For further reference, here's my extraction handler:

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, meta is the field in the Solr schema to which I'm attempting to extract the body of the document. Don't ask.)

Thanks in advance for any pointers you can provide me.

--
- Joe
Re: ExtractRH: How to strip metadata
I do not. I commented out all of the copyFields provided in the default schema.xml that ships with 3.5. My schema is rather minimal. Here is my fields block, if this helps:

<fields>
  <field name="cust" type="string" indexed="true" stored="true" required="true" />
  <field name="asset" type="string" indexed="true" stored="true" required="true" />
  <field name="ent" type="string" indexed="true" stored="true" required="true" />
  <field name="meta" type="text_en" indexed="true" stored="true" required="true" />
  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
  <!-- <field name="modified" type="dateTime" indexed="true" stored="true" required="false" /> -->
</fields>

On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Check to see if you have a copyField for a wildcard pattern that copies to
> meta, which would copy all of the Tika-generated fields to meta.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 9:56 AM
> To: solr-user@lucene.apache.org
> Subject: ExtractRH: How to strip metadata
>
> Greetings Solr folk,
>
> How can I instruct the extract request handler to ignore metadata/headers
> etc. when it constructs the content of the document I send to it?
>
> For example, I created an MS Word document containing just the word
> SEARCHWORD and nothing else. However, when I ship this doc to my Solr
> server, here's what's thrown in the index:
>
> <str name="meta">Last-Printed 2009-02-05T15:02:00Z Revision-Number 22
> Comments stream_source_info myfile Last-Author Inigo Montoya Template
> Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word
> Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086
> Creation-Date 2008-11-05T20:19:00Z stream_content_type
> application/octet-stream Character Count 14 stream_size 31232 stream_name
> /Applications/MAMP/tmp/php/phpHCIg7y Company Parkman Elastomers Pvt Ltd
> Content-Type application/msword Keywords Last-Save-Date
> 2012-05-01T18:55:00Z SEARCHWORD</str>
>
> All I want is the body of the document, in this case the word SEARCHWORD.
>
> For further reference, here's my extraction handler:
>
> <requestHandler name="/update/extract" startup="lazy"
>     class="solr.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <!-- All the main content goes into "text"... if you need to return
>          the extracted text or do highlighting, use a stored field. -->
>     <str name="fmap.content">meta</str>
>     <str name="lowernames">true</str>
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> (Ironically, meta is the field in the Solr schema to which I'm attempting
> to extract the body of the document. Don't ask.)
>
> Thanks in advance for any pointers you can provide me.
>
> --
> - Joe

--
- Joe
Re: ExtractRH: How to strip metadata
How interesting! You know, I did at one point consider that perhaps the field name meta might be treated specially, but I talked myself out of it. I reasoned that a field name in my local schema should have no bearing on how a plugin such as solr-cell/Tika behaves. I should have tested my hypothesis; even if this phenomenon turns out to be undocumented behavior, I consider myself a victim of my own assumptions.

I am running version 3.5. You may have gotten the multiValued errors due to the way your test schema and/or extracting request handler is laid out (my bad). I am using the "ignored" fieldtype and a dynamicField called ignored_* as a catch-all for extraneous fields delivered by Tika.

Thanks for your help! Please keep me posted on any further insights/revelations, and I'll do the same.

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> I did some testing, and evidently the meta field is treated specially by
> the ERH. I copied the example schema, and added both meta and metax fields
> and set fmap.content=metax, and lo and behold only the doc content appears
> in metax, but all the doc metadata appears in meta. Although, I did get
> 400 errors with Solr complaining that meta was not a multiValued field.
>
> This is with Solr 3.6. What release of Solr are you using?
>
> I was not aware of this undocumented feature. I haven't checked the code
> yet.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joseph Hagerty
> Sent: Wednesday, May 02, 2012 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractRH: How to strip metadata
>
> I do not. I commented out all of the copyFields provided in the default
> schema.xml that ships with 3.5. My schema is rather minimal. Here is my
> fields block, if this helps:
>
> <fields>
>   <field name="cust" type="string" indexed="true" stored="true" required="true" />
>   <field name="asset" type="string" indexed="true" stored="true" required="true" />
>   <field name="ent" type="string" indexed="true" stored="true" required="true" />
>   <field name="meta" type="text_en" indexed="true" stored="true" required="true" />
>   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
>   <!-- <field name="modified" type="dateTime" indexed="true" stored="true" required="false" /> -->
> </fields>
>
> On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
> > Check to see if you have a copyField for a wildcard pattern that copies
> > to meta, which would copy all of the Tika-generated fields to meta.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Joseph Hagerty
> > Sent: Wednesday, May 02, 2012 9:56 AM
> > To: solr-user@lucene.apache.org
> > Subject: ExtractRH: How to strip metadata
> >
> > Greetings Solr folk,
> >
> > How can I instruct the extract request handler to ignore metadata/headers
> > etc. when it constructs the content of the document I send to it?
> >
> > For example, I created an MS Word document containing just the word
> > SEARCHWORD and nothing else. However, when I ship this doc to my Solr
> > server, here's what's thrown in the index:
> >
> > <str name="meta">Last-Printed 2009-02-05T15:02:00Z Revision-Number 22
> > Comments stream_source_info myfile Last-Author Inigo Montoya Template
> > Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word
> > Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086
> > Creation-Date 2008-11-05T20:19:00Z stream_content_type
> > application/octet-stream Character Count 14 stream_size 31232 stream_name
> > /Applications/MAMP/tmp/php/phpHCIg7y Company Parkman Elastomers Pvt Ltd
> > Content-Type application/msword Keywords Last-Save-Date
> > 2012-05-01T18:55:00Z SEARCHWORD</str>
> >
> > All I want is the body of the document, in this case the word SEARCHWORD.
> >
> > For further reference, here's my extraction handler:
> >
> > <requestHandler name="/update/extract" startup="lazy"
> >     class="solr.extraction.ExtractingRequestHandler">
> >   <lst name="defaults">
> >     <!-- All the main content goes into "text"... if you need to return
> >          the extracted text or do highlighting, use a stored field. -->
> >     <str name="fmap.content">meta</str>
> >     <str name="lowernames">true</str>
> >     <str name="uprefix">ignored_</str>
> >   </lst>
> > </requestHandler>
> >
> > (Ironically, meta is the field in the Solr schema to which I'm attempting
> > to extract the body of the document. Don't ask).
> >
> > Thanks in advance for any pointers you can provide me.
> >
> > --
> > - Joe

--
- Joe
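(Aside: if the special treatment of a field literally named "meta" is the culprit, the natural workaround is to point fmap.content at a differently named field. A sketch only; "body" here is a hypothetical stored text_en field that would replace "meta" in the schema.)

<requestHandler name="/update/extract" startup="lazy"
    class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the extracted body text to a field NOT named "meta".
         Unmapped Tika metadata fields get the ignored_ prefix and
         fall into the ignored_* dynamicField, where they are dropped. -->
    <str name="fmap.content">body</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>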