Re: Errors on master after upgrading to 4.10.3

2016-02-17 Thread Joseph Hagerty
Ahh, makes sense. I did have a feeling I was barking up the wrong tree
since it's an Extraction issue, but I thought I'd throw it out there,
anyway.

Thanks so much for the information!

On Wed, Feb 17, 2016 at 4:49 PM, Rachel Lynn Underwood <
r.lynn.underw...@gmail.com> wrote:

> This is an error being thrown by Apache PDFBox/Tika. You're seeing it now
> because Solr 4.x uses a different Tika version than Solr 3.x.
>
> It looks like this error is thrown when you parse a PDF with Tika, and a
> font in that PDF doesn't have a ToUnicode mapping.
> https://issues.apache.org/jira/browse/PDFBOX-1408
>
> Another user reported that this might be related to special characters, but
> PDFBox developers haven't been able to reproduce the bug.
> https://issues.apache.org/jira/browse/PDFBOX-1706
>
> Since this isn't an issue in the Solr code, if you're concerned about it,
> you'll probably have better luck asking the PDFBox developers directly, via
> Jira or their mailing list.
>
>
> On Tue, Feb 16, 2016 at 12:08 PM, Joseph Hagerty <joa...@gmail.com> wrote:
>
> > Does literally nobody else see this error in their logs? I see this error
> > hundreds of times per day, in occasional bursts. Should I file this as a
> > bug?
> >
> > On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com>
> wrote:
> >
> > > After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> > > alarming regularity in the master's error log:
> > >
> > > 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of
> > > the space character using 250 as default
> > >
> > > I can't seem to glean much information about this one from the web. Has
> > > anyone else fought this error?
> > >
> > > In case this helps, here's some technical/miscellaneous info:
> > >
> > > - I'm running a master-slave set-up.
> > >
> > > - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> > > from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> > > this, but I don't know the first thing about it.
> > >
> > > - I have the clients specifying 'autocommit=6s' in their requests,
> > > which I realize is a pretty aggressive commit interval, but so far
> > > that hasn't caused any problems I couldn't surmount.
> > >
> > > - There are north of 11 million docs in my index, which is 36 gigs
> > > thick. The storage volume is only 10% full.
> > >
> > > - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex
> > > due to incompatibility between versions.
> > >
> > > - Both master and slave are running on AWS instances, C4.4XL's (16
> > > cores, 30 gigs of RAM).
> > >
> > > So far, I have been unable to reproduce this error on my own: I can
> > > only observe it in the logs. I haven't been able to tie it to any
> > > specific document.
> > >
> > > Let me know if further information would be helpful.
> > --
> > - Joe
> >
>



-- 
- Joe


Re: Errors on master after upgrading to 4.10.3

2016-02-16 Thread Joseph Hagerty
Does literally nobody else see this error in their logs? I see this error
hundreds of times per day, in occasional bursts. Should I file this as a
bug?

On Mon, Feb 15, 2016 at 4:56 PM, Joseph Hagerty <joa...@gmail.com> wrote:

> After migrating from 3.5 to 4.10.3, I'm seeing the following error with
> alarming regularity in the master's error log:
>
> 2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
> space character using 250 as default
>
> I can't seem to glean much information about this one from the web. Has
> anyone else fought this error?
>
> In case this helps, here's some technical/miscellaneous info:
>
> - I'm running a master-slave set-up.
>
> - I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext
> from .docs and .pdfs. I'm guessing that PDSimpleFont is a component of
> this, but I don't know the first thing about it.
>
> - I have the clients specifying 'autocommit=6s' in their requests, which I
> realize is a pretty aggressive commit interval, but so far that hasn't
> caused any problems I couldn't surmount.
>
> - There are north of 11 million docs in my index, which is 36 gigs thick.
> The storage volume is only 10% full.
>
> - When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
> to incompatibility between versions.
>
> - Both master and slave are running on AWS instances, C4.4XL's (16 cores,
> 30 gigs of RAM).
>
> So far, I have been unable to reproduce this error on my own: I can only
> observe it in the logs. I haven't been able to tie it to any specific
> document.
>
> Let me know if further information would be helpful.


-- 
- Joe


Errors on master after upgrading to 4.10.3

2016-02-15 Thread Joseph Hagerty
After migrating from 3.5 to 4.10.3, I'm seeing the following error with
alarming regularity in the master's error log:

2/15/2016, 4:32:22 PM ERROR PDSimpleFont Can't determine the width of the
space character using 250 as default

I can't seem to glean much information about this one from the web. Has
anyone else fought this error?

In case this helps, here's some technical/miscellaneous info:

- I'm running a master-slave set-up.

- I rely on the ERH (tika/solr-cell/whatever) for extracting plaintext from
.docs and .pdfs. I'm guessing that PDSimpleFont is a component of this, but
I don't know the first thing about it.

- I have the clients specifying 'autocommit=6s' in their requests, which I
realize is a pretty aggressive commit interval, but so far that hasn't
caused any problems I couldn't surmount.

- There are north of 11 million docs in my index, which is 36 gigs thick.
The storage volume is only 10% full.

- When I migrated from 3.5 to 4.10.3, I correctly performed a reindex due
to incompatibility between versions.

- Both master and slave are running on AWS instances, C4.4XL's (16 cores,
30 gigs of RAM).
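An aside on the commit interval above: a per-request commit interval like
this can also be declared once, server-side, in solrconfig.xml. A sketch
with assumed values matching the 6s interval (verify against your own
solrconfig before copying):

```xml
<!-- Hypothetical sketch: server-side commit policy in solrconfig.xml,
     as an alternative to per-request commit parameters. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>6000</maxTime>           <!-- ms; roughly the 6s used here -->
    <openSearcher>false</openSearcher> <!-- hard commit without reopening -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>6000</maxTime>           <!-- make changes visible to searchers -->
  </autoSoftCommit>
</updateHandler>
```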

So far, I have been unable to reproduce this error on my own: I can only
observe it in the logs. I haven't been able to tie it to any specific
document.

Let me know if further information would be helpful.


Re: JVM heap constraints and garbage collection

2014-01-31 Thread Joseph Hagerty
Thanks, Shawn. This information is actually not all that shocking to me.
It's always been in the back of my mind that I was getting away with
something in serving from the m1.large. Remarkably, however, it has served
me well for nearly two years; also, although the index has not always been
30GB, it has always been much larger than the RAM on the box. As you
suggested, I can only suppose that usage patterns and the index schema have
in some way facilitated minimal heap usage, up to this point.

For now, we're going to increase the heap size on the instance and see
where that gets us; if it still doesn't suffice for now, then we'll upgrade
to a more powerful instance.
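For anyone sizing similarly, the rule of thumb from Shawn's reply (RAM for
the OS disk cache equal to the on-disk index size, plus all Java heaps)
works out roughly as this sketch; the figures are this thread's, and the
helper functions are hypothetical:

```python
# Hypothetical helpers illustrating the sizing rule from this thread:
# ideal RAM = Solr heap + other significant heaps + on-disk index size,
# so the OS can cache the entire index.
def ideal_ram_gb(index_gb, solr_heap_gb, other_heaps_gb=0):
    """Minimum RAM (GB) for the 'whole index in disk cache' ideal."""
    return solr_heap_gb + other_heaps_gb + index_gb

def bare_minimum_ram_gb(index_gb, solr_heap_gb, other_heaps_gb=0):
    """Sometimes you can get away with caching only half the index."""
    return solr_heap_gb + other_heaps_gb + index_gb / 2

# 30 GB index with a 4 GB Solr heap, nothing else on the box:
print(ideal_ram_gb(30, 4))         # 34
print(bare_minimum_ram_gb(30, 4))  # 19.0
```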

Michael, thanks for weighing in. Those i2 instances look delicious indeed.
Just curious -- have you struggled with garbage collection pausing at all?



On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey s...@elyograg.org wrote:

 On 1/30/2014 3:20 PM, Joseph Hagerty wrote:

 I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.


 snip


  - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM


 One detail that you did not provide was how much of your 7.5GB RAM you are
 allocating to the Java heap for Solr, but I actually don't think I need
 that information, because for your index size, you simply don't have
 enough. If you're sticking with Amazon, you'll want one of the instances
 with at least 30GB of RAM, and you might want to consider more memory than
 that.

 An ideal RAM size for Solr is equal to the size of on-disk data plus the
 heap space used by Solr and other programs.  This means that if your java
 heap for Solr is 4GB and there are no other significant programs running on
 the same server, you'd want a minimum of 34GB of RAM for an ideal setup
 with your index.  4GB of that would be for Solr itself, the remainder would
 be for the operating system to fully cache your index in the OS disk cache.

 Depending on your query patterns and how your schema is arranged, you
 *might* be able to get away as little as half of your index size just for
 the OS disk cache, but it's better to make it big enough for the whole
 index, plus room for growth.

 http://wiki.apache.org/solr/SolrPerformanceProblems

 Many people are *shocked* when they are told this information, but if you
 think about the relative speeds of getting a chunk of data from a hard disk
 vs. getting the same information from memory, it's not all that shocking.

 Thanks,
 Shawn




-- 
- Joe


JVM heap constraints and garbage collection

2014-01-30 Thread Joseph Hagerty
Greetings esteemed Solr-ites,

I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.

Since my average load during peak hours is becoming quite high, and since
I'm finally starting to notice a little bit of performance degradation and
intermittent errors (e.g. "Solr returned response 0" on perfectly valid
reads during load spikes), I think it's time to tune my Slave box before
things get out of control.

In particular, *I am curious how others are tuning their JVM heap
constraints (-Xms, -Xmx, etc.) and garbage collection (parallel or
concurrent) to meet the needs of Solr*. I am using the Sun JVM version 6,
not the fancy third-party offerings.
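Not from this thread, but for concreteness, here's one common shape such
tuning takes on the Sun JVM 6. The flags and sizes below are assumptions
to adapt, not recommendations from anyone here:

```shell
# Hypothetical starting point: fixed-size heap plus the concurrent (CMS)
# collector, which trades some throughput for shorter GC pauses.
JAVA_OPTS="-Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly"
export JAVA_OPTS   # picked up by Tomcat's catalina.sh at startup
```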

Some more info, FWIW:

- Average document size in my index is probably around 6k
- Using CentOS
- Master-Slave setup. Master gets all the writes, Slave gets all the read
requests. It is the *Slave* that is suffering-- the Master seems fine.
- The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
- DaemonThreads skyrocket during the aforementioned load spikes

Thanks for reading, and to the devs: thanks for an excellent product.

-- 
- Joe


ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
Greetings Solr folk,

How can I instruct the extract request handler to ignore metadata/headers
etc. when it constructs the content of the document I send to it?

For example, I created an MS Word document containing just the word
"SEARCHWORD" and nothing else. However, when I ship this doc to my Solr
server, here's what's thrown in the index:

<str name="meta">
Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
</str>

All I want is the body of the document, in this case the word "SEARCHWORD".

For further reference, here's my extraction handler:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">meta</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

(Ironically, "meta" is the field in the Solr schema to which I'm attempting
to extract the body of the document. Don't ask.)

Thanks in advance for any pointers you can provide me.

-- 
- Joe


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
I do not. I commented out all of the copyFields provided in the default
schema.xml that ships with 3.5. My schema is rather minimal. Here is my
fields block, if this helps:

<fields>
  <field name="cust"  type="string"  indexed="true" stored="true" required="true"/>
  <field name="asset" type="string"  indexed="true" stored="true" required="true"/>
  <field name="ent"   type="string"  indexed="true" stored="true" required="true"/>
  <field name="meta"  type="text_en" indexed="true" stored="true" required="true"/>
  <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
  <!-- <field name="modified" type="dateTime" indexed="true" stored="true"
       required="false"/> -->
</fields>


On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky j...@basetechnology.com
wrote:

 Check to see if you have a CopyField for a wildcard pattern that copies to
 meta, which would copy all of the Tika-generated fields to meta.

 -- Jack Krupansky

 -Original Message- From: Joseph Hagerty
 Sent: Wednesday, May 02, 2012 9:56 AM
 To: solr-user@lucene.apache.org
 Subject: ExtractRH: How to strip metadata


 Greetings Solr folk,

 How can I instruct the extract request handler to ignore metadata/headers
 etc. when it constructs the content of the document I send to it?

 For example, I created an MS Word document containing just the word
 SEARCHWORD and nothing else. However, when I ship this doc to my solr
 server, here's what's thrown in the index:

 <str name="meta">
 Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
 stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
 Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
 Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
 Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
 Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
 Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
 </str>

 All I want is the body of the document, in this case the word SEARCHWORD.

 For further reference, here's my extraction handler:

 <requestHandler name="/update/extract"
                 startup="lazy"
                 class="solr.extraction.ExtractingRequestHandler">
   <lst name="defaults">
     <!-- All the main content goes into "text"... if you need to return
          the extracted text or do highlighting, use a stored field. -->
     <str name="fmap.content">meta</str>
     <str name="lowernames">true</str>
     <str name="uprefix">ignored_</str>
   </lst>
 </requestHandler>

 (Ironically, meta is the field in the solr schema to which I'm attempting
 to extract the body of the document. Don't ask).

 Thanks in advance for any pointers you can provide me.

 --
 - Joe




-- 
- Joe


Re: ExtractRH: How to strip metadata

2012-05-02 Thread Joseph Hagerty
How interesting! You know, I did at one point consider that perhaps the
fieldname meta may be treated specially, but I talked myself out of it. I
reasoned that a field name in my local schema should have no bearing on how
a plugin such as solr-cell/Tika behaves. I should have tested my
hypothesis; even if this phenomenon turns out to be undocumented behavior,
I consider myself a victim of my own assumptions.

I am running version 3.5. You may have gotten the multivalue errors due to
the way your test schema and/or extracting request handler is laid out (my
bad). I am using the "ignored" fieldtype and a dynamicField called
"ignored_*" as a catch-all for extraneous fields delivered by Tika.
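For anyone following along, the catch-all referred to above looks roughly
like this in the example schema.xml that ships with 3.x (reproduced from
memory, so verify against your own copy):

```xml
<!-- Fields of this type are neither indexed nor stored, so anything Tika
     emits that lands here (via uprefix=ignored_) is silently dropped. -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true"/>
```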

Thanks for your help! Please keep me posted on any further
insights/revelations, and I'll do the same.

On Wed, May 2, 2012 at 12:54 PM, Jack Krupansky j...@basetechnology.com
wrote:

 I did some testing, and evidently the "meta" field is treated specially
 by the ERH.

 I copied the example schema, and added both "meta" and "metax" fields and
 set fmap.content=metax, and lo and behold only the doc content appears in
 metax, but all the doc metadata appears in "meta".

 Although, I did get 400 errors with Solr complaining that "meta" was not a
 multivalued field. This is with Solr 3.6. What release of Solr are you
 using?

 I was not aware of this undocumented feature. I haven't checked the code
 yet.


 -- Jack Krupansky

 -Original Message- From: Joseph Hagerty
 Sent: Wednesday, May 02, 2012 11:10 AM
 To: solr-user@lucene.apache.org
 Subject: Re: ExtractRH: How to strip metadata


 I do not. I commented out all of the copyFields provided in the default
 schema.xml that ships with 3.5. My schema is rather minimal. Here is my
 fields block, if this helps:

 <fields>
   <field name="cust"  type="string"  indexed="true" stored="true" required="true"/>
   <field name="asset" type="string"  indexed="true" stored="true" required="true"/>
   <field name="ent"   type="string"  indexed="true" stored="true" required="true"/>
   <field name="meta"  type="text_en" indexed="true" stored="true" required="true"/>
   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <!-- <field name="modified" type="dateTime" indexed="true" stored="true"
        required="false"/> -->
 </fields>


 On Wed, May 2, 2012 at 10:59 AM, Jack Krupansky j...@basetechnology.com
 wrote:

  Check to see if you have a CopyField for a wildcard pattern that copies to
 meta, which would copy all of the Tika-generated fields to meta.

 -- Jack Krupansky

 -Original Message- From: Joseph Hagerty
 Sent: Wednesday, May 02, 2012 9:56 AM
 To: solr-user@lucene.apache.org
 Subject: ExtractRH: How to strip metadata


 Greetings Solr folk,

 How can I instruct the extract request handler to ignore metadata/headers
 etc. when it constructs the content of the document I send to it?

 For example, I created an MS Word document containing just the word
 SEARCHWORD and nothing else. However, when I ship this doc to my solr
 server, here's what's thrown in the index:

 <str name="meta">
 Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments
 stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm
 Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus
 Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 1086 Creation-Date
 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character
 Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y
 Company Parkman Elastomers Pvt Ltd Content-Type application/msword Keywords
 Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD
 </str>

 All I want is the body of the document, in this case the word
 SEARCHWORD.

 For further reference, here's my extraction handler:

 <requestHandler name="/update/extract"
                 startup="lazy"
                 class="solr.extraction.ExtractingRequestHandler">
   <lst name="defaults">
     <!-- All the main content goes into "text"... if you need to return
          the extracted text or do highlighting, use a stored field. -->
     <str name="fmap.content">meta</str>
     <str name="lowernames">true</str>
     <str name="uprefix">ignored_</str>
   </lst>
 </requestHandler>

 (Ironically, meta is the field in the solr schema to which I'm
 attempting
 to extract the body of the document. Don't ask).

 Thanks in advance for any pointers you can provide me.

 --
 - Joe




 --
 - Joe




-- 
- Joe