Re: Indexing PDF Files

Furkan KAMACI Wed, 24 Apr 2013 15:50:41 -0700

Here is my definition for handler:

<requestHandler name="/update/extract" class="solr.extraction.
ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>





2013/4/25 Furkan KAMACI <furkankam...@gmail.com>

> I just want to search on rich documents but I still get same error. I have
> copied example folder into anywhere else at my computer. I have copied dist
> and contrib folders from my build folder into that copy of example folder
> (because solr-cell etc. are within that folders) However I still get same
> error. If any of you could help me you are welcome. Here is my schema:
>
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <!--
> Licensed to the Apache Software Foundation (ASF) under one or more
> contributor license agreements. See the NOTICE file distributed with
> this work for additional information regarding copyright ownership.
> The ASF licenses this file to You under the Apache License, Version 2.0
> (the "License"); you may not use this file except in compliance with
> the License. You may obtain a copy of the License at
>
> http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.
> -->
> <!--
> Description: This document contains Solr 4.x schema definition to
> be used with Solr integration currently build into Nutch.
> This schema is not minimal, there are some useful field type definitions
> left,
> and the set of fields and their flags (indexed/stored/term vectors) can be
> further optimized depending on needs. See
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup
> for more info.
> -->
>
> <schema name="nutch" version="1.5">
>
> <types>
>
> <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
>
>
> <!--
> Default numeric field types. For faster range queries, consider the
> tint/tfloat/tlong/tdouble types.
> -->
> <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
> omitNorms="true" positionIncrementGap="0"/>
>
> <!--
> Numeric field types that index each value at various levels of precision
> to accelerate range queries when the number of values between the range
> endpoints is large. See the javadoc for NumericRangeQuery for internal
> implementation details.
>
> Smaller precisionStep values (specified in bits) will lead to more tokens
> indexed per value, slightly larger index size, and faster range queries.
> A precisionStep of 0 disables indexing at different precision levels.
> -->
> <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
> omitNorms="true" positionIncrementGap="0"/>
> <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
> omitNorms="true" positionIncrementGap="0"/>
>
> <!-- The format for this date field is of the form 1995-12-31T23:59:59Z,
> and
> is a more restricted form of the canonical representation of dateTime
> http://www.w3.org/TR/xmlschema-2/#dateTime
> The trailing "Z" designates UTC time and is mandatory.
> Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
> All other components are mandatory.
>
> Expressions can also be used to denote calculations that should be
> performed relative to "NOW" to determine the value, ie...
>
> NOW/HOUR
> ... Round to the start of the current hour
> NOW-1DAY
> ... Exactly 1 day prior to now
> NOW/DAY+6MONTHS+3DAYS
> ... 6 months and 3 days in the future from the start of
> the current day
>
> Consult the DateField javadocs for more information.
>
> Note: For faster range queries, consider the tdate type
> -->
> <fieldType name="date" class="solr.TrieDateField" omitNorms="true"
> precisionStep="0" positionIncrementGap="0"/>
>
> <!-- A Trie based date field for faster date range queries and date
> faceting. -->
> <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
> precisionStep="6" positionIncrementGap="0"/>
>
>
> <!-- solr.TextField allows the specification of custom text analyzers
> specified as a tokenizer and a list of token filters. Different
> analyzers may be specified for indexing and querying.
>
> The optional positionIncrementGap puts space between multiple fields of
> this type on the same document, with the purpose of preventing false phrase
> matching across fields.
>
> For more info on customizing your analyzer chain, please see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> -->
>
> <!-- A general text field that has reasonable, generic
> cross-language defaults: it tokenizes with StandardTokenizer,
> removes stop words from case-insensitive "stopwords.txt"
> (empty by default), and down cases. At query time only, it
> also applies synonyms. -->
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
> <!-- in this example, we will only use synonyms at query time
> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
> ignoreCase="true" expand="false"/>
> -->
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <!-- A text field with defaults appropriate for English: it
> tokenizes with StandardTokenizer, removes English stop words
> (stopwords.txt), down cases, protects words from protwords.txt, and
> finally applies Porter's stemming. The query time analyzer
> also applies synonyms from synonyms.txt. -->
> <fieldType name="text_en" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <!-- in this example, we will only use synonyms at query time
> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
> ignoreCase="true" expand="false"/>
> -->
> <!-- Case insensitive stop word removal.
> add enablePositionIncrements=true in both the index and query
> analyzers to leave a 'gap' for more accurate phrase queries.
> -->
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.EnglishPossessiveFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <!-- Optionally you may want to use this less aggressive stemmer instead
> of PorterStemFilterFactory:
> <filter class="solr.EnglishMinimalStemFilterFactory"/>
> -->
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.EnglishPossessiveFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <!-- Optionally you may want to use this less aggressive stemmer instead
> of PorterStemFilterFactory:
> <filter class="solr.EnglishMinimalStemFilterFactory"/>
> -->
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <!-- A text field with defaults appropriate for English, plus
> aggressive word-splitting and autophrase features enabled.
> This field is just like text_en, except it adds
> WordDelimiterFilter to enable splitting and matching of
> words on case-change, alpha numeric boundaries, and
> non-alphanumeric chars. This means certain compound word
> cases will work, for example query "wi fi" will match
> document "WiFi" or "wi-fi". However, other cases will still
> not match, for example if the query is "wifi" and the
> document is "wi fi" or if the query is "wi-fi" and the
> document is "wifi".
> -->
> <fieldType name="text_en_splitting" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <!-- in this example, we will only use synonyms at query time
> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
> ignoreCase="true" expand="false"/>
> -->
> <!-- Case insensitive stop word removal.
> add enablePositionIncrements=true in both the index and query
> analyzers to leave a 'gap' for more accurate phrase queries.
> -->
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <!-- Less flexible matching, but less false matches. Probably not ideal
> for product names,
> but may be good for SKUs. Can insert dashes in the wrong place and still
> match. -->
> <fieldType name="text_en_splitting_tight" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
> <analyzer>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> generateNumberParts="0" catenateWords="1" catenateNumbers="1"
> catenateAll="0"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> <filter class="solr.EnglishMinimalStemFilterFactory"/>
> <!-- this filter can remove any duplicate tokens that appear at the same
> position - sometimes
> possible with WordDelimiterFilter in conjuncton with stemming. -->
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <!-- Just like text_general except it reverses the characters of
> each token, to enable more efficient leading wildcard queries. -->
> <fieldType name="text_general_rev" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldType>
>
> <fieldtype name="phonetic" stored="false" indexed="true"
> class="solr.TextField" >
> <analyzer>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
> </analyzer>
> </fieldtype>
>
> <fieldtype name="payloads" stored="false" indexed="true"
> class="solr.TextField" >
> <analyzer>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <!--
> The DelimitedPayloadTokenFilter can put payloads on tokens... for example,
> a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f
> Attributes of the DelimitedPayloadTokenFilterFactory :
> "delimiter" - a one character delimiter. Default is | (pipe)
> "encoder" - how to encode the following value into a playload
> float -> org.apache.lucene.analysis.payloads.FloatEncoder,
> integer -> o.a.l.a.p.IntegerEncoder
> identity -> o.a.l.a.p.IdentityEncoder
> Fully Qualified class name implementing PayloadEncoder, Encoder must have
> a no arg constructor.
> -->
> <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
> </analyzer>
> </fieldtype>
>
> <!-- lowercases the entire field value, keeping it as a single token. -->
> <fieldType name="lowercase" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.KeywordTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory" />
> </analyzer>
> </fieldType>
>
> <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1"/>
> </analyzer>
> </fieldType>
>
>
> <fieldType name="text_path" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
> </analyzer>
> </fieldType>
>
> <!-- since fields of this type are by default not stored or indexed,
> any data added to them will be ignored outright. -->
> <fieldtype name="ignored" stored="false" indexed="false"
> multiValued="true" class="solr.StrField" />
>
> </types>
>
> <fields>
> <field name="id" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
> <field name="text" type="text_general" indexed="true" stored="true"/>
> <dynamicField name="attr_*" type="text_general" indexed="true"
> stored="true" multiValued="true"/>
> <!-- Common metadata fields, named specifically to match up with
> SolrCell metadata when parsing rich documents such as Word, PDF.
> Some fields are multiValued only because Tika currently may return
> multiple values for them. Some metadata is parsed from the documents,
> but there are some which come from the client context:
> "content_type": From the HTTP headers of incoming stream
> "resourcename": From SolrCell request param resource.name
> -->
> <field name="title" type="text_general" indexed="true" stored="true"
> multiValued="true"/>
> <field name="subject" type="text_general" indexed="true" stored="true"/>
> <field name="description" type="text_general" indexed="true"
> stored="true"/>
> <field name="comments" type="text_general" indexed="true" stored="true"/>
> <field name="author" type="text_general" indexed="true" stored="true"/>
> <field name="keywords" type="text_general" indexed="true" stored="true"/>
> <field name="category" type="text_general" indexed="true" stored="true"/>
> <field name="resourcename" type="text_general" indexed="true"
> stored="true"/>
> <field name="url" type="text_general" indexed="true" stored="true"/>
> <field name="content_type" type="string" indexed="true" stored="true"
> multiValued="true"/>
> <field name="last_modified" type="date" indexed="true" stored="true"/>
> <field name="links" type="string" indexed="true" stored="true"
> multiValued="true"/>
>
> <!-- Main body of document extracted by SolrCell.
> NOTE: This field is not indexed by default, since it is also copied to
> "text"
> using copyField below. This is to save space. Use this field for returning
> and
> highlighting document content. Use the "text" field to search the content.
> -->
> <field name="content" type="text_general" indexed="false" stored="true"
> multiValued="true"/>
>
> <field name="_version_" type="long" indexed="true" stored="true"/>
>
> <dynamicField name="*" type="ignored" multiValued="true"/>
> </fields>
>
> <uniqueKey>id</uniqueKey>
> <defaultSearchField>text</defaultSearchField>
> <solrQueryParser defaultOperator="OR"/>
>
> <!-- Text fields from SolrCell to search by default in our catch-all field
> -->
> <copyField source="title" dest="text"/>
> <copyField source="author" dest="text"/>
> <copyField source="description" dest="text"/>
> <copyField source="keywords" dest="text"/>
> <copyField source="content" dest="text"/>
> <copyField source="content_type" dest="text"/>
> <copyField source="resourcename" dest="text"/>
> <copyField source="url" dest="text"/>
>
> <!-- Create a string version of author for faceting -->
> <copyField source="author" dest="author_s"/>
>
> </schema>
>
>
>
>
> 2013/4/25 Erik Hatcher <erik.hatc...@gmail.com>
>
>> Did you restart after adding those fields and types?
>>
>> On Apr 24, 2013, at 16:59, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>
>> > I have added that fields:
>> >
>> > <field name="text" type="text_general" indexed="true" stored="true"/>
>> > <dynamicField name="attr_*" type="text_general" indexed="true"
>> > stored="true" multiValued="true"/>
>> > <dynamicField name="ignored_*" type="ignored"/>
>> >
>> > and I have that definition:
>> >
>> > <fieldtype name="ignored" stored="false" indexed="false"
>> multiValued="true"
>> > class="solr.StrField" />
>> >
>> > here is my error:
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <response>
>> > <lst name="responseHeader">
>> > <int name="status">400</int>
>> > <int name="QTime">4154</int>
>> > </lst>
>> > <lst name="error">
>> > <str name="msg">ERROR: [doc=1] unknown field 'ignored_meta'</str>
>> > <int name="code">400</int>
>> > </lst>
>> > </response>
>> >
>> > What should I do more?
>> >
>> > 2013/4/24 Erik Hatcher <erik.hatc...@gmail.com>
>> >
>> >> Also, at Solr startup time it logs what it loads from those <lib>
>> >> elements, so you can see whether it is loading the files you intend to
>> or
>> >> not.
>> >>
>> >>        Erik
>> >>
>> >> On Apr 24, 2013, at 10:05 , Alexandre Rafalovitch wrote:
>> >>
>> >>> Have you tried using absolute path to the relevant urls? That will
>> >>> cleanly split the problem into 'still not working' and 'wrong relative
>> >>> path'.
>> >>>
>> >>> Regards,
>> >>>  Alex.
>> >>> On Wed, Apr 24, 2013 at 9:02 AM, Furkan KAMACI <
>> furkankam...@gmail.com>
>> >> wrote:
>> >>>> <lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
>> >>>> <lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
>> >>>
>> >>>
>> >>>
>> >>> Personal blog: http://blog.outerthoughts.com/
>> >>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> >>> - Time is the quality of nature that keeps events from happening all
>> >>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> >>> book)
>> >>
>> >>
>>
>
>

Re: Indexing PDF Files

Reply via email to