If you look at my first email in this thread, it says filtering: false and normalising: false. Even then, it didn’t generate anything.
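As a sanity check, the first-match semantics described in the header comments of the filter file quoted in this message can be reproduced outside Nutch. The following is a minimal sketch in Python, not Nutch's own code. It assumes that a rule matches when its pattern is found anywhere in the URL, and that Python's `re` module behaves like Java's regex engine for these particular patterns. The suffix rule is a shortened stand-in, because the real line is truncated in this message, and the `example.com` URLs are hypothetical.

```python
import re

# Rules taken from the regex-urlfilter.txt quoted in this message.
# ASSUMPTION: the suffix rule below is a shortened stand-in, because the
# real line is cut off in the message.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# shortened stand-in for the suffix rule
-\.(gif|jpg|png|css|zip)$
# skip URLs with slash-delimited segment that repeats 3+ times
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Apply rules in file order; the first matching pattern decides.

    ASSUMPTION: a rule matches if its pattern is found anywhere in the
    URL (re.search). A URL that matches no rule is ignored.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in ("http://example.com/",
                "ftp://example.com/file.txt",
                "http://example.com/logo.png",
                "http://example.com/a/b/a/b/a/b/"):
        print(url, "->", "accepted" if accepts(url, rules) else "ignored")
```

If a seed URL comes back as accepted here, the filter rules themselves are unlikely to be what drops it, which would be consistent with nothing being generated even when filtering is off.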
Here’s my regex-urlfilter.txt file:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jp$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

And here’s my regex-normalize.xml:

<?xml version="1.0"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- This is the configuration file for the RegexUrlNormalize Class.
     This is intended so that users can specify substitutions to be
     done on URLs. The regex engine that is used is Perl5 compatible.
     The rules are applied to URLs in the order they occur in this file. -->

<!-- WATCH OUT: an xml parser reads this file an ampersands must be
     expanded to &amp; -->

<!-- The following rules show how to strip out session IDs, default pages,
     interpage anchors, etc. Order does matter! -->
<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

<!-- The following rules show how to strip out session IDs, default pages,
     interpage anchors, etc. Order does matter!
-->
<regex-normalize>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 6 June 2014 at 1:54:02 am, Lewis John Mcgibbney ([email protected]) wrote:

I suspect that your generator normalization/filtering prevents this URL from getting through

On Thu, Jun 5, 2014 at 1:09 PM, Manikandan Saravanan <[email protected]> wrote:

14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: true
14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: true

Map input records in Generator phase is 0... this is incorrect.

Lewis

