[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1190:
-----------------------------------
    Fix Version/s: 1.8

> MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>             Fix For: 2.3, 1.8
>
>         Attachments: date-styles.txt, MoreIndexingFilter.patch, NUTCH-1190-trunk.patch
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move them to an extra config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named
> "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>
>     URL res = conf.getResource("date-styles.txt");
>     if (res == null) {
>       LOG.error("Can't find resource: date-styles.txt");
>     } else {
>       try {
>         List lines = FileUtils.readLines(new File(res.getFile()));
>         for (int i = 0; i < lines.size(); i++) {
>           String dateStyle = (String) lines.get(i);
>           // drop blank lines
>           if (StringUtils.isBlank(dateStyle)) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           dateStyle = StringUtils.trim(dateStyle);
>           // drop comment lines
>           if (dateStyle.startsWith("#")) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           lines.set(i, dateStyle);
>         }
>         dateStyles = new String[lines.size()];
>         lines.toArray(dateStyles);
>       } catch (IOException e) {
>         LOG.error("Failed to load resource: date-styles.txt", e);
>       }
>     }
>   }
> {code}
> Then parse "lastModified" like this (sample):
> {code}
>   private long getTime(String date, String url) {
>     ......
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ......
>     return time;
>   }
> {code}
> This patch also contains the patch for
> [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.
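
For illustration only, here is a minimal standalone sketch of the idea described in the issue: keep one date pattern per line in a text file, skip blank lines and "#" comments, and try each pattern in turn. It uses only the JDK (SimpleDateFormat) rather than the Commons Lang DateUtils call in the patch. The class name DateStyles, its method names, the sample patterns, and the choice of English locale and UTC as the default zone are all assumptions for this example, not part of Nutch or of the attached patch.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper; not part of Nutch or of the attached patch. */
final class DateStyles {

  /** Keeps one trimmed pattern per line, skipping blank lines and '#' comments. */
  static List<String> fromLines(List<String> lines) {
    List<String> patterns = new ArrayList<String>();
    for (String line : lines) {
      String pattern = line.trim();
      if (pattern.isEmpty() || pattern.startsWith("#")) {
        continue;
      }
      patterns.add(pattern);
    }
    return patterns;
  }

  /** Returns the first parse that consumes the whole text, or null if no pattern matches. */
  static Date parse(String text, List<String> patterns) {
    for (String pattern : patterns) {
      SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
      // Assumption: UTC applies when the text itself carries no time zone.
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      format.setLenient(false);
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(text, pos);
      if (parsed != null && pos.getIndex() == text.length()) {
        return parsed;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Sample content standing in for a date-styles.txt file.
    List<String> lines = Arrays.asList(
        "# one pattern per line",
        "",
        "  yyyy-MM-dd  ",
        "EEE, dd MMM yyyy HH:mm:ss zzz");
    List<String> patterns = fromLines(lines);
    System.out.println(patterns);
    Date parsed = parse("Mon, 31 Oct 2011 10:15:30 GMT", patterns);
    System.out.println(parsed == null ? "no match" : String.valueOf(parsed.getTime()));
  }
}
```

Building a new list instead of removing entries from the list being iterated avoids the index bookkeeping (i--) of the loop in the issue. Returning null for "no pattern matched" is just one possible convention; the patch's DateUtils.parseDate signals that case with an exception instead.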