[ 
http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12320732 ] 

Jerome Charron commented on NUTCH-65:
-------------------------------------

Michael Nebel reports some other date parsing problems on the nutch-dev list
(http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00663.html):

*****
    ...can't parse erroneous date: 12.06.2005 22:02:54 GMT
    ...can't parse erroneous date: 14.07.2005 GMT
    ...can't parse erroneous date: 15.10.2003 04:58:08
    ...can't parse erroneous date: 16 6 2005 00:00:00 GMT
    ...can't parse erroneous date: 16.06.2005 10:10:57 GMT
    ...can't parse erroneous date: 2005/06/21 20:51:40.618 GMT+2
    ...can't parse erroneous date: 29.06.2005 GMT
    ...can't parse erroneous date: 31.5.2005; 10:14:49
    ...can't parse erroneous date: 968776128
*****

And so on...

I don't think using a fixed locale (Locale.US) is the solution, since the date
can take various forms (as Michael's logs show).
Instead, the solution is perhaps to use the Jakarta Commons DateUtils.parseDate
method:
http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[])

It would give something like:

// Try each pattern in turn; the first one that matches wins.
Date parsedDate = DateUtils.parseDate(date,
        new String[] {"yyyy/MM/dd",
                      "yyyy.MM.dd HH:mm:ss",
                      "yyyy-MM-dd HH:mm",
                      // ... and so on ...
                      });
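
For anyone who wants to try the idea without adding a dependency, here is a
minimal sketch of the same "try several patterns in turn" approach using only
java.text.SimpleDateFormat. The class name MultiFormatDateParser and the
pattern list are assumptions made for illustration: the patterns were guessed
from some of the log lines above and cover only part of them (the epoch value
968776128 and the "GMT+2" form are not handled). This is not the code in
MoreIndexingFilter.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch: tries a list of patterns in order.
class MultiFormatDateParser {

    // Assumed pattern list, derived from a few of the logged dates.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss z",   // Wed, 10 Sep 2003 11:59:14 GMT
        "EEE, dd MMM yyyy HH:mm:ss",     // Wed, 10 Sep 2003 11:59:14
        "dd.MM.yyyy HH:mm:ss z",         // 12.06.2005 22:02:54 GMT
        "dd.MM.yyyy HH:mm:ss",           // 15.10.2003 04:58:08
        "dd.MM.yyyy z",                  // 14.07.2005 GMT
        "dd.MM.yyyy; HH:mm:ss",          // 31.5.2005; 10:14:49
        "dd MM yyyy HH:mm:ss z",         // 16 6 2005 00:00:00 GMT
    };

    // Returns the parsed date, or null if no pattern matches the whole string.
    static Date parse(String text) {
        for (String pattern : PATTERNS) {
            // Locale.US only fixes the language of day and month names;
            // the variety of layouts is handled by the pattern list.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setLenient(false);
            // Assumption: dates without an explicit zone are treated as GMT.
            format.setTimeZone(TimeZone.getTimeZone("GMT"));

            ParsePosition position = new ParsePosition(0);
            Date parsed = format.parse(text, position);
            // Accept only if the pattern consumed the entire input.
            if (parsed != null && position.getIndex() == text.length()) {
                return parsed;
            }
        }
        return null;
    }
}
```

A caller would use MultiFormatDateParser.parse(date) and log the
"can't parse" message only when it returns null. Requiring the whole string to
be consumed is a deliberate choice here, so that trailing garbage is not
silently ignored.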



> index-more plugin can't parse large set of  modification-date
> -------------------------------------------------------------
>
>          Key: NUTCH-65
>          URL: http://issues.apache.org/jira/browse/NUTCH-65
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: nutch 0.7, java 1.5, linux
>     Reporter: Lutischán Ferenc

>
> I found a problem in MoreIndexingFilter.java.
> When indexing segments, I get a long list of error messages such as:
> can't parse errorenous date: Wed, 10 Sep 2003 11:59:14 or
> can't parse errorenous date: Wed, 10 Sep 2003 11:59:14GMT
> I modified the source code (I have not made a patch):
> Original (lines 137-138):
> DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy zzz");
> Date d = df.parse(date);
> New:
> DateFormat df = new SimpleDateFormat("EEE, MMM dd HH:mm:ss yyyy", Locale.US);
> Date d = df.parse(date.substring(0,25));
> The modified code works fine.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
