[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/438 --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212807806 --- Diff: solr/core/src/test/org/apache/solr/update/processor/ParsingFieldUpdateProcessorsTest.java --- @@ -992,6 +992,14 @@ public void testLenient() throws IOException { assertParsedDate("Friday Oct 7 13:14:15 2005", Date.from(inst20051007131415()), "parse-date-patterns-default-config"); } + public void testRfc2616() throws Exception { +assertParsedDate("Fri Oct 7 13:14:15 2005" , Date.from(inst20051007131415()), "parse-date-patterns-default-config"); + } + + public void testRfc2616Leniency() throws Exception { --- End diff -- Should be named testAsctimeLeniency (see my previous comment). Glad to see this test works. 'course there are already some tests with similar names... so feel free to combine some. For example the "testLenient" method is an Ansi C test of leniency; by all means combine tests into one test method as appropriate. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212807780 --- Diff: solr/core/src/test/org/apache/solr/update/processor/ParsingFieldUpdateProcessorsTest.java --- @@ -992,6 +992,14 @@ public void testLenient() throws IOException { assertParsedDate("Friday Oct 7 13:14:15 2005", Date.from(inst20051007131415()), "parse-date-patterns-default-config"); } + public void testRfc2616() throws Exception { --- End diff -- Should be named testAsctime since that's the format of this input you are testing. When I mentioned RFC2616, that is a spec that in turn refers to multiple formats (including asctime). Also, like I pointed out, need to test with 2 spaces left of a single digit day. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212803611 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,13 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z - -MM-dd'T'HH:mm[:ss[,SSS]][z + -MM-dd['T'[HH:mm[:ss[.SSS]][z + -MM-dd['T'[HH:mm[:ss[,SSS]][z -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z - -MM-dd + EEE MMM d [HH:mm:ss ][z ] + , dd-MMM-yy HH:mm:ss [z --- End diff -- >And can you add a test we parse "Sun Nov 6 08:49:37 1994" BTW, I used "Fri Oct 7 13:14:15 2005" instead, since it uses the same date format, but there is a helper method to generate this instant in ParsingFieldUpdateProcessorsTest#inst20051007131415. Hope it is OK. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212803494 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,13 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z - -MM-dd'T'HH:mm[:ss[,SSS]][z + -MM-dd['T'[HH:mm[:ss[.SSS]][z + -MM-dd['T'[HH:mm[:ss[,SSS]][z -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z - -MM-dd + EEE MMM d [HH:mm:ss ][z ] + , dd-MMM-yy HH:mm:ss [z --- End diff -- Oh, now I recall why I added the optional time zone to asctime. When the optional time zone is added, the pattern covers AKS and EDT date formats as well. IMO, it might be beneficial to keep it in the default config. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212803184 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,13 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z - -MM-dd'T'HH:mm[:ss[,SSS]][z + -MM-dd['T'[HH:mm[:ss[.SSS]][z + -MM-dd['T'[HH:mm[:ss[,SSS]][z -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z - -MM-dd + EEE MMM d [HH:mm:ss ][z ] + , dd-MMM-yy HH:mm:ss [z --- End diff -- I made time of day optional in asciitime so we can remove this configuration -MM-dd, which uses same without a time of day. I will remove the optional timezone in RFC-1123 & RFC-1036, guess it is my bad. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212800845 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,13 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z - -MM-dd'T'HH:mm[:ss[,SSS]][z + -MM-dd['T'[HH:mm[:ss[.SSS]][z + -MM-dd['T'[HH:mm[:ss[,SSS]][z -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z - -MM-dd + EEE MMM d [HH:mm:ss ][z ] + , dd-MMM-yy HH:mm:ss [z --- End diff -- These last two patterns are RFC-1036 and RFC-1123. Neither should have an optional timezone -- as seen in `ExtractDateUtils.PATTERN_RFC1036`, `ExtractDateUtils.PATTERN_RFC1123`, `DateTimeFormatter.RFC_1123_DATE_TIME`, and I looked at RFC-1036 spec as well. Why did you make the time of day optional in ASCTIME? I don't see that in ExtractDateUtils. And can you add a test we parse "Sun Nov 6 08:49:37 1994" (example date from RFC-2616 which is HTTP/1.1 spec which lists RFC-1123, RFC-1036, and asctime() -- the origin of why we see these particular patterns in ExtractDateUtils, borrowed from ApacheHttpClient). The double-space before the single digit day is deliberate. It may be necessary to use the 'p' (pad modifier) as specified in DateTimeFormatter. I've seen conflicting information from internet searches on the "day" portion of asctime() as either 2-digit or 1; so it'd be good to test that either work. "Leniency" will hopefully ensure one pattern works without needing to add more variations. Good catch on noticing the seconds is optional in RFC-1123! Can you reverse the order of these last 3 patterns? Based on RFC-2616 (HTTP/1.1), this is the order defined by preference. BTW if we really did want RFC-1123 & RFC-1036 patterns to have an optional timezone, then it would need to be specified differently than how you did it. You put the optional start bracket to the right of the space when it would need to be to the left of it. Obviously, all tweaks we do to these patterns need to be redone between * `solr/server/solr/configsets/_default/conf/solrconfig.xml` * `solr/core/src/test- files/solr/collection1/conf/solrconfig-parsing-update-processor-chains.xml` * `solr/contrib/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml` Probably elsewhere; I can check before committing. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212788729 --- Diff: solr/contrib/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml --- @@ -43,6 +43,25 @@ + + + -MM-dd['T'[HH:mm[:ss[.SSS]][z + -MM-dd['T'[HH:mm[:ss[,SSS]][z + -MM-dd HH:mm[:ss[.SSS]][z + -MM-dd HH:mm[:ss[,SSS]][z + EEE MMM d [HH:mm:ss ][z ] + , dd-MMM-yy HH:mm:ss [z + [EEE, ]dd MMM HH:mm[:ss] [z --- End diff -- > BTW I was looking at DateTimeFormatter.RFC_1123_DATE_TIME and noticed the leading day of week is optional. I also had a little peek and noticed the second of the minute is also optional, here is the code snippet that initializes the default RFC_1123_DATE_TIME DateTimeFromatter: `.optionalStart() .appendLiteral(':') .appendValue(SECOND_OF_MINUTE, 2) .optionalEnd()` --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212041807 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,11 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z --- End diff -- By "the old patterns" do you mean the _current_ patterns we see in the default config? Yes. You can tweak an existing pattern to parse more things, but don't remove the ability to parse something it would have parsed before! And yes add/merge the additional ones from SOLR-12591. By "merge" I mean modify some existing patterns to match some of the ones in SOLR-12591. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212033718 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,11 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z --- End diff -- So should I keep the old patterns in addition to the ones from SOLR-12591? --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r212031794 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1141,11 +1141,11 @@ - -MM-dd'T'HH:mm[:ss[.SSS]][z --- End diff -- You just changed the default date patterns to not parse patterns it parsed before. We need to be cautious! For example it used to parse a comma to separate the milliseconds. And the 'T' was optional. Whatever patterns exist here ought to be identical in documentation & test configs (which yes is annoying to keep in sync but shouldn't change often). --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211821295 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z --- End diff -- Yeah, that's what I thought you were going to do. You've separately tested that the ExtractionDateUtils tests also work on these patterns in the last issue, so we want them to be the same; otherwise due to bugs (like your 'hh') we might inadvertently screw that up. If needed, modify/tweak one of those tests/configs. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211821003 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z + -MM-dd hh:mm:ss --- End diff -- Close; it'd be this: `-MM-dd[ hh:mm:ss]` (notice the bracket starts immediately after the 'dd'; otherwise we'd be wrongly insisting on a trailing space. And yes, always use "HH" and not "hh" unless the pattern contains an AM/PM designator (and these don't). --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211818326 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z --- End diff -- Should I just try and copy that part of the config to the default one? --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user barrotsteindev commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211818226 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z + -MM-dd hh:mm:ss --- End diff -- Right, it could be joined with "-MM-dd" as "-MM-dd [hh:mm:ss]". Or should I replace the "hh" bit with "HH"? --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211817633 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z + -MM-dd hh:mm:ss --- End diff -- This pattern could be removed if we add optional brackets to an existing pattern. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/438#discussion_r211817529 --- Diff: solr/server/solr/configsets/_default/conf/solrconfig.xml --- @@ -1146,6 +1146,11 @@ -MM-dd HH:mm[:ss[.SSS]][z -MM-dd HH:mm[:ss[,SSS]][z -MM-dd + EEE, dd MMM HH:mm:ss z + , dd-MMM-yy HH:mm:ss z + EEE MMM d HH:mm:ss + EEE MMM d hh:mm:ss z --- End diff -- The "hh" is erroneous. I'm surprised this even works; perhaps "lenient" allows os to get away with it? I was hoping to keep the added patterns more minimal. You recently added "parse-date-patterns-from-extract-contrib" to solrconfig-parsing-update-processor-chains which is a better guide. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #438: SOLR-12593: Remove date parsing functionality...
GitHub user barrotsteindev opened a pull request: https://github.com/apache/lucene-solr/pull/438 SOLR-12593: Remove date parsing functionality from extraction contrib You can merge this pull request into a Git repository by running: $ git pull https://github.com/barrotsteindev/lucene-solr SOLR-12593 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/438.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #438 commit 2fb612a4944464851ff59a97f4d1fdba28776a87 Author: Bar Rotstein Date: 2018-08-21T19:21:10Z SOLR-12593: add ExtractionDateUtil default date formats to parse date urp default config commit 163a1d44ff958400c3895a0382c526ae9102e2e6 Author: Bar Rotstein Date: 2018-08-21T19:21:32Z replace extraction date util in SolrContentHandler with URP and update test config --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org