t829702 edited a comment on pull request #2035: URL: https://github.com/apache/arrow/pull/2035#issuecomment-696480501
> Providing a separate utility in Arrow to parse dates

I didn't mean duplicating the JS date-parsing code, but rather a way to pass a custom parser function to the constructor, something like `new arrow.DateMillisecond( str => Date.parse(str) )`. My date strings are already in `Date.toISOString` format, but if somebody else's date string representation is different, they could pass in some sort of `d3.timeParse("...")` instead. Then anyone could emit a stream of objects like `{"date":"2020-09-18T21:42:30.324Z", ...}` directly into `arrow.Builder.throughAsyncIterable(...)`.

I have tried the Python implementation's json package. It has some _Automatic Type Inference_, but it recognizes only two time string formats, `"YYYY-MM-DD"` and `"YYYY-MM-DD hh:mm:ss"`; all other time strings are left as strings (Utf8), which is no good. I also don't see any lower-level way to tell it the types: I tried passing in an explicit schema, but it did not work as expected, and my sender and receiver fields were all left as strings, so trying a dictionary did not help at all. See https://arrow.apache.org/docs/python/json.html#automatic-type-inference

Would it be an advanced usage if each DataType could take an optional parser?

1. `new arrow.DateMillisecond( str => Date.parse(str) )` for the JS standard `"2020-09-18T21:42:30.324Z"` time string
2.
`new arrow.DateMillisecond( d3.timeParse("%Y-%m-%d %H:%M:%S") )` for other formats like `"2020-09-18 21:42:30"`

The Rust implementation's json package looks like it has a nicer `arrow::json::reader::infer_json_schema`, which I haven't tried yet: https://docs.rs/arrow/1.0.1/arrow/json/reader/fn.infer_json_schema.html

When I said a `50MB line-json file`, it's already in `ndjson` format: each line is a compact and valid JSON value, but the whole file is not. I have read its code (https://github.com/ndjson/ndjson.js/blob/master/index.js#L17); underneath it's the same as calling `JSON.parse` after reading each line, so it shouldn't be much faster, but it can save some lines of code.

I have read your csv-to-arrow; thanks again for all the questions answered. One more question, though: why not ship it inside the apache-arrow NPM package? All the other implementations package one or more helper utilities, which is really helpful when working with binary Arrow files on the command line.

1. One major feature would be schema inference, equivalent to the Python json package's _Automatic Type Inference_: https://github.com/trxcllnt/csv-to-arrow-js/blob/master/index.js#L51
2. We could also have a `json-to-arrow` helper, with smart schema inference.

It would be nice if the schema inference (or _Automatic Type Inference_) could:

1. recognize some more popular time string formats, something like [`d3.autoType`](https://github.com/d3/d3-dsv/blob/master/src/autoType.js#L9-L11) does
2. detect each number column's range and precision, and use only the minimum type that covers all values (say, if Int32 covers all values, don't use Int64; and if Float32 covers all, don't use Float64)
3. try its best to use a dictionary if a string column has a small number of distinct values (say, less than 10% of the total number of rows)
4. anything else?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
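To make the pluggable-parser idea concrete: the `new arrow.DateMillisecond(parseFn)` constructor argument is only a proposal and does not exist in apache-arrow today, but the two parse functions it would receive can be sketched in plain JavaScript. The `customParse` helper below is a hand-rolled, UTC-only stand-in for `d3.utcParse("%Y-%m-%d %H:%M:%S")` (note that `d3.timeParse` itself parses in local time):

```javascript
// Parser for the JS standard ISO format, e.g. "2020-09-18T21:42:30.324Z".
// Date.parse returns epoch milliseconds, which is exactly what a
// DateMillisecond builder would need.
const isoParse = str => Date.parse(str);

// Hand-rolled, UTC-only stand-in for d3.utcParse("%Y-%m-%d %H:%M:%S"),
// for strings like "2020-09-18 21:42:30". Returns NaN on mismatch.
const customParse = str => {
  const m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(str);
  if (!m) return NaN;
  return Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]);
};

console.log(isoParse('2020-09-18T21:42:30.324Z')); // epoch ms for that instant
console.log(customParse('2020-09-18 21:42:30'));   // epoch ms, assuming UTC
```

Either function produces the epoch-millisecond numbers a `DateMillisecond` column stores, so the hypothetical constructor would only need to apply it to each incoming string.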
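On the NDJSON point: as the linked ndjson.js source shows, a reader really is just `JSON.parse` per line. A minimal sketch, with a made-up `parseNdjson` name (not part of any package):

```javascript
// Minimal NDJSON parser: each line is a complete JSON value,
// so parsing is just JSON.parse applied line by line.
function parseNdjson(text) {
  return text
    .split('\n')
    .filter(line => line.trim().length > 0) // skip blank and trailing lines
    .map(line => JSON.parse(line));
}

const sample =
  '{"date":"2020-09-18T21:42:30.324Z","v":1}\n' +
  '{"date":"2020-09-18T21:42:31.000Z","v":2}\n';

const rows = parseNdjson(sample);
console.log(rows.length); // 2
console.log(rows[0].v);   // 1
```

A streaming version would split on newlines as chunks arrive instead of loading the whole file, but the per-line `JSON.parse` core is the same.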
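The inference heuristics on the wishlist (time-string detection, minimal numeric width, dictionary for low-cardinality strings) can be sketched roughly as below. The returned strings are just labels standing in for Arrow DataTypes, and the regex and 10% threshold are illustrative choices, not anything Arrow implements:

```javascript
// Illustrative schema-inference heuristics over a sampled column.
const INT32_MIN = -(2 ** 31), INT32_MAX = 2 ** 31 - 1;
// Matches "YYYY-MM-DD" optionally followed by a time part, covering both
// formats pyarrow's json reader recognizes plus the JS ISO format.
const TIME_LIKE = /^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}:\d{2}(\.\d+)?Z?)?$/;

function inferColumnType(values) {
  if (values.every(v => typeof v === 'number')) {
    if (values.every(Number.isInteger)) {
      // Use the narrowest integer type that covers the observed range.
      return values.every(v => v >= INT32_MIN && v <= INT32_MAX)
        ? 'Int32' : 'Int64';
    }
    return 'Float64'; // Float32-vs-Float64 precision check omitted for brevity
  }
  if (values.every(v => typeof v === 'string')) {
    if (values.every(v => TIME_LIKE.test(v))) return 'DateMillisecond';
    // Dictionary-encode when distinct values are <= 10% of the rows.
    if (new Set(values).size <= values.length * 0.1) return 'Dictionary<Utf8>';
    return 'Utf8';
  }
  return 'Utf8'; // mixed or unsupported: fall back to strings
}

console.log(inferColumnType([1, 2, 3]));              // Int32
console.log(inferColumnType([2 ** 40]));              // Int64
console.log(inferColumnType(['2020-09-18 21:42:30'])); // DateMillisecond
```

A real implementation would sample rows incrementally and widen types as it goes (e.g. promote Int32 to Int64 when an out-of-range value appears), rather than requiring the whole column up front.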