[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610149#comment-17610149 ]

Todd Farmer commented on ARROW-13028:

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nate Clark
> Assignee: Nate Clark
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large
> data sets it could be better to use 32 bit types to save over all memory. To
> do this it would be useful to add an option to ConvertOptions to try 32 bit
> numbers before 64 bit. By default this option would be disabled.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489671#comment-17489671 ]

Antoine Pitrou commented on ARROW-13028:

Well, it's probably worth keeping open for now. But the solution should revolve around a more future-proof setting than the proposed boolean setting.
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489594#comment-17489594 ]

Nate Clark commented on ARROW-13028:

[~apitrou] Is it worth keeping this ticket open to discuss this further, or should it be closed because there is no interest in implementing this behavior?
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428226#comment-17428226 ]

Nate Clark commented on ARROW-13028:

[~apitrou] What would you envision as a more generic setting for influencing the type inference? Something like an enum with values such as "only infer 64-bit", "attempt 32-bit int", and "force 32-bit float"?
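To make the enum idea above concrete, here is a minimal Python sketch of how such a setting might steer inference. The names `NumericInference` and `infer_numeric_type` are hypothetical and are not part of Arrow's `ConvertOptions`; the three modes simply mirror the ones listed in the comment, and the parsing rules are those of Python's `int()` and `float()`, not Arrow's.

```python
from enum import Enum

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


class NumericInference(Enum):
    # Hypothetical setting; these names do not exist in Arrow.
    ONLY_64BIT = "only_64bit"
    ATTEMPT_INT32 = "attempt_int32"
    FORCE_FLOAT32 = "force_float32"


def _parse_all(values, parser):
    """Return the parsed values, or None if any value fails to parse."""
    try:
        return [parser(v) for v in values]
    except ValueError:
        return None


def infer_numeric_type(values, mode=NumericInference.ONLY_64BIT):
    """Pick a column type name for a list of CSV cell strings."""
    ints = _parse_all(values, int)
    if ints is not None and all(INT64_MIN <= n <= INT64_MAX for n in ints):
        fits_32 = all(INT32_MIN <= n <= INT32_MAX for n in ints)
        if mode is NumericInference.ATTEMPT_INT32 and fits_32:
            return "int32"
        return "int64"
    if _parse_all(values, float) is not None:
        # Floats are never "detected" as 32-bit here, only forced.
        if mode is NumericInference.FORCE_FLOAT32:
            return "float32"
        return "float64"
    return "string"
```

With the default mode the result is always a 64-bit type, which matches the current behavior described in the issue; the other two modes opt in to narrower types.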
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420774#comment-17420774 ]

Antoine Pitrou commented on ARROW-13028:

I agree with Eduardo that it feels a bit opinionated. If we want to allow users to influence integer inference, at least a more general setting should probably be exposed than a simple boolean option to try 32-bit inference. [~npr] [~icook] what do you think?
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420767#comment-17420767 ]

Nate Clark commented on ARROW-13028:

I agree that the largest type could be considered safest, especially for floating point. In theory inference could start at int8 and work up from there, if there is any interest in that, but signed vs unsigned is probably not as beneficial. For floating point the detection is more difficult: since it is already considered an imprecise format, parsers will force values to fit the size, and detection of double vs float would have to be done outside the parser. I did put up the linked MR for int32 detection so that you can at least see that part implemented.
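The "start at int8 and work up" idea could be sketched as follows. This is a plain-Python illustration, not Arrow code: `smallest_signed_int` is a hypothetical helper, and only signed widths are tried, following the comment's point that signed vs unsigned is less beneficial.

```python
SIGNED_WIDTHS = (8, 16, 32, 64)


def smallest_signed_int(values):
    """Return the narrowest signed type name that holds every value.

    Returns None if any value is not an integer or does not fit in 64 bits.
    """
    try:
        parsed = [int(v) for v in values]
    except ValueError:
        return None
    for bits in SIGNED_WIDTHS:
        lo = -(1 << (bits - 1))
        hi = (1 << (bits - 1)) - 1
        if all(lo <= n <= hi for n in parsed):
            return f"int{bits}"
    return None
```

Each extra rung on the ladder is one more type that a later value can overflow, which is the robustness concern raised elsewhere in the thread.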
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417922#comment-17417922 ]

Eduardo Ponce commented on ARROW-13028:

I think that having CSV infer the largest type is more robust/safe, with explicit column types used for other conversions. If inference is set to go from smallest to largest, then where do these decisions end? Do we infer first as signed or unsigned integers? Int8 vs. int32, etc.? Half-float vs float vs double? We can definitely decide to simply try signed int32 and float as the smallest types, but it still feels a bit opinionated.
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397395#comment-17397395 ]

Nate Clark commented on ARROW-13028:

>> Is this by design or accident?
> I would say by accident, but I'm not sure what you mean with "the precision
> of the string is too much for a float". Strictly speaking, some very short
> decimal numbers are not exactly representable in binary floating-point, for
> example "0.3". Should we reject them?

I was thinking that something like `3.78946546156984798497501e10` can be better represented as a double than as a float. But as you point out, there are values which cannot be fully represented in either, so there might not be a good way to detect when double should be used instead of float.
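A small Python illustration of why this detection is hard, using only the two values quoted above. Narrowing through `struct` stands in for a 32-bit float parser, which is an assumption made for the sketch; Arrow's actual parser is not shown here, and the helper names are hypothetical.

```python
import struct
from fractions import Fraction


def as_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def survives_float32(text):
    """True if narrowing the parsed double to float32 loses nothing."""
    d = float(text)
    return as_float32(d) == d


def exactly_representable(text):
    """True if the decimal string equals its parsed double exactly."""
    return Fraction(text) == Fraction(float(text))


long_value = "3.78946546156984798497501e10"

# The long value parses fine either way, but float32 drops digits that
# a double keeps, so a double is the closer representation.
long_fits_float32 = survives_float32(long_value)

# Neither the long value nor the short "0.3" is exact even as a double,
# so "is the value exact?" cannot be the test for choosing double.
long_is_exact = exactly_representable(long_value)
short_is_exact = exactly_representable("0.3")
```

A "loses nothing when narrowed" rule would also send "0.3" to double, so it would rarely pick float32 for real-world decimal data.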
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394197#comment-17394197 ]

Antoine Pitrou commented on ARROW-13028:

Sorry for not answering earlier:

> Is this by design or accident?

I would say by accident, but I'm not sure what you mean by "the precision of the string is too much for a float". Strictly speaking, some very short decimal numbers are not exactly representable in binary floating-point, for example "0.3". Should we reject them?

> Is there any interest in adding this for at least ints?

Potentially. [~npr] [~jonkeane] what do you think?
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394176#comment-17394176 ]

Nate Clark commented on ARROW-13028:

Is there any interest in adding this for at least ints? If not, I can just close this ticket.
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360401#comment-17360401 ]

Nate Clark commented on ARROW-13028:

I was already working on this and discovered something which might be a problem with my idea. It looks like any numeric value can be parsed as a float, even if the string carries more precision than a float can hold. Because of this, the float parse will succeed even when it would be better to parse the value as a double. Is this by design or accident?
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360323#comment-17360323 ]

Nate Clark commented on ARROW-13028:

Ideally one would pass the column types if they are known, but for my use case I am relying on the reader's type inference to learn what the column types are. When relying on the reader for the types, the only way to get 32-bit values would be to re-parse the CSV with the type forced to 32 bits, which fails if a value does not fit in 32 bits.

It is true that a 64-bit number in one of the later blocks would cause a parsing error, but the same would happen if a column was inferred as int and was in fact a float, or if a column is empty 40% of the time and the first block happens to have no data for it. This is more a limitation of the schema being determined by the first block and not being able to change after that.

One of the reasons the default is to not try 32-bit values is to avoid potential parse errors on subsequent blocks, so this option should only really be used if the caller knows all numeric columns can be represented in 32 bits or can handle the parse error.
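A sketch of the limitation described above, assuming a reader that fixes the column type from the first block and never revisits it. `stream_column` and its `try_32bit` flag are hypothetical stand-ins for the proposed option, not Arrow APIs.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def parse_int32(text):
    """Parse an integer, rejecting values outside the int32 range."""
    n = int(text)
    if not INT32_MIN <= n <= INT32_MAX:
        raise ValueError(f"{text!r} does not fit in int32")
    return n


def parse_int64(text):
    """Parse an integer, rejecting values outside the int64 range."""
    n = int(text)
    if not INT64_MIN <= n <= INT64_MAX:
        raise ValueError(f"{text!r} does not fit in int64")
    return n


def stream_column(blocks, try_32bit=False):
    """Infer the type from the first block only, then parse every block."""
    parser = parse_int64
    if try_32bit:
        try:
            for value in blocks[0]:
                parse_int32(value)
            parser = parse_int32  # first block fits, so commit to int32
        except ValueError:
            pass  # first block already needs 64 bits
    # Later blocks are parsed with the committed type; a miss raises.
    return [[parser(value) for value in block] for block in blocks]
```

With `blocks = [["1", "2"], ["3000000000"]]`, the default call succeeds, while `try_32bit=True` raises `ValueError` on the second block. That is the trade-off the comment asks callers to accept before enabling the option.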
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360301#comment-17360301 ]

Weston Pace commented on ARROW-13028:

The problem is that the miss may not be detected until some number of blocks have been processed. The file-based CSV reader handles this by going backwards through all the already-processed blocks and upcasting to the looser type, so it can be a non-trivial performance hit.

However, the streaming CSV reader (used by the datasets API) isn't so lenient. It infers the type based on the first block (default 1MB) of data alone. The complexity of doing otherwise is pretty significant. I think that could cause an issue here: if a value larger than 32 bits doesn't appear until after the first block, you will get parsing errors.
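The file-based behavior described above (widen the blocks already converted when a miss is found) can be sketched in Python with the standard `array` module. This is an illustration under assumptions, not Arrow's implementation: `read_with_upcast` is a hypothetical name, and typecode `"i"` is assumed to be a 32-bit C int, which holds on mainstream platforms.

```python
from array import array


def read_with_upcast(blocks):
    """Convert blocks as 32-bit ints, widening everything on the first miss.

    Returns (typecode, chunks, recast_count), where recast_count is how many
    already-converted chunks had to be rewritten as 64-bit.
    """
    typecode = "i"  # assumed 32-bit signed
    chunks = []
    recast_count = 0
    for block in blocks:
        values = [int(v) for v in block]
        try:
            chunks.append(array(typecode, values))
        except OverflowError:
            if typecode == "q":
                raise  # does not fit in 64 bits either
            # Miss: go back over earlier chunks and upcast them.
            typecode = "q"
            recast_count += len(chunks)
            chunks = [array("q", chunk.tolist()) for chunk in chunks]
            chunks.append(array("q", values))
    return typecode, chunks, recast_count
```

The later the oversized value appears, the more chunks are rewritten, which is the cost the comment refers to; a streaming reader that has already handed earlier chunks to its consumer cannot do this at all.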
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360302#comment-17360302 ]

Antoine Pitrou commented on ARROW-13028:

There would be a conversion from int32 to int64 indeed. The conversion is cheap compared to the actual CSV parsing, though, so it should be a minor concern.
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360295#comment-17360295 ]

Jonathan Keane commented on ARROW-13028:

Yeah, I tend to agree that if one needs or wants to manage that conversion, explicit column types are the way to go (and that interface has the benefit of also allowing one to control the types of other columns).

This is an empirical question (and the answer will almost certainly vary by the data), but what would a miss look like performance-wise when trying 32-bit and then having to change to 64-bit after the fact? That would involve some computation, correct? Or can we do that conversion for free, without rewriting the representation?
[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences
[ https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360291#comment-17360291 ]

Antoine Pitrou commented on ARROW-13028:

I'm unsure how much flexibility we want to add to CSV type inference. You can of course pass column types explicitly if you want to optimize memory footprint. [~npr] [~jonkeane] What do you think?