[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2022-09-27 Thread Todd Farmer (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17610149#comment-17610149
 ] 

Todd Farmer commented on ARROW-13028:
-

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2022-02-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489671#comment-17489671
 ] 

Antoine Pitrou commented on ARROW-13028:


Well, it's probably worth keeping open for now. But the solution should revolve 
around a more future-proof setting than the proposed boolean setting.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2022-02-09 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489594#comment-17489594
 ] 

Nate Clark commented on ARROW-13028:


[~apitrou] is it worth keeping this ticket open to discuss this further or 
should this be closed because there is no interest in implementing this 
behavior? 

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-10-13 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428226#comment-17428226
 ] 

Nate Clark commented on ARROW-13028:


[~apitrou] what would you envision as a more generic setting for influencing 
the type inference? Something like an enum to indicate only infer 64bit, 
attempt 32bit int, force 32bit float?

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-09-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420774#comment-17420774
 ] 

Antoine Pitrou commented on ARROW-13028:


I agree with Eduardo that it feels a bit opinionated. If we want to allow users 
to influence integer inference, at least a more general setting should probably 
be exposed than a simple boolean option to try 32-bit inference.

[~npr] [~icook] what do you think?

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-09-27 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420767#comment-17420767
 ] 

Nate Clark commented on ARROW-13028:


I agree that largest type could be considered safest especially for floating 
point.
In theory it could start at int8 and work from there is any interest in that, 
but signed vs unsigned is probably not as beneficial.
For floating point the detection is more difficult since it is already 
considered an imprecise format so parsers will force values to fit to the size 
and detection of double vs float would have to be done outside the parser.
I did put out the linked MR for int32 detection so that you can see at least 
that implemented.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-09-20 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417922#comment-17417922
 ] 

Eduardo Ponce commented on ARROW-13028:
---

I think that having CSV infer to largest type is more robust/safe and use 
explicit column types for other conversions.
If inference is set to be from smallest to largest, then where does these 
decisions end?
Do we infer first as signed or unsigned integers? Int8 vs. int32, etc? 
Half-float vs float vs double?
We can definitely decide to simply try signed int32 and float as the smallest 
integral type, but it stills feels a bit opinionated.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-08-11 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397395#comment-17397395
 ] 

Nate Clark commented on ARROW-13028:


>> Is this by design or accident?

> I would say by accident, but I'm not sure what you mean with "the precision 
> of the string is too much for a float". Strictly speaking, some very short 
> decimal numbers are not exactly representable in binary floating-point, for 
> example "0.3". Should we reject them?

I was thinking something like `3.78946546156984798497501e10` can be better 
represented as double than a float. But as you point out there are values which 
cannot be fully represented in either, so there might not be a good way to 
detect when double should be used instead of float.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-08-05 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394197#comment-17394197
 ] 

Antoine Pitrou commented on ARROW-13028:


Sorry for not answering earlier:

> Is this by design or accident?

I would say by accident, but I'm not sure what you mean with "the precision of 
the string is too much for a float". Strictly speaking, some very short decimal 
numbers are not exactly representable in binary floating-point, for example 
"0.3". Should we reject them?

> Is there any interest in adding this for at least ints? 

Potentially. [~npr] [~jonkeane] what do you think?


> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-08-05 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394176#comment-17394176
 ] 

Nate Clark commented on ARROW-13028:


Is there any interest in adding this for at least ints? If not I can just close 
this ticket.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360401#comment-17360401
 ] 

Nate Clark commented on ARROW-13028:


I was already working on this and discovered something which might be a problem 
with my idea. It looks like any numeric value can be parsed as a float even if 
the precision of the string is too much for a float. Because of this the float 
parse will succeed even if it is better to parse it as a double.

Is this by design or accident?

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Nate Clark (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360323#comment-17360323
 ] 

Nate Clark commented on ARROW-13028:


Ideally one would pass the column types if they are known but for my use case I 
am using the type inference of the reader to know what the types of the columns 
are. When relying on the reader to get the types the only way to get 32 bit 
values would be to re-parse the csv forcing the type to a 32bit value and if it 
isn't a 32 bit value it will fail.

 

It is true that if one of the later blocks did have a 64bit number that would 
cause a parsing error but the same would be true if the column was inferred as 
int but it was in fact a float or the column is empty 40% of the time and the 
first block happens to not have data for the column. This is more of a 
limitation that the schema is determined by the first block and cannot change 
after that.

 

One of the reasons that the default is to not try 32bit values is to avoid the 
potential parse errors on subsequent blocks so this should only really be used 
if the caller knows all numeric columns can be represented in 32 bit or can 
handle the parse error.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360301#comment-17360301
 ] 

Weston Pace commented on ARROW-13028:
-

The problem is that the miss may not be detected until some # of blocks have 
been processed.  The file-based CSV reader handles this by going backwards 
through all the already-processed blocks and upcasting to the looser type.  So 
it can be a non-trivial performance hit.  However, the streaming CSV (used by 
the datasets API) isn't so lenient.  It infers type based on the first block 
(default 1MB) of data alone.  The complexity of doing otherwise is pretty 
significant.  I think could cause an issue here.  If the large >32 bit value 
doesn't happen until after the first block you will get parsing errors.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360302#comment-17360302
 ] 

Antoine Pitrou commented on ARROW-13028:


There would be a conversion from int32 to int64 indeed. The conversion is cheap 
compared to the actual CSV parsing, so should be a minor concern, though.

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360295#comment-17360295
 ] 

Jonathan Keane commented on ARROW-13028:


Yeah, I tend to agree that if one needs to / wants to manage that conversion 
explicit column types is the way to go (and that interface has the benefit of 
also allowing one to control other types of other columns). 

This is an empirical question (and almost certainly vary by the data), but what 
would a miss look like performance wise for trying 32bit and then having to 
change to 64bit after the fact? That would involve some computation, correct? 
Or can we do that conversion for free / without rewriting the representation?

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13028) [C++] CSV add convert option to attempt 32bit number inferences

2021-06-09 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360291#comment-17360291
 ] 

Antoine Pitrou commented on ARROW-13028:


I'm unsure how much flexibility we want to add to CSV type inference. You can 
of course pass column types explicitly if you want to optimize memory footprint.

[~npr]  [~jonkeane] What do you think?

> [C++] CSV add convert option to attempt 32bit number inferences
> ---
>
> Key: ARROW-13028
> URL: https://issues.apache.org/jira/browse/ARROW-13028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Nate Clark
>Assignee: Nate Clark
>Priority: Major
>
> When types are being inferred by CSV the numbers are always 64 bit. For large 
> data sets it could be better to use 32 bit types to save over all memory. To 
> do this it would be useful to add an option to ConvertOptions to try 32 bit 
> numbers before 64 bit. By default this option would be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)