[jira] [Assigned] (ARROW-18395) [C++] Move select-k implementation into separate module

2022-11-25 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-18395:
---

Assignee: Ben Harkins

> [C++] Move select-k implementation into separate module
> ---
>
> Key: ARROW-18395
> URL: https://issues.apache.org/jira/browse/ARROW-18395
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Ben Harkins
>Priority: Minor
>  Labels: good-second-issue
>
> The select-k kernel implementations are currently in {{vector_sort.cc}}, 
> amongst other things.
> To make the code more readable and faster to compile, we should move them 
> into their own file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18184) [C++] Improve JSON parser benchmarks

2022-10-27 Thread Ben Harkins (Jira)
Ben Harkins created ARROW-18184:
---

 Summary: [C++] Improve JSON parser benchmarks
 Key: ARROW-18184
 URL: https://issues.apache.org/jira/browse/ARROW-18184
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Ben Harkins
Assignee: Ben Harkins


The current JSON parser benchmark suite is fairly limited, as it only really 
tests objects with a couple of non-varying fields. To properly measure 
optimizations based on input predictability (i.e. 
[ARROW-4709|https://issues.apache.org/jira/browse/ARROW-4709]) it would be 
beneficial to provide a parameterized way to create schemas with an arbitrary 
number of fields and add benchmarks for input with randomly ordered/omitted 
fields.
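
As a rough illustration of the kind of input described above, here is a minimal Python sketch that generates newline-delimited JSON rows for a schema of arbitrary width, with optional shuffling and omission of fields. The names make_field_names and make_rows, the f0/f1/... naming scheme, and the integer values are all assumptions for illustration; the real benchmarks would live in the C++ suite.

```python
import json
import random


def make_field_names(num_fields):
    """Return hypothetical field names f0, f1, ... for a schema of the given width."""
    return [f"f{i}" for i in range(num_fields)]


def make_rows(num_rows, num_fields, shuffle=False, omit_probability=0.0, seed=0):
    """Yield newline-delimited JSON rows with optionally shuffled or omitted fields."""
    rng = random.Random(seed)  # fixed seed keeps the generated input reproducible
    names = make_field_names(num_fields)
    for row_index in range(num_rows):
        # Drop each field independently with the given probability.
        present = [n for n in names if rng.random() >= omit_probability]
        if shuffle:
            rng.shuffle(present)
        yield json.dumps({name: row_index for name in present})


# Predictable input: every field present, always in schema order.
ordered = list(make_rows(3, 4))
# Unpredictable input: fields shuffled and randomly omitted.
noisy = list(make_rows(1000, 8, shuffle=True, omit_probability=0.25))
```

Comparing parse times on the two inputs would show how much an optimization depends on field order being predictable.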





[jira] [Commented] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621277#comment-17621277
 ] 

Ben Harkins commented on ARROW-18106:
-

That is indeed unexpected... especially since it comes back as a plain string 
in the first case. I suspect it's an issue with timestamps specifically (or 
potentially any non-string type with a json string representation). Test 
coverage seems to be lacking in this area.

I'll take a look at it.

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a table with a string column 
> as the result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> This might be specific to timestamps; I don't directly see a similar 
> issue with e.g. {{"column": "A"}} and setting the schema to "column" being 
> int64.
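
The behavior the report expects can be modeled in a few lines of plain Python. This is only a sketch of the intended semantics, not pyarrow's implementation: convert_row and parse_seconds_timestamp are hypothetical helpers, and the schema is represented as a dict from field name to converter function. Fields named in the explicit schema must convert or raise; only fields outside it are subject to unexpected_field_behavior.

```python
from datetime import datetime


def parse_seconds_timestamp(text):
    """Parse an ISO 8601 timestamp with whole-second resolution; raise ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def convert_row(row, explicit_schema, unexpected_field_behavior="infer"):
    """Convert one decoded JSON object the way the report expects the reader to behave."""
    out = {}
    for name, value in row.items():
        if name in explicit_schema:
            # Declared field: conversion errors propagate, with no fallback to inference.
            out[name] = explicit_schema[name](value)
        elif unexpected_field_behavior == "infer":
            out[name] = value  # keep whatever type the JSON decoder produced
        elif unexpected_field_behavior == "error":
            raise KeyError(f"unexpected field: {name}")
        # "ignore": drop the undeclared field silently
    return out


schema = {"column": parse_seconds_timestamp}
ok = convert_row({"column": "2022-09-05T08:08:46", "extra": 1}, schema)
```

Under this model, passing the value with milliseconds from the report raises ValueError regardless of the unexpected_field_behavior setting, which is the outcome the reporter would have expected.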





[jira] [Assigned] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

2022-10-20 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-18106:
---

Assignee: Ben Harkins

> [C++] JSON reader ignores explicit schema with default 
> unexpected_field_behavior="infer"
> 
>
> Key: ARROW-18106
> URL: https://issues.apache.org/jira/browse/ARROW-18106
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Harkins
>Priority: Major
>  Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay 
> between two options.
> By default, when reading json, we _infer_ the data type of columns, and when 
> specifying an explicit schema, we _also_ by default infer the type of columns 
> that are not specified in the explicit schema. The docs for 
> {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the 
> columns fails according to that schema, we still fall back to this default of 
> inferring the data type (while I would have expected an error, since we 
> should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because there are milliseconds and the type is "s", 
> but the explicit schema is ignored, and we get a table with a string column 
> as the result:
> {code}
> pyarrow.Table
> column: string
> 
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the 
> expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", 
> pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't 
> parse:2022-09-05T08:08:46.000
> {code}
> This might be specific to timestamps; I don't directly see a similar 
> issue with e.g. {{"column": "A"}} and setting the schema to "column" being 
> int64.





[jira] [Assigned] (ARROW-15822) [C++] Cast duration to string (thus CSV writing) not supported

2022-10-14 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-15822:
---

Assignee: Ben Harkins

> [C++] Cast duration to string (thus CSV writing) not supported
> --
>
> Key: ARROW-15822
> URL: https://issues.apache.org/jira/browse/ARROW-15822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 7.0.0, 7.0.2
>Reporter: Carl Boettiger
>Assignee: Ben Harkins
>Priority: Critical
>
> Edit (Dragos Moldovan-Grünfeld): The issue I opened (ARROW-15833) is 
> basically a duplicate of this. It's fundamentally a C++ issue that happened 
> to surface in the R CSV writer. I hope you don't mind, I modified the 
> components to C++
> ===
> Consider this reprex:
> {code:java}
> arrow::write_csv_arrow(data.frame(time = as.difftime(1, units="secs")), 
> "test.csv"){code}
> This errors with:
> Error: NotImplemented: Unsupported cast from duration[s] to utf8 using 
> function cast_string
>  
> Note that readr::write_csv() has no trouble with this (which renders the data 
> as "1" without a unit).  Arguably the readr rendering is lossy, but then we 
> usually assume units are provided in other metadata anyway.





[jira] [Comment Edited] (ARROW-17937) [C++] Building of Arrow C++ (dataset) errors on Windows

2022-10-13 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617452#comment-17617452
 ] 

Ben Harkins edited comment on ARROW-17937 at 10/14/22 4:25 AM:
---

Unfortunately, I haven't been able to reproduce this, but FWIW, it looks like 
{{-DARROW_DS_STATIC}} isn't being forwarded to the compiler for some reason. 
Normally, there wouldn't be a linkage discrepancy. See: 
([1|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/visibility.h#L30])

For reference, my output looks like this:
{code:java}
C:\PROGRA~1\MICROS~2\2022\COMMUN~1\VC\Tools\MSVC\1433~1.316\bin\Hostx64\x64\cl.exe
  /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC 
-DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
-DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
-DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
-DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\Users\Ben\Dev\arrow\cpp\build\src 
-IC:\Users\Ben\Dev\arrow\cpp\src -IC:\Users\Ben\Dev\arrow\cpp\src\generated 
-IC:\Users\Ben\Dev\arrow\cpp\src\parquet 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\flatbuffers\include 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\hadoop\include 
-external:IC:\Users\Ben\miniconda3\envs\pyarrow-dev\Library\include 
-external:W0 /DWIN32 /D_WINDOWS  /GR /EHsc 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc /wd5105 /bigobj /utf-8 /W3 
/wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 /DNDEBUG -std:c++17 /showIncludes 
/Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj{code}


was (Author: JIRAUSER295145):
Unfortunately, I haven't been able to reproduce this, but FWIW, it looks like 
{{-DARROW_DS_STATIC}} isn't being forwarded to the compiler for some reason. 
Normally, there wouldn't be a linkage discrepancy. See: 
[[1|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/visibility.h#L30]]

For reference, my output looks like this:
{code:java}
C:\PROGRA~1\MICROS~2\2022\COMMUN~1\VC\Tools\MSVC\1433~1.316\bin\Hostx64\x64\cl.exe
  /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC 
-DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
-DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
-DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
-DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\Users\Ben\Dev\arrow\cpp\build\src 
-IC:\Users\Ben\Dev\arrow\cpp\src -IC:\Users\Ben\Dev\arrow\cpp\src\generated 
-IC:\Users\Ben\Dev\arrow\cpp\src\parquet 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\flatbuffers\include 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\hadoop\include 
-external:IC:\Users\Ben\miniconda3\envs\pyarrow-dev\Library\include 
-external:W0 /DWIN32 /D_WINDOWS  /GR /EHsc 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc /wd5105 /bigobj /utf-8 /W3 
/wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 /DNDEBUG -std:c++17 /showIncludes 
/Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj{code}

> [C++] Building of Arrow C++ (dataset) errors on Windows
> ---
>
> Key: ARROW-17937
> URL: https://issues.apache.org/jira/browse/ARROW-17937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Major
>
> Building of Arrow C++ fails for me on Windows if I keep static build on by 
> default and works with ARROW_STATIC=OFF:
> {code:java}
> (pyarrow-dev310) C:\Users\Alenka\repos\arrow\cpp\build>cmake --build . 
> --target install --config Release[482/590] Building CXX object 
> src\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.objFAILED: 
> src/arrow/dataset/CMakeFiles/arrow_dataset_static.dir/discovery.cc.objC:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\cl.exe
>   /nologo /TP -DARROW_DS_EXPORTING -DARROW_FLIGHT_SQL_STATIC 
> -DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
> -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
> -DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
> -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
> -DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
> 

[jira] [Comment Edited] (ARROW-17937) [C++] Building of Arrow C++ (dataset) errors on Windows

2022-10-13 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617452#comment-17617452
 ] 

Ben Harkins edited comment on ARROW-17937 at 10/14/22 4:22 AM:
---

Unfortunately, I haven't been able to reproduce this, but FWIW, it looks like 
{{-DARROW_DS_STATIC}} isn't being forwarded to the compiler for some reason. 
Normally, there wouldn't be a linkage discrepancy. See: 
[[1|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/visibility.h#L30]]

For reference, my output looks like this:
{code:java}
C:\PROGRA~1\MICROS~2\2022\COMMUN~1\VC\Tools\MSVC\1433~1.316\bin\Hostx64\x64\cl.exe
  /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC 
-DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
-DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
-DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
-DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\Users\Ben\Dev\arrow\cpp\build\src 
-IC:\Users\Ben\Dev\arrow\cpp\src -IC:\Users\Ben\Dev\arrow\cpp\src\generated 
-IC:\Users\Ben\Dev\arrow\cpp\src\parquet 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\flatbuffers\include 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\hadoop\include 
-external:IC:\Users\Ben\miniconda3\envs\pyarrow-dev\Library\include 
-external:W0 /DWIN32 /D_WINDOWS  /GR /EHsc 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc /wd5105 /bigobj /utf-8 /W3 
/wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 /DNDEBUG -std:c++17 /showIncludes 
/Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj{code}


was (Author: JIRAUSER295145):
Unfortunately, I haven't been able to reproduce this, but FWIW, it looks like 
{{-DARROW_DS_STATIC}} isn't being forwarded to the compiler for some reason. 
Normally, there wouldn't be a linkage discrepancy 
[[https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/visibility.h#L30]].

For reference, my output looks like this:
{code:java}
C:\PROGRA~1\MICROS~2\2022\COMMUN~1\VC\Tools\MSVC\1433~1.316\bin\Hostx64\x64\cl.exe
  /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC 
-DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
-DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
-DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
-DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\Users\Ben\Dev\arrow\cpp\build\src 
-IC:\Users\Ben\Dev\arrow\cpp\src -IC:\Users\Ben\Dev\arrow\cpp\src\generated 
-IC:\Users\Ben\Dev\arrow\cpp\src\parquet 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\flatbuffers\include 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\hadoop\include 
-external:IC:\Users\Ben\miniconda3\envs\pyarrow-dev\Library\include 
-external:W0 /DWIN32 /D_WINDOWS  /GR /EHsc 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc /wd5105 /bigobj /utf-8 /W3 
/wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 /DNDEBUG -std:c++17 /showIncludes 
/Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj{code}

> [C++] Building of Arrow C++ (dataset) errors on Windows
> ---
>
> Key: ARROW-17937
> URL: https://issues.apache.org/jira/browse/ARROW-17937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Major
>
> Building of Arrow C++ fails for me on Windows if I keep static build on by 
> default and works with ARROW_STATIC=OFF:
> {code:java}
> (pyarrow-dev310) C:\Users\Alenka\repos\arrow\cpp\build>cmake --build . 
> --target install --config Release[482/590] Building CXX object 
> src\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.objFAILED: 
> src/arrow/dataset/CMakeFiles/arrow_dataset_static.dir/discovery.cc.objC:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\cl.exe
>   /nologo /TP -DARROW_DS_EXPORTING -DARROW_FLIGHT_SQL_STATIC 
> -DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
> -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
> -DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
> -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
> -DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
> 

[jira] [Commented] (ARROW-17937) [C++] Building of Arrow C++ (dataset) errors on Windows

2022-10-13 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617452#comment-17617452
 ] 

Ben Harkins commented on ARROW-17937:
-

Unfortunately, I haven't been able to reproduce this, but FWIW, it looks like 
{{-DARROW_DS_STATIC}} isn't being forwarded to the compiler for some reason. 
Normally, there wouldn't be a linkage discrepancy 
[[https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/visibility.h#L30]].

For reference, my output looks like this:
{code:java}
C:\PROGRA~1\MICROS~2\2022\COMMUN~1\VC\Tools\MSVC\1433~1.316\bin\Hostx64\x64\cl.exe
  /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC 
-DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
-DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
-DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
-DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
-DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
-DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE -IC:\Users\Ben\Dev\arrow\cpp\build\src 
-IC:\Users\Ben\Dev\arrow\cpp\src -IC:\Users\Ben\Dev\arrow\cpp\src\generated 
-IC:\Users\Ben\Dev\arrow\cpp\src\parquet 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\flatbuffers\include 
-external:IC:\Users\Ben\Dev\arrow\cpp\thirdparty\hadoop\include 
-external:IC:\Users\Ben\miniconda3\envs\pyarrow-dev\Library\include 
-external:W0 /DWIN32 /D_WINDOWS  /GR /EHsc 
/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc /wd5105 /bigobj /utf-8 /W3 
/wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 /DNDEBUG -std:c++17 /showIncludes 
/Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj{code}
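
One way to spot a define that is not being forwarded is to diff the macro definitions of a working and a failing compile command. The sketch below is a hypothetical helper, not part of Arrow's build tooling, and the two command strings are heavily abbreviated stand-ins for the commands in this thread. It assumes macros are passed as -DNAME or /DNAME tokens separated by whitespace.

```python
def extract_defines(command):
    """Return the set of macro names passed as -DNAME or /DNAME in a compiler command line."""
    defines = set()
    for token in command.split():
        if token.startswith(("-D", "/D")) and len(token) > 2:
            # Keep only the macro name, dropping any =value suffix.
            defines.add(token[2:].split("=", 1)[0])
    return defines


# Abbreviated stand-ins for the working and failing cl.exe invocations.
working = "cl.exe /nologo /TP -DARROW_DS_EXPORTING -DARROW_DS_STATIC -DARROW_FLIGHT_SQL_STATIC -DARROW_STATIC /DWIN32"
failing = "cl.exe /nologo /TP -DARROW_DS_EXPORTING -DARROW_FLIGHT_SQL_STATIC -DARROW_STATIC /DWIN32"

# Macros defined in the working build but absent from the failing one.
missing = extract_defines(working) - extract_defines(failing)
```

For these two strings the difference is the single macro ARROW_DS_STATIC, matching the observation in the comment above.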

> [C++] Building of Arrow C++ (dataset) errors on Windows
> ---
>
> Key: ARROW-17937
> URL: https://issues.apache.org/jira/browse/ARROW-17937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Major
>
> Building of Arrow C++ fails for me on Windows if I keep static build on by 
> default and works with ARROW_STATIC=OFF:
> {code:java}
> (pyarrow-dev310) C:\Users\Alenka\repos\arrow\cpp\build>cmake --build . 
> --target install --config Release[482/590] Building CXX object 
> src\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.objFAILED: 
> src/arrow/dataset/CMakeFiles/arrow_dataset_static.dir/discovery.cc.objC:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\cl.exe
>   /nologo /TP -DARROW_DS_EXPORTING -DARROW_FLIGHT_SQL_STATIC 
> -DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 
> -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 
> -DARROW_HDFS -DARROW_STATIC -DARROW_WITH_LZ4 -DARROW_WITH_RE2 
> -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_DYN_LINK -DBOOST_ALL_NO_LIB 
> -DPARQUET_STATIC -DURI_STATIC_BUILD -D_CRT_SECURE_NO_WARNINGS 
> -D_ENABLE_EXTENDED_ALIGNED_STORAGE 
> -IC:\Users\Alenka\repos\arrow\cpp\build\src 
> -IC:\Users\Alenka\repos\arrow\cpp\src 
> -IC:\Users\Alenka\repos\arrow\cpp\src\generated 
> -IC:\Users\Alenka\repos\arrow\cpp\src\parquet 
> -IC:\Users\Alenka\repos\arrow\cpp\thirdparty\flatbuffers\include 
> -IC:\Users\Alenka\repos\arrow\cpp\thirdparty\hadoop\include 
> -IC:\Users\Alenka\anaconda3\envs\pyarrow-dev310\Library\include /DWIN32 
> /D_WINDOWS  /GR /EHsc /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING   /EHsc 
> /wd5105 /bigobj /utf-8 /W3 /wd4800 /wd4996 /wd4065  /WX /MP /MD /O2 /Ob2 
> /DNDEBUG /showIncludes 
> /Fosrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\discovery.cc.obj 
> /Fdsrc\arrow\dataset\CMakeFiles\arrow_dataset_static.dir\arrow_dataset_static.pdb
>  /FS -c 
> C:\Users\Alenka\repos\arrow\cpp\src\arrow\dataset\discovery.ccC:\Users\Alenka\repos\arrow\cpp\src\arrow/dataset/scanner.h(427):
>  error C2220: warning treated as error - no 'object' file 
> generatedC:\Users\Alenka\repos\arrow\cpp\src\arrow/dataset/scanner.h(427): 
> warning C4275: non dll-interface class 'arrow::compute::ExecNodeOptions' used 
> as base for dll-interface class 
> 'arrow::dataset::ScanNodeOptions'C:\Users\Alenka\repos\arrow\cpp\src\arrow/compute/exec/options.h(42):
>  note: see declaration of 
> 'arrow::compute::ExecNodeOptions'C:\Users\Alenka\repos\arrow\cpp\src\arrow/dataset/scanner.h(427):
>  note: see declaration of 
> 'arrow::dataset::ScanNodeOptions'C:\Users\Alenka\repos\arrow\cpp\src\arrow/dataset/file_base.h(422):
>  warning C4275: non dll-interface class 'arrow::compute::ExecNodeOptions' 
> used as base for dll-interface class 
> 

[jira] [Assigned] (ARROW-17930) [CI][C++] Valgrind failure in PrintValue

2022-10-04 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-17930:
---

Assignee: Ben Harkins  (was: Weston Pace)

> [CI][C++] Valgrind failure in PrintValue
> ---
>
> Key: ARROW-17930
> URL: https://issues.apache.org/jira/browse/ARROW-17930
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Ben Harkins
>Priority: Blocker
> Fix For: 10.0.0
>
>
> See 
> https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=36513=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=3863
> {code}
> ==19602== Use of uninitialised value of size 8
> ==19602==at 0x68277E1: _itoa_word (_itoa.c:180)
> ==19602==by 0x682AEDD: vfprintf (vfprintf.c:1642)
> ==19602==by 0x68578AF: vsnprintf (vsnprintf.c:114)
> ==19602==by 0x6833F9E: snprintf (snprintf.c:33)
> ==19602==by 0x4C3E46D: testing::(anonymous 
> namespace)::PrintByteSegmentInObjectTo(unsigned char const*, unsigned long, 
> unsigned long, std::ostream*) (gtest-printers.cc:82)
> ==19602==by 0x4C3E50D: testing::(anonymous 
> namespace)::PrintBytesInObjectToImpl(unsigned char const*, unsigned long, 
> std::ostream*) (gtest-printers.cc:99)
> ==19602==by 0x4C3E5B1: testing::internal::PrintBytesInObjectTo(unsigned 
> char const*, unsigned long, std::ostream*) (gtest-printers.cc:131)
> ==19602==by 0x1EB2B2: PrintValue 
> (gtest-printers.h:270)
> [ etc. ]
> {code}
> This is probably trivial to fix but needs fixing nevertheless :-)





[jira] [Created] (ARROW-17932) [C++] Implement streaming RecordBatchReader for JSON

2022-10-04 Thread Ben Harkins (Jira)
Ben Harkins created ARROW-17932:
---

 Summary: [C++] Implement streaming RecordBatchReader for JSON
 Key: ARROW-17932
 URL: https://issues.apache.org/jira/browse/ARROW-17932
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Ben Harkins
Assignee: Ben Harkins


We don't currently support incremental RecordBatch reading from JSON streams, 
which is needed to properly implement JSON support in Dataset. The existing CSV 
StreamingReader API can be used as a model.
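
To make "incremental reading" concrete, here is a minimal Python sketch of the idea using a generator over newline-delimited JSON. It is not Arrow's API and does not mirror the CSV StreamingReader interface; read_json_batches and its batch_size parameter are assumptions for illustration, and a batch is just a list of decoded rows rather than a RecordBatch.

```python
import io
import json


def read_json_batches(stream, batch_size):
    """Yield lists of decoded rows from a newline-delimited JSON stream, batch_size rows at a time."""
    batch = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


source = io.StringIO('{"a": 1}\n{"a": 2}\n\n{"a": 3}\n')
batches = list(read_json_batches(source, batch_size=2))
```

Because the generator only holds one batch in memory at a time, a consumer such as a dataset scan can start processing before the whole input has been read.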





[jira] [Assigned] (ARROW-15075) [C++][Dataset] Implement Dataset for JSON format

2022-09-26 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-15075:
---

Assignee: Ben Harkins

> [C++][Dataset] Implement Dataset for JSON format
> 
>
> Key: ARROW-15075
> URL: https://issues.apache.org/jira/browse/ARROW-15075
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Will Jones
>Assignee: Ben Harkins
>Priority: Major
>  Labels: dataset
>
> We already have support for reading individual files, but not yet for reading 
> datasets. 





[jira] [Assigned] (ARROW-4709) [C++] Optimize for ordered JSON fields

2022-09-06 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-4709:
--

Assignee: Ben Harkins

> [C++] Optimize for ordered JSON fields
> --
>
> Key: ARROW-4709
> URL: https://issues.apache.org/jira/browse/ARROW-4709
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Minor
>  Labels: good-second-issue
>
> Fields appear consistently ordered in most JSON data in the wild, but the 
> JSON parser currently looks fields up in a hash table. The ordering can 
> probably be exploited to yield better performance when looking up field 
> indices.
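
One possible shape for such an optimization, sketched in Python: guess that the next field is the one following the previous field in schema order, and fall back to the hash table only when the guess is wrong. The FieldIndexLookup class and its fast_hits counter are hypothetical and only illustrate the control flow; in Python the guess is not actually cheaper than a dict lookup, so this says nothing about real performance in the C++ parser.

```python
class FieldIndexLookup:
    """Map field names to indices, guessing that fields arrive in schema order."""

    def __init__(self, field_names):
        self._names = list(field_names)
        self._index = {name: i for i, name in enumerate(self._names)}
        self._next = 0  # index we expect the next field to have
        self.fast_hits = 0  # lookups answered without touching the hash table

    def start_row(self):
        """Reset the guess at the beginning of each JSON object."""
        self._next = 0

    def lookup(self, name):
        """Return the index of name, or -1 if it is not a known field."""
        if self._next < len(self._names) and self._names[self._next] == name:
            index = self._next
            self.fast_hits += 1
        else:
            index = self._index.get(name, -1)  # fall back to hashing
        if index >= 0:
            self._next = index + 1
        return index


lookup = FieldIndexLookup(["id", "name", "score"])
lookup.start_row()
in_order = [lookup.lookup(n) for n in ("id", "name", "score")]
lookup.start_row()
out_of_order = [lookup.lookup(n) for n in ("score", "id", "unknown")]
```

Rows whose fields arrive in schema order never reach the hash table, while shuffled rows and unknown fields still resolve correctly through the fallback.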





[jira] [Assigned] (ARROW-16226) [C++] Add better coverage for filesystem tell.

2022-09-04 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-16226:
---

Assignee: Ben Harkins

> [C++] Add better coverage for filesystem tell.
> --
>
> Key: ARROW-16226
> URL: https://issues.apache.org/jira/browse/ARROW-16226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> Add a generic C++ filesystem test that writes N bytes to a file, then 
> seeks to N/2 and reads the remainder. Verify that the remainder is N/2 bytes 
> long and matches the bytes written.
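
The shape of the requested test can be sketched with Python's built-in file objects on the local filesystem. This is only an illustration of the steps, not the Arrow filesystem API; check_seek_and_tell is a hypothetical helper, and the tell() checks are an assumption based on the issue title.

```python
import os
import tempfile


def check_seek_and_tell(num_bytes):
    """Write num_bytes, seek to the midpoint, and return (payload, tell after seek, remainder, tell at end)."""
    payload = bytes(i % 256 for i in range(num_bytes))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as sink:
            sink.write(payload)
        with open(path, "rb") as source:
            source.seek(num_bytes // 2)
            position = source.tell()  # should report the midpoint
            remainder = source.read()  # everything from the midpoint to EOF
            end = source.tell()  # should report the file size
    return payload, position, remainder, end


payload, position, remainder, end = check_seek_and_tell(1000)
```

A generic filesystem test would run the same sequence against each filesystem implementation and assert on the position, the remainder length, and the remainder contents.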





[jira] [Commented] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread Ben Harkins (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599617#comment-17599617
 ] 

Ben Harkins commented on ARROW-6772:


I'm currently working on this one - planning on adding 
{{util::EqualityComparable}} to {{DataType}}, {{Field}}, and 
{{FieldRef}}. Should additional comparison tests be added (in addition to 
AssertXXXEqual) to type_test.cc or would that be considered redundant?

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GoogleTest usage and will 
> allow more informative assertion failure messages.
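
The general pattern, deriving the equality operator from an existing equals-style method through a shared mixin, can be sketched in Python as follows. This is an analogy only: the classes below are hypothetical stand-ins and do not reproduce Arrow's util::EqualityComparable, DataType, or Field.

```python
class EqualityComparable:
    """Mixin that derives == (and, implicitly in Python 3, !=) from a subclass's equals() method."""

    def __eq__(self, other):
        if not isinstance(other, type(self)):
            return NotImplemented  # let Python fall back to its default comparison
        return self.equals(other)


class Field(EqualityComparable):
    """Hypothetical stand-in for a schema field; not Arrow's Field class."""

    def __init__(self, name, type_name):
        self.name = name
        self.type_name = type_name

    def equals(self, other):
        return (self.name, self.type_name) == (other.name, other.type_name)

    def __repr__(self):
        return f"Field({self.name!r}, {self.type_name!r})"


same = Field("a", "int64") == Field("a", "int64")
different = Field("a", "int64") == Field("a", "utf8")
```

With the operator in place, a test can compare two objects directly, and a failing assertion can print both operands through their repr instead of reporting only that a boolean was false.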





[jira] [Assigned] (ARROW-6772) [C++] Add operator== for interfaces with an Equals() method

2022-09-02 Thread Ben Harkins (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Harkins reassigned ARROW-6772:
--

Assignee: Ben Harkins

> [C++] Add operator== for interfaces with an Equals() method
> ---
>
> Key: ARROW-6772
> URL: https://issues.apache.org/jira/browse/ARROW-6772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Harkins
>Priority: Major
>  Labels: good-first-issue
>
> A common pattern in tests is {{ASSERT_TRUE(schm->Equals(*other))}}. The 
> addition of overloaded equality operators will allow this to be written 
> {{ASSERT_EQ(*schm, *other)}}, which is more idiomatic GoogleTest usage and will 
> allow more informative assertion failure messages.


