[ https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250682#comment-16250682 ]
ASF GitHub Bot commented on ARROW-1693:
---------------------------------------
trxcllnt commented on a change in pull request #1294: ARROW-1693: [JS] Fix
reading C++ dictionary-encoded vectors
URL: https://github.com/apache/arrow/pull/1294#discussion_r150721453
##########
File path: js/test/arrows/json/datetime.json
##########
@@ -0,0 +1,1091 @@
+{
Review comment:
@wesm yes, that would be ideal.
<details><summary>I generated the JSON from the python integration tests
like this (click to expand)</summary>
```python
from integration_test import generate_nested_case
from integration_test import generate_decimal_case
from integration_test import generate_datetime_case
from integration_test import generate_primitive_case
from integration_test import generate_dictionary_case
generate_nested_case().write("../js/test/arrows/json/nested.json")
generate_decimal_case().write("../js/test/arrows/json/decimal.json")
generate_datetime_case().write("../js/test/arrows/json/datetime.json")
generate_dictionary_case().write("../js/test/arrows/json/dictionary.json")
generate_primitive_case([7, 10]).write("../js/test/arrows/json/primitive.json")
generate_primitive_case([0, 0, 0]).write("../js/test/arrows/json/primitive-empty.json")
```
</details>
<details><summary>Then I had to manually edit the "primitive.json" file to
remove the "binary_nullable" and "binary_nonnullable" columns, because the C++
command fails if they're present (click to expand)
</summary>
```sh
$ ../cpp/build/release/json-integration-test \
--integration --mode=JSON_TO_ARROW \
--json=./test/arrows/json/primitive.json \
--arrow=./test/arrows/cpp/file/primitive.arrow
Found schema: bool_nullable: bool
bool_nonnullable: bool not null
int8_nullable: int8
int8_nonnullable: int8 not null
int16_nullable: int16
int16_nonnullable: int16 not null
int32_nullable: int32
int32_nonnullable: int32 not null
int64_nullable: int64
int64_nonnullable: int64 not null
uint8_nullable: uint8
uint8_nonnullable: uint8 not null
uint16_nullable: uint16
uint16_nonnullable: uint16 not null
uint32_nullable: uint32
uint32_nonnullable: uint32 not null
uint64_nullable: uint64
uint64_nonnullable: uint64 not null
float32_nullable: float
float32_nonnullable: float not null
float64_nullable: double
float64_nonnullable: double not null
binary_nullable: binary
binary_nonnullable: binary not null
utf8_nullable: string
utf8_nonnullable: string not null
Error message: Invalid: Encountered non-hex digit
```
</details>
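For reference, here's a rough sketch of that manual edit as a script, in case we ever want to automate it. It assumes the integration JSON layout of a top-level `schema.fields` list plus per-batch `columns` entries keyed by `name`; the path is the same one used above.
```python
# Hypothetical helper to drop the binary columns from the generated
# integration JSON so the C++ JSON_TO_ARROW conversion succeeds.
# Assumes the integration JSON shape: {"schema": {"fields": [...]},
# "batches": [{"count": N, "columns": [{"name": ..., ...}, ...]}]}.
import json

DROP = {"binary_nullable", "binary_nonnullable"}
PATH = "../js/test/arrows/json/primitive.json"

with open(PATH) as f:
    doc = json.load(f)

# Remove the binary fields from the schema...
doc["schema"]["fields"] = [
    field for field in doc["schema"]["fields"] if field["name"] not in DROP
]
# ...and the corresponding columns from every record batch.
for batch in doc.get("batches", []):
    batch["columns"] = [
        col for col in batch["columns"] if col["name"] not in DROP
    ]

with open(PATH, "w") as f:
    json.dump(doc, f)
```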
The unit tests rely heavily on [snapshot
testing](https://facebook.github.io/jest/docs/en/snapshot-testing.html) to
validate the actual values in the vectors. I manually validated the data in the
snapshots against the buffers using pyarrow and pandas, but that approach won't
scale. Typically the snapshot files would get checked into version control, but
now that we have 11k snapshots, they add up to around 21MB, and I removed them
from the repo because we don't want to commit huge files. Instead, the CI server
generates the snapshots once up front, then validates the compilation targets
against those.
This will catch any cases where compiling the JS to different targets leads
to failures (e.g. if the minifiers mangle names they weren't supposed to), but
since we're not checking in the snapshot files, the CI server won't be able to
tell us if a new PR causes a snapshot test to break. We _can_ know that if we
run the tests locally, but we can't rely on someone running them locally for
every PR before merging.
I'm a bit torn here. On the one hand, I don't want to check 21MB worth of tests
into source control. On the other hand, I don't want to hand-write the 11k
assertions the snapshot tests represent (which would presumably amount to many
MBs of test code anyway).
I believe git compresses files across the network, so the transfer cost
shouldn't be too bad. If space on disk is an issue, I could add a post-clone
script that automatically compresses the snapshot files after checkout (they're
about 3MB gzipped). Jest doesn't work with compressed snapshot files out of the
box, but I could add a step to the test runner that decompresses the snapshots
before running.
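Here's a minimal sketch of what those two steps could look like, assuming Jest's default `__snapshots__/*.snap` layout under a `test/` root; the helper names are illustrative, not existing scripts.
```python
# Illustrative post-clone / pre-test helpers for gzipping and restoring
# Jest snapshot files. Assumes snapshots live under test/**/__snapshots__/.
import gzip
import shutil
from pathlib import Path

SNAPSHOT_ROOT = Path("test")

def compress_snapshots():
    """Replace each .snap file with a gzipped copy (post-clone / post-test)."""
    for snap in SNAPSHOT_ROOT.rglob("__snapshots__/*.snap"):
        gz = snap.parent / (snap.name + ".gz")
        with open(snap, "rb") as src, gzip.open(gz, "wb") as dst:
            shutil.copyfileobj(src, dst)
        snap.unlink()

def decompress_snapshots():
    """Restore .snap files from their gzipped copies before running jest."""
    for gz in SNAPSHOT_ROOT.rglob("__snapshots__/*.snap.gz"):
        with gzip.open(gz, "rb") as src, open(gz.with_suffix(""), "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz.unlink()
```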
To your point about using the C++/Java writers to convert the JSON to Arrow
buffers on the fly, we should 100% do that. This PR is a marginal improvement in
that we can at least regenerate the Arrow files easily, but ideally we wouldn't
keep them in the repo at all: we'd pipe them to the node process on the fly, or
at a minimum, write them to temporary files and clean up afterwards. We'll want
a mode for local dev
that skips this step, as incurring the JVM overhead to convert JSON to Arrow
files is painful for debugging.
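A rough sketch of what that conversion step could look like, reusing the C++ and Java commands quoted in this thread; the env variable names are placeholders, not existing configuration.
```python
# Sketch of converting an integration JSON file to Arrow files on the fly,
# using the same C++ and Java tools invoked earlier in this thread. The
# ARROW_CPP_EXE / ARROW_JAVA_JAR env variables are hypothetical knobs so CI
# and local dev can point at their own builds.
import os
import subprocess
import tempfile

CPP_EXE = os.environ.get(
    "ARROW_CPP_EXE", "../cpp/build/release/json-integration-test")
JAVA_JAR = os.environ.get(
    "ARROW_JAVA_JAR",
    "../java/tools/target/arrow-tools-0.8.0-SNAPSHOT-jar-with-dependencies.jar")

def json_to_arrow(json_path):
    """Convert one integration JSON file with both implementations and
    return the paths of the temporary .arrow files (caller cleans up)."""
    def tmp_arrow():
        fd, path = tempfile.mkstemp(suffix=".arrow")
        os.close(fd)
        return path

    cpp_out, java_out = tmp_arrow(), tmp_arrow()
    # C++ converter, same flags as the command shown above.
    subprocess.check_call([
        CPP_EXE, "--integration", "--mode=JSON_TO_ARROW",
        "--json=" + json_path, "--arrow=" + cpp_out])
    # Java converter, same invocation as in the issue description below.
    subprocess.check_call([
        "java", "-cp", JAVA_JAR, "org.apache.arrow.tools.Integration",
        "-c", "JSON_TO_ARROW", "-j", json_path, "-a", java_out])
    return cpp_out, java_out
```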
I left the code in there (commented out) to draw attention to this idea. I
was on a plane when I checked this in and didn't have bandwidth to debug the CI
scripts, otherwise I would have. We can pass the paths to the C++ and Java
executables as CLI arguments and/or environment variables, no problem.
I wasn't sure if the executables were already available in CI, or whether I
would need to build them before running the JS tests. Building the C++ and Java
projects in order to run the JS tests doesn't seem ideal, but I'm fine doing it
if we have to.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [JS] Error reading dictionary-encoded integration test files
> ------------------------------------------------------------
>
> Key: ARROW-1693
> URL: https://issues.apache.org/jira/browse/ARROW-1693
> Project: Apache Arrow
> Issue Type: Bug
> Components: JavaScript
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Labels: pull-request-available
> Fix For: 0.8.0
>
> Attachments: dictionary-cpp.arrow, dictionary-java.arrow,
> dictionary.json
>
>
> The JS implementation crashes when reading the dictionary test case from the
> integration tests.
> To replicate, first generate the test files with the Java and C++ implementations:
> {code}
> $ cd ${ARROW_HOME}/integration/
> $ python -c 'from integration_test import generate_dictionary_case;
> generate_dictionary_case().write("dictionary.json")'
> $ ../cpp/debug/debug/json-integration-test --integration
> --json=dictionary.json --arrow=dictionary-cpp.arrow --mode=JSON_TO_ARROW
> $ java -cp
> ../java/tools/target/arrow-tools-0.8.0-SNAPSHOT-jar-with-dependencies.jar
> org.apache.arrow.tools.Integration -c JSON_TO_ARROW -a dictionary-java.arrow
> -j dictionary.json
> {code}
> Then attempt to read the files with the JS implementation:
> {code}
> $ cd ${ARROW_HOME}/js/
> $ ./bin/arrow2csv.js -s dict1_0 -f ../integration/dictionary-{java,cpp}.arrow
> {code}
> Both files result in an error for me on
> [a8f51858|https://github.com/apache/arrow/commit/a8f518588fda471b2e3cc8e0f0064e7c4bb99899]:
> {{TypeError: Cannot read property 'buffer' of undefined}}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)