[ 
https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250682#comment-16250682
 ] 

ASF GitHub Bot commented on ARROW-1693:
---------------------------------------

trxcllnt commented on a change in pull request #1294: ARROW-1693: [JS] Fix 
reading C++ dictionary-encoded vectors
URL: https://github.com/apache/arrow/pull/1294#discussion_r150721453
 
 

 ##########
 File path: js/test/arrows/json/datetime.json
 ##########
 @@ -0,0 +1,1091 @@
+{
 
 Review comment:
   @wesm yes, that would be ideal.
   
   <details><summary>I generated the JSON from the python integration tests 
like this (click to expand)</summary>
   
   ```python
   from integration_test import generate_nested_case
   from integration_test import generate_decimal_case
   from integration_test import generate_datetime_case
   from integration_test import generate_primitive_case
   from integration_test import generate_dictionary_case
   
   generate_nested_case().write("../js/test/arrows/json/nested.json")
   generate_decimal_case().write("../js/test/arrows/json/decimal.json")
   generate_datetime_case().write("../js/test/arrows/json/datetime.json")
   generate_dictionary_case().write("../js/test/arrows/json/dictionary.json")
   generate_primitive_case([7, 10]).write("../js/test/arrows/json/primitive.json")
   generate_primitive_case([0, 0, 0]).write("../js/test/arrows/json/primitive-empty.json")
   ```
   </details>
   
   <details><summary>Then I had to manually edit the "primitive.json" file to 
remove the "binary_nullable" and "binary_nonnullable" columns, because the C++ 
command fails if they're present (click to expand)
   </summary>
   
   ```sh
   $ ../cpp/build/release/json-integration-test \
     --integration --mode=JSON_TO_ARROW \
     --json=./test/arrows/json/primitive.json \
     --arrow=./test/arrows/cpp/file/primitive.arrow
   Found schema: bool_nullable: bool
   bool_nonnullable: bool not null
   int8_nullable: int8
   int8_nonnullable: int8 not null
   int16_nullable: int16
   int16_nonnullable: int16 not null
   int32_nullable: int32
   int32_nonnullable: int32 not null
   int64_nullable: int64
   int64_nonnullable: int64 not null
   uint8_nullable: uint8
   uint8_nonnullable: uint8 not null
   uint16_nullable: uint16
   uint16_nonnullable: uint16 not null
   uint32_nullable: uint32
   uint32_nonnullable: uint32 not null
   uint64_nullable: uint64
   uint64_nonnullable: uint64 not null
   float32_nullable: float
   float32_nonnullable: float not null
   float64_nullable: double
   float64_nonnullable: double not null
   binary_nullable: binary
   binary_nonnullable: binary not null
   utf8_nullable: string
   utf8_nonnullable: string not null
   Error message: Invalid: Encountered non-hex digit
   ```
   </details>
   
   The unit tests rely heavily on [snapshot testing](https://facebook.github.io/jest/docs/en/snapshot-testing.html) to validate the actual values in the vectors. I manually validated the data in the snapshots against the buffers using pyarrow and pandas, but that approach won't scale. Typically snapshot files get checked into version control, but now that we have ~11k snapshots they come to around 21 MB, so I removed them from the repo because we don't want huge files in source control. Instead, the CI server generates the snapshots once up front, then validates each compilation target against them.
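   
   For reference, here's roughly the kind of assertion the snapshot suite boils down to. This is just a sketch: the `readVectors` helper, its import path, and the Arrow file path are placeholders, not the actual test code.
   
   ```ts
   // vectors.test.ts (illustrative only; the real suite iterates over every
   // generated Arrow file and every compilation target)
   import * as fs from 'fs';
   import { readVectors } from '../src/reader'; // placeholder import path
   
   test('primitive vectors match their snapshot', () => {
     const arrow = fs.readFileSync('./test/arrows/cpp/file/primitive.arrow');
     for (const vector of readVectors(arrow)) {
       // On the first run Jest serializes these values into __snapshots__/;
       // later runs fail if the decoded values drift from that file.
       expect(Array.from(vector)).toMatchSnapshot();
     }
   });
   ```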
   
   This will catch any case where compiling the JS to different targets leads to failures (e.g. if the minifiers mangle names they weren't supposed to), but since we're not checking in the snapshot files, the CI server can't tell us when a new PR breaks a snapshot test. We _can_ catch that by running the tests locally, but we can't rely on everyone running the tests locally for each PR before merging.
   
   I'm a bit torn here. On the one hand, I don't want to check 21 MB worth of snapshot files into source control. On the other hand, I don't want to hand-write the ~11k assertions that the snapshot tests represent (which would presumably add up to many MB of test code anyway).
   
   I believe git already compresses files when transferring them over the network. If space on disk is an issue, I could add a post-clone script that compresses the snapshot files after checkout (they're about 3 MB gzipped). Jest doesn't work with compressed snapshot files out of the box, but I could add a step to the test runner to decompress the snapshots before running.
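   
   That decompression step could be a small script run before Jest starts. A minimal sketch, assuming the snapshots are stored as gzipped `.snap.gz` files (that naming and the directory layout are made up for illustration):
   
   ```ts
   // decompress-snapshots.ts (hypothetical pre-test step, e.g. an npm "pretest" script)
   import * as fs from 'fs';
   import * as path from 'path';
   import * as zlib from 'zlib';
   
   const snapshotDir = path.resolve(__dirname, 'test', '__snapshots__');
   
   for (const file of fs.readdirSync(snapshotDir)) {
     if (!file.endsWith('.snap.gz')) { continue; }
     const gz = fs.readFileSync(path.join(snapshotDir, file));
     // Jest only reads plain `.snap` files, so inflate them before the run starts.
     fs.writeFileSync(
       path.join(snapshotDir, file.replace(/\.gz$/, '')),
       zlib.gunzipSync(gz)
     );
   }
   ```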
   
   To your point about using the C++/Java writers to convert the JSON to Arrow buffers on the fly, we should 100% do that. This PR is marginally better since we can at least regenerate the Arrow files easily enough, but ideally we wouldn't check them in at all: we'd pipe them to the node process on the fly, or at a minimum write them to files and clean up afterwards. We'll also want a mode for local dev that skips this step, since incurring the JVM overhead to convert JSON to Arrow files is painful when debugging.
   
   I left the code in there (commented out) to draw attention to this idea. I was on a plane when I checked this in and didn't have the bandwidth to debug the CI scripts, otherwise I would have wired it up. We can pass the paths to the C++ and Java executables as CLI arguments and/or environment variables, no problem.
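   
   Something along these lines is what I have in mind. It's only a sketch: the `ARROW_CPP_EXE` variable name, the `jsonToArrow` helper, and the temp-file handling are placeholders, and we'd want an equivalent path through the Java tool as well.
   
   ```ts
   // json-to-arrow.ts (sketch of generating an Arrow file on the fly from the test setup)
   import { execFileSync } from 'child_process';
   import * as fs from 'fs';
   import * as os from 'os';
   import * as path from 'path';
   
   export function jsonToArrow(jsonPath: string): Buffer {
     // ARROW_CPP_EXE is a made-up env variable pointing at the C++ json-integration-test binary
     const exe = process.env.ARROW_CPP_EXE;
     if (!exe) { throw new Error('set ARROW_CPP_EXE to the json-integration-test binary'); }
     const arrowPath = path.join(os.tmpdir(), `${path.basename(jsonPath, '.json')}.arrow`);
     // Same invocation as the manual command above, just driven from the test runner.
     execFileSync(exe, [
       '--integration', '--mode=JSON_TO_ARROW',
       `--json=${jsonPath}`, `--arrow=${arrowPath}`
     ]);
     try {
       return fs.readFileSync(arrowPath); // hand the buffer straight to the reader tests
     } finally {
       fs.unlinkSync(arrowPath);          // write to a file, then clean up after
     }
   }
   ```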
   
   I wasn't sure if the executables were already available in CI, or whether I 
would need to build them before running the JS tests. Building the C++ and Java 
projects in order to run the JS tests doesn't seem ideal, but I'm fine doing it 
if we have to.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [JS] Error reading dictionary-encoded integration test files
> ------------------------------------------------------------
>
>                 Key: ARROW-1693
>                 URL: https://issues.apache.org/jira/browse/ARROW-1693
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: JavaScript
>            Reporter: Brian Hulette
>            Assignee: Brian Hulette
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>         Attachments: dictionary-cpp.arrow, dictionary-java.arrow, 
> dictionary.json
>
>
> The JS implementation crashes when reading the dictionary test case from the 
> integration tests.
> To replicate, first generate the test files with java and cpp impls:
> {code}
> $ cd ${ARROW_HOME}/integration/
> $ python -c 'from integration_test import generate_dictionary_case; generate_dictionary_case().write("dictionary.json")'
> $ ../cpp/debug/debug/json-integration-test --integration --json=dictionary.json --arrow=dictionary-cpp.arrow --mode=JSON_TO_ARROW
> $ java -cp ../java/tools/target/arrow-tools-0.8.0-SNAPSHOT-jar-with-dependencies.jar org.apache.arrow.tools.Integration -c JSON_TO_ARROW -a dictionary-java.arrow -j dictionary.json
> {code}
> Attempt to read the files with the JS impl:
> {code}
> $ cd ${ARROW_HOME}/js/
> $ ./bin/arrow2csv.js -s dict1_0 -f ../integration/dictionary-{java,cpp}.arrow
> {code}
> Both files result in an error for me on 
> [a8f51858|https://github.com/apache/arrow/commit/a8f518588fda471b2e3cc8e0f0064e7c4bb99899]:
> {{TypeError: Cannot read property 'buffer' of undefined}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
