[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9323: ARROW-10438: [C++][Dataset] Partitioning::Format on nulls

GitBox Fri, 12 Feb 2021 08:06:52 -0800


jorisvandenbossche commented on a change in pull request #9323:
URL: https://github.com/apache/arrow/pull/9323#discussion_r575328856




##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(
+    tempdir, partitioning, null_fallback, partition_keys
+):
     # ARROW-9288 / ARROW-9476
     import pyarrow.parquet as pq
-    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
+
+    table = pa.table({'a': range(9), 'b': [0.0] * 4 + [1.0] * 5})
+
+    if None in partition_keys[0] or None in partition_keys[1]:
+        # Directory partitioning can't handle the first part being null
+        return

Review comment:
       Only return here if `partitioning == "directory"`?
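
A minimal sketch of that suggestion, assuming the limitation noted in the code comment really applies only to directory partitioning. `should_skip_case` is a hypothetical helper for illustration and is not part of the PR:

```python
def should_skip_case(partitioning, partition_keys):
    """Return True when a parametrized case cannot be exercised.

    Hypothetical helper: assumes only directory partitioning is unable
    to represent null partition keys, while hive partitioning can.
    """
    # partition_keys is a pair of key lists, e.g. ([1, None, 3], ["A", "B", "C"])
    has_null = any(None in keys for keys in partition_keys)
    return partitioning == "directory" and has_null
```

Inside the test, the early `return` could then become something like `if should_skip_case(partitioning, partition_keys): pytest.skip("directory partitioning cannot represent null keys")`, so the hive cases with nulls still run and the skipped ones show up in the test report.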

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])

Review comment:
       What does `null_fallback=None` mean? (Based on the docstring above, it seems it can only be a string?)
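
For context, one possible reading is that `None` stands for "use the default label". The sketch below illustrates only that reading; it is an assumption, not the PR's confirmed behaviour. `format_hive_segment` is a hypothetical pure-Python helper and not an Arrow API. The default label `__HIVE_DEFAULT_PARTITION__` is the conventional Hive name for a null partition.

```python
HIVE_DEFAULT_FALLBACK = "__HIVE_DEFAULT_PARTITION__"


def format_hive_segment(name, value, null_fallback=None):
    """Format one hive-style path segment such as ``part=1``.

    Hypothetical illustration: a null key is written using the fallback
    label, and ``null_fallback=None`` is assumed to mean "use the default".
    """
    label = HIVE_DEFAULT_FALLBACK if null_fallback is None else null_fallback
    return f"{name}={label if value is None else value}"
```

Under this reading, `format_hive_segment("part", None, "xyz")` gives `part=xyz`, while leaving `null_fallback` unset gives `part=__HIVE_DEFAULT_PARTITION__`. If that is the intent, the docstring should say so explicitly.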

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -74,15 +74,26 @@ Status KeyValuePartitioning::SetDefaultValuesFromKeys(const Expression& expr,
                                                       RecordBatchProjector* projector) {
   ARROW_ASSIGN_OR_RAISE(auto known_values, ExtractKnownFieldValues(expr));
   for (const auto& ref_value : known_values) {
-    if (!ref_value.second.is_scalar()) {
-      return Status::Invalid("non-scalar partition key ", ref_value.second.ToString());
+    const auto& known_value = ref_value.second;
+    if (known_value.concrete() && !known_value.datum.is_scalar()) {
+      return Status::Invalid("non-scalar partition key ", known_value.datum.ToString());
     }
 
     ARROW_ASSIGN_OR_RAISE(auto match,
                           ref_value.first.FindOneOrNone(*projector->schema()));
 
     if (match.empty()) continue;
-    RETURN_NOT_OK(projector->SetDefaultValue(match, ref_value.second.scalar()));
+
+    const auto& field = projector->schema()->field(match[0]);
+    if (known_value.concrete()) {
+      RETURN_NOT_OK(projector->SetDefaultValue(match, known_value.datum.scalar()));
+    } else if (known_value.valid) {
+      return Status::Invalid(
+          "Partition expression not defined enough to set default value for ",

Review comment:
       What does "not defined enough" mean in practice? (Or what would be an example?)

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1587,33 +1587,54 @@ def test_open_dataset_non_existing_file():
 
 @pytest.mark.parquet
 @pytest.mark.parametrize('partitioning', ["directory", "hive"])
+@pytest.mark.parametrize('null_fallback', ['xyz', None])
 @pytest.mark.parametrize('partition_keys', [
     (["A", "B", "C"], [1, 2, 3]),
     ([1, 2, 3], ["A", "B", "C"]),
     (["A", "B", "C"], ["D", "E", "F"]),
     ([1, 2, 3], [4, 5, 6]),
+    ([1, None, 3], ["A", "B", "C"]),
+    ([1, 2, 3], ["A", None, "C"]),
+    ([None, 2, 3], [None, 2, 3]),
 ])
-def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning,
-                                                  partition_keys):
+def test_open_dataset_partitioned_dictionary_type(

Review comment:
       You added this to a test that is specifically about reading partitioned datasets while inferring the partition fields as dictionary. That is fine (this case also needs to be able to handle that), but this should also work, and so be tested, in the default case where the dictionary type is not inferred.
       And should we also have a test for the writing part? (This one only tests reading.)



