lidavidm commented on a change in pull request #9670:
URL: https://github.com/apache/arrow/pull/9670#discussion_r591666392
##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const {
return repr;
}
-Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression
predicate) {
- FragmentVector fragments;
+namespace {
- for (const auto& fragment : fragments_) {
- ARROW_ASSIGN_OR_RAISE(
- auto simplified,
- SimplifyWithGuarantee(predicate, fragment->partition_expression()));
- if (simplified.IsSatisfiable()) {
- fragments.push_back(fragment);
+// Helper class for efficiently detecting subtrees given fragment partition
expressions.
+// Partition expressions are broken into conjunction members and each member
dictionary
+// encoded to impose a sortable ordering. In addition, subtrees are generated
which span
+// groups of fragments and nested subtrees. After encoding each fragment is
guaranteed to
+// be a descendant of at least one subtree. For example, given fragments in a
+// HivePartitioning with paths:
+//
+// /num=0/al=eh/dat.par
+// /num=0/al=be/dat.par
+// /num=1/al=eh/dat.par
+// /num=1/al=be/dat.par
+//
+// The following subtrees will be introduced:
+//
+// /num=0/
+// /num=0/al=eh/
+// /num=0/al=eh/dat.par
+// /num=0/al=be/
+// /num=0/al=be/dat.par
+// /num=1/
+// /num=1/al=eh/
+// /num=1/al=eh/dat.par
+// /num=1/al=be/
+// /num=1/al=be/dat.par
+struct SubtreeImpl {
+ using expression_code = char32_t;
Review comment:
I guess it's to work with std::basic_string, but then I'm curious why
you're favoring that over std::vector.
##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const {
return repr;
}
-Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression
predicate) {
- FragmentVector fragments;
+namespace {
- for (const auto& fragment : fragments_) {
- ARROW_ASSIGN_OR_RAISE(
- auto simplified,
- SimplifyWithGuarantee(predicate, fragment->partition_expression()));
- if (simplified.IsSatisfiable()) {
- fragments.push_back(fragment);
+// Helper class for efficiently detecting subtrees given fragment partition
expressions.
+// Partition expressions are broken into conjunction members and each member
dictionary
+// encoded to impose a sortable ordering. In addition, subtrees are generated
which span
+// groups of fragments and nested subtrees. After encoding each fragment is
guaranteed to
+// be a descendant of at least one subtree. For example, given fragments in a
+// HivePartitioning with paths:
+//
+// /num=0/al=eh/dat.par
+// /num=0/al=be/dat.par
+// /num=1/al=eh/dat.par
+// /num=1/al=be/dat.par
+//
+// The following subtrees will be introduced:
+//
+// /num=0/
+// /num=0/al=eh/
+// /num=0/al=eh/dat.par
+// /num=0/al=be/
+// /num=0/al=be/dat.par
+// /num=1/
+// /num=1/al=eh/
+// /num=1/al=eh/dat.par
+// /num=1/al=be/
+// /num=1/al=be/dat.par
+struct SubtreeImpl {
+ using expression_code = char32_t;
Review comment:
nit: any particular reason for char32_t over say uint32_t or size_t
(since it is a vector index)? And in any case, it doesn't match the static_cast
below on line 176.
##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const {
return repr;
}
-Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression
predicate) {
- FragmentVector fragments;
+namespace {
- for (const auto& fragment : fragments_) {
- ARROW_ASSIGN_OR_RAISE(
- auto simplified,
- SimplifyWithGuarantee(predicate, fragment->partition_expression()));
- if (simplified.IsSatisfiable()) {
- fragments.push_back(fragment);
+// Helper class for efficiently detecting subtrees given fragment partition
expressions.
+// Partition expressions are broken into conjunction members and each member
dictionary
+// encoded to impose a sortable ordering. In addition, subtrees are generated
which span
+// groups of fragments and nested subtrees. After encoding each fragment is
guaranteed to
+// be a descendant of at least one subtree. For example, given fragments in a
+// HivePartitioning with paths:
+//
+// /num=0/al=eh/dat.par
+// /num=0/al=be/dat.par
+// /num=1/al=eh/dat.par
+// /num=1/al=be/dat.par
+//
+// The following subtrees will be introduced:
+//
+// /num=0/
+// /num=0/al=eh/
+// /num=0/al=eh/dat.par
+// /num=0/al=be/
+// /num=0/al=be/dat.par
+// /num=1/
+// /num=1/al=eh/
+// /num=1/al=eh/dat.par
+// /num=1/al=be/
+// /num=1/al=be/dat.par
+struct SubtreeImpl {
+ using expression_code = char32_t;
Review comment:
Ah - to get the lexicographic sort down below. I think future readers
might appreciate a note about that since it's not immediately obvious + is an
unusual choice of types otherwise.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]