Re: [PR] DRILL-8478. HashPartition memory leak when memory allocation fails with OutOfMemoryException (#2874) (drill)

2024-01-22 Thread via GitHub


shfshihuafeng commented on code in PR #2875:
URL: https://github.com/apache/drill/pull/2875#discussion_r1462854154


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java:
##
@@ -157,11 +162,11 @@ public HashPartition(FragmentContext context, BufferAllocator allocator, Chained
 .build(logger);
 } catch (SchemaChangeException sce) {
   throw new IllegalStateException("Unexpected Schema Change while creating a hash table",sce);
-}
-this.hjHelper = semiJoin ? null : new HashJoinHelper(context, allocator);
-tmpBatchesList = new ArrayList<>();
-if (numPartitions > 1) {
-  allocateNewCurrentBatchAndHV();
+} catch (OutOfMemoryException oom) {
+  close();
+  throw UserException.memoryError(oom)
+  .message("OutOfMemory while allocate memory for hash partition.")

Review Comment:
   I have resubmitted the PR and supplied the test steps.






Re: [PR] DRILL-8478. HashPartition memory leak when memory allocation fails with OutOfMemoryException (#2874) (drill)

2024-01-22 Thread via GitHub


paul-rogers commented on code in PR #2875:
URL: https://github.com/apache/drill/pull/2875#discussion_r1462817821


##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java:
##
@@ -157,11 +162,11 @@ public HashPartition(FragmentContext context, BufferAllocator allocator, Chained
 .build(logger);
 } catch (SchemaChangeException sce) {
   throw new IllegalStateException("Unexpected Schema Change while creating a hash table",sce);
-}
-this.hjHelper = semiJoin ? null : new HashJoinHelper(context, allocator);
-tmpBatchesList = new ArrayList<>();
-if (numPartitions > 1) {
-  allocateNewCurrentBatchAndHV();
+} catch (OutOfMemoryException oom) {
+  close();
+  throw UserException.memoryError(oom)
+  .message("OutOfMemory while allocate memory for hash partition.")

Review Comment:
   Suggested: `"Failed to allocate hash partition."`
   
   The `memoryError()` already indicates that it is an OOM error.
   



##
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/AbstractHashBinaryRecordBatch.java:
##
@@ -1312,7 +1313,9 @@ private void cleanup() {
 }
 // clean (and deallocate) each partition, and delete its spill file
 for (HashPartition partn : partitions) {
-  partn.close();
+  if (Objects.nonNull(partn)) {

Review Comment:
   Simpler `if (partn != null) {`
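   
   For context, a minimal sketch of the guarded cleanup loop this suggests (assuming `partitions` may hold null entries when a partition's constructor failed part-way):
   
   ```java
   // Skip entries that were never constructed because HashPartition's
   // constructor threw; close the rest to free buffers and spill files.
   for (HashPartition partn : partitions) {
     if (partn != null) {
       partn.close();
     }
   }
   ```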






[PR] DRILL-8478. HashPartition memory leak when memory allocation fails with OutOfMemoryException (#2874) (drill)

2024-01-22 Thread via GitHub


shfshihuafeng opened a new pull request, #2875:
URL: https://github.com/apache/drill/pull/2875

   # [DRILL-8478](https://issues.apache.org/jira/browse/DRILL-8478): HashPartition memory leak when memory allocation fails with OutOfMemoryException (#2874)
   
   ## Description
   
   When allocating memory for a HashPartition fails with an OutOfMemoryException, memory is leaked: because the HashPartition object is never fully constructed, it cannot be cleaned up in the closing phase.
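   
   A minimal sketch of the fix, condensed from the diff in the review thread (surrounding constructor code elided; the message string follows the reviewer's suggested wording):
   
   ```java
   // Inside the HashPartition constructor: if allocation fails part-way,
   // release whatever was already allocated before rethrowing, so the
   // partially constructed partition cannot leak.
   try {
     if (numPartitions > 1) {
       allocateNewCurrentBatchAndHV();
     }
   } catch (OutOfMemoryException oom) {
     close(); // frees any buffers acquired before the failure
     throw UserException.memoryError(oom)
         .message("Failed to allocate hash partition.")
         .build(logger);
   }
   ```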
   
   
   ## Documentation
   (Please describe user-visible changes similar to what should appear in the 
Drill documentation.)
   
   ## Testing
   (Please describe how this PR has been tested.)
   





[jira] [Created] (DRILL-8478) mergejoin memory leak on exception

2024-01-22 Thread shihuafeng (Jira)
shihuafeng created DRILL-8478:
-

 Summary: mergejoin memory leak on exception
 Key: DRILL-8478
 URL: https://issues.apache.org/jira/browse/DRILL-8478
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.21.1
Reporter: shihuafeng
 Fix For: 1.21.2


*Describe the bug*
HashPartition leaks memory when a memory allocation fails with an OutOfMemoryException.

*To Reproduce*
Steps to reproduce the behavior:
 # prepare data for TPC-H scale factor 1
 # run TPC-H query 8 with 20 concurrent queries
 # set direct memory to 5 GB
 # when an OutOfMemoryException occurred, stop all queries
 # observe the memory leak

*Expected behavior*
(1) I set \{DRILL_MAX_DIRECT_MEMORY:-"5G"}
(2) I ran sql8 (details in Additional context) with 20 concurrent queries
(3) An OutOfMemoryException occurred when creating the HashPartition

*Error detail, log output or screenshots*
Unable to allocate buffer of size 262144 (rounded from 262140) due to memory limit (41943040). Current allocation: 20447232

 

SQL:
{code:sql}
select
  o_year,
  sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share
from (
  select
    extract(year from o_orderdate) as o_year,
    l_extendedprice * 1.0 as volume,
    n2.n_name as nation
  from hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem,
    hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1,
    hive.tpch1s.nation n2, hive.tpch1s.region
  where p_partkey = l_partkey
    and s_suppkey = l_suppkey
    and l_orderkey = o_orderkey
    and o_custkey = c_custkey
    and c_nationkey = n1.n_nationkey
    and n1.n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and s_nationkey = n2.n_nationkey
    and o_orderdate between date '1995-01-01' and date '1996-12-31'
    and p_type = 'LARGE BRUSHED BRASS'
) as all_nations
group by o_year
order by o_year
{code}





Re: [I] hashpartition leak (drill)

2024-01-22 Thread via GitHub


shfshihuafeng commented on issue #2874:
URL: https://github.com/apache/drill/issues/2874#issuecomment-1905301573

   ok @cgivre 
   





Re: [I] the mergejoin memory leak (drill)

2024-01-22 Thread via GitHub


shfshihuafeng commented on issue #2871:
URL: https://github.com/apache/drill/issues/2871#issuecomment-1905300921

   > @weijunlu, you may have found a bug. The behavior you described is not 
expected.
   > 
   > Just to verify I understand: you ran your query once. Did the query run 
successfully the first time? Or, did your first query fail? If the query 
_worked_ the first time, we have one situation. If the query _failed_ the first 
time, we have another situation.
   > 
   > Then, you ran the _same_ query a second time? This time you got an OOM 
error?
   > 
   > I think we may have up to three distinct issues.
   > 
   > First, I missed one important piece of information when I first read this 
report. You reported an exception: `Memory was leaked by query. Memory leaked: 
(249856)`. This message indicates an actual bug. As I explained, Drill won't 
release the memory back to the OS. But, the query should have released the 
memory back to Netty. The error message says that it did not. This would be a 
bug. Even if the query fails, Drill is required to return memory to Netty. Such 
errors are, unfortunately, hard to track down. Our unit tests have many such 
checks, but your query appears to have uncovered a case that the unit tests do 
not check.
   > 
   > The stack trace suggests that the memory leak was detected as Drill was 
trying to shut down a failed query. So, I suspect your query run may not have 
actually worked. If it did work, you should see in the UI a very detailed 
report on all the operators in the query. Did you get that detailed report?
   > 
   > Second, I wonder if Drill has sufficient memory for the query you are 
trying to run. The TPCH queries tend to be memory-hungry. This query has six 
concurrent sorts. Then, it has five merge joins, plus more sorts. While the 
query runs, all the needed data will be in memory. (It is in memory because 
Drill uses memory to buffer data to keep things fast.)
   > 
   > I do not recall the size of a SF1 TPCH data set. What is the total size of 
the files in your data set?
   > 
   > Back when we used to run TPCH tests regularly, we would use a cluster of 5 
or 10 Drillbits, each with a generous amount of memory, so that Drill can hold 
the data in memory. I suspect that 2 GB of direct memory, on a single Drillbit, 
is probably not enough for this kind of query.
   > 
   > So, I suspect you need more direct memory. The UI says that you are using 
only 34% of a 4GB heap, but ~100% of 2GB direct memory. Try changing this 
allocation. Try allocating 3GB to heap. This will raise heap usage to 45%. 
Then, give as much memory as your machine has to direct memory. If you are 
running SqlLine on the same machine, please run it on another machine instead. 
If you have other services on your machine (a database, an IDE, etc.), I would 
recommend moving Drill to a dedicated machine. You may even consider using a 
cluster of machines to provide sufficient resources.
   > 
   > Once you get a query that succeeds, you can look at the detailed query 
profile to find out how much memory it required. Then, you can reduce the 
direct memory given to Drill to a lower amount, if the query allows.
   > 
   > Once you find a memory level that allows the query to run, you will 
probably avoid the memory leak error. The query will succeed and memory should 
be freed back to Netty as normal. TPCH SF1 queries used to pass. They should 
still work, unless something changed in the last couple of years in one of the 
operators.
   > 
   > Just to be clear, Drill works well on a laptop with the standard amount of 
memory _if you run simple queries on small datasets_. But, if you do classic 
TPCH "big data" queries, you need a cluster with enough capacity — that's why 
we use Hadoop.
   > 
   > The third issue is a minor point: the UI screenshot is a bit odd: it says 
you are at 94% of direct memory. I suspect you are actually at 100%, and the UI 
is dividing by the wrong number. (1000 vs. 1024 or some such).
   
   I think this is the leak point: when the query stops, the temporary memory should be cleaned up.





Re: [I] hashpartition leak (drill)

2024-01-22 Thread via GitHub


cgivre commented on issue #2874:
URL: https://github.com/apache/drill/issues/2874#issuecomment-1905299722

   @shfshihuafeng  Can you please submit this as a pull request?





Re: [I] hashpartition leak (drill)

2024-01-22 Thread via GitHub


shfshihuafeng commented on issue #2874:
URL: https://github.com/apache/drill/issues/2874#issuecomment-1905298953

   I fixed the bug; see the attached patch:
   
[0001-bug-10227615-hashPartition-leak.patch](https://github.com/apache/drill/files/14018533/0001-bug-10227615-hashPartition-leak.patch)
   
   
   





[I] hashpartition leak (drill)

2024-01-22 Thread via GitHub


shfshihuafeng opened a new issue, #2874:
URL: https://github.com/apache/drill/issues/2874

   Before submitting a bug report, please verify that you are using the most 
current version of Drill.  
   
   **Describe the bug**
   HashPartition leaks memory when a memory allocation fails with an OutOfMemoryException
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. prepare data for TPC-H scale factor 1
   2. run TPC-H query 8 with 20 concurrent queries
   3. set direct memory to 5 GB
   4. when an OutOfMemoryException occurred, stop all queries
   5. observe the memory leak
   
   **Expected behavior**
   (1) I set {DRILL_MAX_DIRECT_MEMORY:-"5G"}
   (2) I ran sql8 (details in Additional context) with 20 concurrent queries
   (3) An OutOfMemoryException occurred when creating the HashPartition

   **Error detail, log output or screenshots**
   Unable to allocate buffer of size 262144 (rounded from 262140) due to memory limit (41943040). Current allocation: 20447232
   
   **Drill version**
   drill 2.0
   
   **Additional context**
   Add any other context about the problem here.
   ```
   select  
   o_year,  
   sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as 
mkt_share  
   from ( 
   select  
   extract(year from o_orderdate) as o_year,  
   l_extendedprice * 1.0 as volume,  
   n2.n_name as nation  
   from hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem, 
hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1, 
hive.tpch1s.nation n2, hive.tpch1s.region 
   where  
   p_partkey = l_partkey  
   and s_suppkey = l_suppkey  
   and l_orderkey = o_orderkey  
   and o_custkey = c_custkey  
   and c_nationkey = n1.n_nationkey  
   and n1.n_regionkey = r_regionkey  
   and r_name = 'ASIA'  
   and s_nationkey = n2.n_nationkey  
   and o_orderdate between date '1995-01-01'  
   and date '1996-12-31'  
   and p_type = 'LARGE BRUSHED BRASS') as all_nations  
   group by o_year  
   order by o_year
   ```





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


yaozhu commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905217272

   This is a test case written by testers, without actual usage scenarios. What 
do you mean by combining them?
   





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


yaozhu commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905216330

   > I'm not yet convinced there is a bug in Drill here. I think the issue may 
actually be in your query. My hunch here is that there may be an issue in the 
`ORDER BY` statement. Your inner tables each have columns called `k1`. Then you 
are calling a sort on that without specifying which one it is. You might first 
try adding table names to the `ORDER BY` clause.
   > 
   > Secondly, I'm not trying to be rude but what are you trying to achieve 
with this query? It looks like you are performing a union of a left and right 
join of the same table? Why not combine these? Then you have nested `SELECT *` 
statements that don't really seem to serve any purpose.
   
   This is a test case written by testers, without actual usage scenarios. What 
do you mean by combining them?
   
   
   





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


cgivre commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905132740

   I'm not yet convinced there is a bug in Drill here.  I think the issue may 
actually be in your query.  My hunch here is that there may be an issue in the 
`ORDER BY` statement.  Your inner tables each have columns called `k1`.  Then 
you are calling a sort on that without specifying which one it is.  You might 
first try adding table names to the `ORDER BY` clause.
   
   Secondly, I'm not trying to be rude but what are you trying to achieve with 
this query?   It looks like you are performing a union of a left and right join 
of the same table?  Why not combine these?  Then you have nested `SELECT *` 
statements that don't really seem to serve any purpose. 





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


yaozhu commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905126799

   @cgivre is it clear now?





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


yaozhu commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905105700

   > Can you please provide a screenshot or some explanation that is more 
clear? This is really difficult to follow.
   
   
![image](https://github.com/apache/drill/assets/13361571/82b08db4-1d6a-48aa-9f0e-1db3c7018417)
   





Re: [I] union duplicate key (drill)

2024-01-22 Thread via GitHub


cgivre commented on issue #2873:
URL: https://github.com/apache/drill/issues/2873#issuecomment-1905101608

   Can you please provide a screenshot or some explanation that is more clear?  
This is really difficult to follow.





[I] union duplicate key (drill)

2024-01-22 Thread via GitHub


yaozhu opened a new issue, #2873:
URL: https://github.com/apache/drill/issues/2873

   The MySQL tables `baseall` and `test` have the same table structure and records.
   
   show create table baseall;
   
+-+-+
   | Table   | Create Table 




   |
   
+-+-+
   | baseall | CREATE TABLE `baseall` (
 `k0` tinyint(1) DEFAULT NULL,
 `k1` tinyint DEFAULT NULL,
 `k2` smallint DEFAULT NULL,
 `k3` int DEFAULT NULL,
 `k4` bigint DEFAULT NULL,
 `k5` decimal(9,3) DEFAULT NULL,
 `k6` char(5) DEFAULT NULL,
 `k10` date DEFAULT NULL,
 `k11` datetime DEFAULT NULL,
 `k7` varchar(20) DEFAULT NULL,
 `k8` double DEFAULT NULL,
 `k9` float DEFAULT NULL
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
   
+-+-+
   1 row in set (0.00 sec)
   
   
   
   
   show create table test;
   
+---+--+
   | Table | Create Table   




  |
   
+---+--+
   | test  | CREATE TABLE `test` (
 `k0` tinyint(1) DEFAULT NULL,
 `k1` tinyint DEFAULT NULL,
 `k2` smallint DEFAULT NULL,
 `k3` int DEFAULT NULL,
 `k4` bigint DEFAULT NULL,
 `k5` decimal(9,3) DEFAULT NULL,
 `k6` char(5) DEFAULT NULL,
 `k10` date DEFAULT NULL,
 `k11` datetime DEFAULT NULL,
 `k7` varchar(20) DEFAULT NULL,
 `k8` double DEFAULT NULL,
 `k9` float DEFAULT NULL
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
   
+---+--+
   1 row in set (0.00 sec)
   
   
   
   
   
   

Re: Parquet files with microsecond columns

2024-01-22 Thread Paul Rogers
Hi Peter,

It sounds like you are on the right track: the new option is the quick
short-term solution. The best long-term solution is to generalize Drill's
date/time type, but that would take much more work. (Drill also has a bug
where the treatment of timezones is incorrect, which forces Drill to run in
the UTC time zone -- something that will also require difficult work.)

Given that JDBC works, the problem must be in the web interface, not in
your Parquet implementation. You've solved the problem with a new session
option. The web interface, however, has no sessions: if you set an option
in one call, and do your query in another, Drill will have "forgotten" your
option. Instead, there is a way to attach options to each query. Are you
using that feature?

As I recall, the JSON message to submit a query has an additional field to
hold session options. I do not recall, however, if the web UI added that
feature. Does anyone else know? Two workarounds. First, use your favorite
JSON request tool to submit a query with the option set. Second, set your
option as a system option so it is available to all sessions: ALTER SYSTEM
SET...
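
For illustration, a rough sketch of the per-query approach (the "options"
field in the REST request body and the option name are assumptions to verify
against your Drill build):

```java
// Sketch: POST a query to Drill's REST endpoint with a per-query option,
// since the web API has no session to carry ALTER SESSION state.
// Host, file path, and the "options" field are assumptions here.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestQueryWithOptions {
  public static void main(String[] args) throws Exception {
    String body = """
        {
          "queryType": "SQL",
          "query": "SELECT * FROM dfs.`/data/events.parquet` LIMIT 10",
          "options": { "store.parquet.reader.timestamp_micros_as_int64": true }
        }""";
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8047/query.json"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON rows, or an error payload
  }
}
```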

Thanks,

- Paul



Re: DFDL Standard approved for ISO JTC1 PAS (Pre-Approved Standard)

2024-01-22 Thread Charles Givre
Nice!




DFDL Standard approved for ISO JTC1 PAS (Pre-Approved Standard)

2024-01-22 Thread Mike Beckerle
I received notice today that the DFDL OGF standard is officially headed to
become an ISO standard. The ballot within ISO JTC1 passed 100%.

I will let you know more info when I get it about how we can propagate this
information, any branding considerations ISO JTC1 requires about it, their
trademark rules, etc.
At that point it will make sense to send this info to our users lists, make
a blog post, etc.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


Parquet files with microsecond columns

2024-01-22 Thread Peter Franzen
Hi,

I am using Drill to query Parquet files that have fields of type 
timestamp_micros. By default, Drill truncates those microsecond
values to milliseconds when reading the Parquet files in order to convert them 
to SQL timestamps.

In some of my use cases I need to read the original microsecond values (as 
64-bit values, not SQL timestamps) through Drill, but
this doesn’t seem to be possible (unless I’ve missed something).

I have explored a possible solution to this, and would like to run it by some 
developers more experienced with the Drill code base
before I create a pull request.

My idea is to add two options similar to 
“store.parquet.reader.int96_as_timestamp" to control whether or not microsecond
times and timestamps are truncated to milliseconds. These options would be 
added to “org.apache.drill.exec.ExecConstants" and
"org.apache.drill.exec.server.options.SystemOptionManager", and to 
drill-module.conf:

store.parquet.reader.time_micros_as_int64: false,
store.parquet.reader.timestamp_micros_as_int64: false,
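
A rough sketch of one such declaration, following the existing
int96_as_timestamp pattern (a fragment of ExecConstants; the validator
classes and exact signatures are assumptions to check against the source):

```java
// Sketch: declaring one proposed option the way existing Parquet reader
// options are declared. BooleanValidator/OptionDescription mirror how
// int96_as_timestamp is declared today; verify against the Drill source.
public static final String PARQUET_READER_TIMESTAMP_MICROS_AS_INT64 =
    "store.parquet.reader.timestamp_micros_as_int64";
public static final OptionValidator PARQUET_READER_TIMESTAMP_MICROS_AS_INT64_VALIDATOR =
    new BooleanValidator(PARQUET_READER_TIMESTAMP_MICROS_AS_INT64,
        new OptionDescription("Read Parquet TIMESTAMP_MICROS columns as raw "
            + "64-bit integers instead of truncating to millisecond timestamps."));
```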

These options would then be used in the same places as 
“store.parquet.reader.int96_as_timestamp”:

org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the 
corresponding option is set to true.
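
For example, a hypothetical helper illustrating the type decision those
readers would make (illustrative only, not the actual converter signatures):

```java
import org.apache.drill.common.types.TypeProtos.MinorType;

public class MicrosTypeMapping {
  // Hypothetical helper, not actual Drill code: choose the Drill-facing
  // type for a Parquet TIMESTAMP_MICROS column under the proposed option.
  static MinorType timestampMicrosType(boolean microsAsInt64) {
    return microsAsInt64
        ? MinorType.BIGINT      // keep raw 64-bit microsecond values
        : MinorType.TIMESTAMP;  // current behavior: truncate to millis
  }
}
```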

In addition to this, 
“org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be 
altered to _not_ truncate the min and max
values for time_micros/timestamp_micros if the corresponding option is true. 
This class doesn’t have a reference to an OptionManager, so
my guess is that the two new options must be extracted from the OptionManager 
when the ParquetReaderConfig instance is created.

Filtering on microsecond columns would be done using 64-bit values rather than 
TIME/TIMESTAMP values, e.g.

select * from <table> where <timestamp_micros_column> = 1705914906694751;

I’ve tested the solution outlined above, and it seems to work when using 
sqlline and with the JDBC driver, but not with the web based interface.
Any pointers to the relevant code for that would be appreciated.

An alternative solution to the above could be to intercept all reading of the 
Parquet schemas and modifying the schema to report the
microsecond columns as int64 columns, i.e. to completely discard the 
information that the columns contain time/timestamp values.
This could potentially make parts of the code where it is not obvious that the 
time/timestamp properties of columns are used behave
as expected. However, this variant would not align with how INT96 timestamps 
are handled.

Any thoughts on this idea for how to access microsecond values would be highly 
appreciated.

Thanks,

/Peter