[PR] DRILL-8479: mergejoin leak when depleting incoming batches throws an exception… (drill)
shfshihuafeng opened a new pull request, #2878: URL: https://github.com/apache/drill/pull/2878 … (#2876)

# [DRILL-8479](https://issues.apache.org/jira/browse/DRILL-8479): mergejoin leak when depleting incoming batches throws an exception

## Description
When a fragment fails, close() is called on MergeJoinBatch. But if leftIterator.close() throws an exception, rightIterator.close() is never reached, so the memory it holds cannot be released.

## Documentation
(Please describe user-visible changes similar to what should appear in the Drill documentation.)

## Testing
The test method is the same as in [#2875](https://github.com/apache/drill/pull/2875); only one parameter needs to be modified: set planner.enable_hashjoin = false to ensure the mergejoin operator is used.
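For readers following the thread, a minimal sketch of the failure mode and the fix pattern described above — guaranteeing rightIterator is closed even when leftIterator.close() throws. The try/finally shape is an illustration of the idea, not the exact patch in this PR:

```java
// Sketch of MergeJoinBatch.close(): with a plain sequential close, an exception
// thrown from leftIterator.close() skips rightIterator.close() entirely, leaking
// the buffers it still holds. A try/finally closes both regardless.
@Override
public void close() {
  try {
    leftIterator.close();    // may throw while depleting incoming batches
  } finally {
    rightIterator.close();   // still runs when the left close fails
  }
}
```

If both closes can throw, a utility along the lines of Drill's org.apache.drill.common.AutoCloseables.close(...) gives the same guarantee for a whole list of resources (assuming the iterators are AutoCloseable), attempting every close and collecting the failures.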
Re: [PR] Merge join leak (drill)
shfshihuafeng closed pull request #2877: Merge join leak URL: https://github.com/apache/drill/pull/2877
[PR] Merge join leak (drill)
shfshihuafeng opened a new pull request, #2877: URL: https://github.com/apache/drill/pull/2877

# [DRILL-](https://issues.apache.org/jira/browse/DRILL-): PR Title (Please replace `PR Title` with actual PR Title)

## Description
(Please describe the change. If more than one ticket is fixed, include a reference to those tickets.)

## Documentation
(Please describe user-visible changes similar to what should appear in the Drill documentation.)

## Testing
(Please describe how this PR has been tested.)
Re: [I] mergejoin memory leak when depleting incoming batches throws an exception (drill)
shfshihuafeng commented on issue #2876: URL: https://github.com/apache/drill/issues/2876#issuecomment-1907424021 The scenario is reliably reproduced by the test above. With my fix applied, I can no longer find the leak.
Re: [I] mergejoin memory leak when depleting incoming batches throws an exception (drill)
shfshihuafeng commented on issue #2876: URL: https://github.com/apache/drill/issues/2876#issuecomment-1907421433 https://github.com/apache/drill/issues/2871
Re: [I] mergejoin memory leak when depleting incoming batches throws an exception (drill)
shfshihuafeng commented on issue #2876: URL: https://github.com/apache/drill/issues/2876#issuecomment-1907419612 I fixed it; see the attached patch: [0001-mergejoin-leak.patch](https://github.com/apache/drill/files/14033613/0001-mergejoin-leak.patch)
[I] mergejoin memory leak when depleting incoming batches throws an exception (drill)
shfshihuafeng opened a new issue, #2876: URL: https://github.com/apache/drill/issues/2876

Before submitting a bug report, please verify that you are using the most current version of Drill.

**Describe the bug**
mergejoin memory leak when depleting incoming batches throws an exception, because we cannot close rightIterator when leftIterator throws an exception.

**To Reproduce**
Steps to reproduce the behavior:
1. prepare data for TPC-H scale factor 1
2. run TPC-H query 8 with 20 concurrent sessions
3. set direct memory to 5 GB
4. when an OutOfMemoryException occurs, stop all queries
5. look for the memory leak

**Expected behavior**
When all queries stop, direct memory should be 0 and no leak log lines like the following should appear.
`Allocator(op:2:0:11:MergeJoinPOP) 100/73728/4874240/100 (res/actual/peak/limit)`

**Error detail, log output or screenshots**
`Unable to allocate buffer of size XX (rounded from XX) due to memory limit (). Current allocation: xx`

**Drill version**
1.21.1

**Additional context**
`select o_year, sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share from ( select extract(year from o_orderdate) as o_year, l_extendedprice * 1.0 as volume, n2.n_name as nation from hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem, hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1, hive.tpch1s.nation n2, hive.tpch1s.region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'ASIA' and s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'LARGE BRUSHED BRASS') as all_nations group by o_year order by o_year`
[jira] [Created] (DRILL-8479) mergejoin memory leak on exception
shihuafeng created DRILL-8479: - Summary: mergejoin memory leak on exception Key: DRILL-8479 URL: https://issues.apache.org/jira/browse/DRILL-8479 Project: Apache Drill Issue Type: Bug Components: Execution - Relational Operators Affects Versions: 1.21.1 Reporter: shihuafeng Attachments: 0001-mergejoin-leak.patch

*Describe the bug*
mergejoin leaks when RecordIterator fails to allocate memory and throws an OutOfMemoryException.

*Steps to reproduce the behavior*:
# prepare data for TPC-H scale factor 1
# set direct memory to 5 GB
# set planner.enable_hashjoin = false to ensure the mergejoin operator is used
# set drill.memory.debug.allocator = true (checks for memory leaks)
# run TPC-H query 8 with 20 concurrent sessions
# when an OutOfMemoryException (or a null pointer exception) occurs, stop all queries
# look for the memory leak

*Expected behavior*
When all queries stop, direct memory should be 0 and no leak log lines like the following should appear.
{code:java}
Allocator(op:2:0:11:MergeJoinPOP) 100/73728/4874240/100 (res/actual/peak/limit){code}

*Error detail, log output or screenshots*
{code:java}
Unable to allocate buffer of size XX (rounded from XX) due to memory limit (). Current allocation: xx{code}

[^0001-mergejoin-leak.patch]

sql
{code:java}
select o_year, sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share from ( select extract(year from o_orderdate) as o_year, l_extendedprice * 1.0 as volume, n2.n_name as nation from hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem, hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1, hive.tpch1s.nation n2, hive.tpch1s.region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'ASIA' and s_nationkey = n2.n_nationkey and o_orderdate between date '1995-01-01' and date '1996-12-31' and p_type = 'LARGE BRUSHED BRASS') as all_nations group by o_year order by o_year
{code}
Re: Parquet files with microsecond columns
A reply on your actual topic now. I think that following the implementation of the int96_as_timestamp option will result in the type conversion being done "deeply enough" for Drill. I sympathise a lot with the design thinking in your second option, but I'd personally go the first route and only consider the second route if something wasn't working.

On 2024/01/22 11:36, Peter Franzen wrote:

Hi,

I am using Drill to query Parquet files that have fields of type timestamp_micros. By default, Drill truncates those microsecond values to milliseconds when reading the Parquet files in order to convert them to SQL timestamps.

In some of my use cases I need to read the original microsecond values (as 64-bit values, not SQL timestamps) through Drill, but this doesn’t seem to be possible (unless I’ve missed something).

I have explored a possible solution to this, and would like to run it by some developers more experienced with the Drill code base before I create a pull request.

My idea is to add two options similar to “store.parquet.reader.int96_as_timestamp" to control whether or not microsecond times and timestamps are truncated to milliseconds. These options would be added to “org.apache.drill.exec.ExecConstants" and "org.apache.drill.exec.server.options.SystemOptionManager", and to drill-module.conf:

store.parquet.reader.time_micros_as_int64: false,
store.parquet.reader.timestamp_micros_as_int64: false,

These options would then be used in the same places as “store.parquet.reader.int96_as_timestamp”:

org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the corresponding option is set to true.

In addition to this, “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be altered to _not_ truncate the min and max values for time_micros/timestamp_micros if the corresponding option is true. This class doesn’t have a reference to an OptionManager, so my guess is that the two new options must be extracted from the OptionManager when the ParquetReaderConfig instance is created.

Filtering on microsecond columns would be done using 64-bit values rather than TIME/TIMESTAMP values, e.g.

select * from where = 1705914906694751;

I’ve tested the solution outlined above, and it seems to work when using sqlline and with the JDBC driver, but not with the web based interface. Any pointers to the relevant code for that would be appreciated.

An alternative solution to the above could be to intercept all reading of the Parquet schemas and modifying the schema to report the microsecond columns as int64 columns, i.e. to completely discard the information that the columns contain time/timestamp values. This could potentially make parts of the code where it is not obvious that the time/timestamp properties of columns are used behave as expected. However, this variant would not align with how INT96 timestamps are handled.

Any thoughts on this idea for how to access microsecond values would be highly appreciated.

Thanks,

/Peter
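To make the proposal concrete, here is a rough sketch of how the two new options might be declared, mirroring the existing int96 option. The BooleanValidator/OptionDescription form is assumed from neighboring options in ExecConstants, and the description strings are illustrative, not final wording:

```java
// Hypothetical declarations in org.apache.drill.exec.ExecConstants, modeled on
// PARQUET_READER_INT96_AS_TIMESTAMP; both would default to false in drill-module.conf.
String PARQUET_READER_TIME_MICROS_AS_INT64 =
    "store.parquet.reader.time_micros_as_int64";
OptionValidator PARQUET_READER_TIME_MICROS_AS_INT64_VALIDATOR =
    new BooleanValidator(PARQUET_READER_TIME_MICROS_AS_INT64,
        new OptionDescription("Read Parquet TIME_MICROS columns as INT64 "
            + "instead of truncating them to millisecond TIME values."));

String PARQUET_READER_TIMESTAMP_MICROS_AS_INT64 =
    "store.parquet.reader.timestamp_micros_as_int64";
OptionValidator PARQUET_READER_TIMESTAMP_MICROS_AS_INT64_VALIDATOR =
    new BooleanValidator(PARQUET_READER_TIMESTAMP_MICROS_AS_INT64,
        new OptionDescription("Read Parquet TIMESTAMP_MICROS columns as INT64 "
            + "instead of truncating them to millisecond TIMESTAMP values."));
```

As discussed later in this thread, either option could then be enabled for all sessions with ALTER SYSTEM SET so that web UI queries see it too.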
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
shfshihuafeng commented on code in PR #2875: URL: https://github.com/apache/drill/pull/2875#discussion_r1464222619

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/AbstractHashBinaryRecordBatch.java:
@@ -1312,7 +1312,9 @@ private void cleanup() {
 }
 // clean (and deallocate) each partition, and delete its spill file
 for (HashPartition partn : partitions) {
-  partn.close();
+  if (partn != null) {
+    partn.close();
+  }

Review Comment: The (partn != null) check is necessary; see the comment above on fix idea 1.
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
shfshihuafeng commented on code in PR #2875: URL: https://github.com/apache/drill/pull/2875#discussion_r1464211148

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java:
@@ -157,11 +162,11 @@ public HashPartition(FragmentContext context, BufferAllocator allocator, Chained
     .build(logger);
 } catch (SchemaChangeException sce) {
   throw new IllegalStateException("Unexpected Schema Change while creating a hash table",sce);
-}
-this.hjHelper = semiJoin ? null : new HashJoinHelper(context, allocator);
-tmpBatchesList = new ArrayList<>();
-if (numPartitions > 1) {
-  allocateNewCurrentBatchAndHV();
+} catch (OutOfMemoryException oom) {
+  close();

Review Comment:
### 1. Fix idea
The design is that if any operator fails, the entire operator stack is closed. But `partitions` is an array initialized with nulls; if a HashPartition object is not created successfully, its constructor throws an exception, and every array slot from the failing index onward stays null.

```
for (int part = 0; part < numPartitions; part++) {
  partitions[part] = new HashPartition(context, allocator, baseHashTable,
      buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
      spilledState.getCycle(), numPartitions);
}
```

For example, with an array length of 32 (numPartitions = 32), if the constructor throws at part 10, then partitions[10..31] remain null. The partition at index 10 failed during construction but had already allocated memory; when close() is called, that partition cannot be closed, because its slot is null.

### 2. Another fix idea
Do not throw from the constructor and do not call close() there; instead catch the exception so that the HashPartition object is still created. Then, when close() is called later, its memory can be released:

```
// 1. Add an isException parameter to the HashPartition constructor
HashPartition(FragmentContext context, BufferAllocator allocator,
    ChainedHashTable baseHashTable, RecordBatch buildBatch, RecordBatch probeBatch,
    boolean semiJoin, int recordsPerBatch, SpillSet spillSet, int partNum,
    int cycleNum, int numPartitions, boolean isException)

// 2. Catch the exception so the HashPartition object is still constructed
} catch (OutOfMemoryException oom) {
  // do not call close(), do not rethrow
  isException = true;
}

// 3. Handle the failure in AbstractHashBinaryRecordBatch#initializeBuild
boolean isException = false;
try {
  for (int part = 0; part < numPartitions; part++) {
    if (isException) {
      break;
    }
    partitions[part] = new HashPartition(context, allocator, baseHashTable,
        buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
        spilledState.getCycle(), numPartitions, isException);
  }
} catch (Exception e) {
  isException = true;
}
if (isException) {
  throw UserException.memoryError(exceptions[0])
      .message("Failed to allocate hash partition.")
      .build(logger);
}
```
Re: [PR] DRILL-8474: Add Daffodil Format Plugin (drill)
mbeckerle commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1906827568

Ok, so the geo-ip UDF stuff has no special mechanisms or description of those resource files, so the generic code that "scans" must find them and drag them along automatically. That's the behavior I want.

What is "Drill's 3rd party JAR folder"? If a magic folder just gets dragged over to all nodes, and Drill uses a class loader that arranges for jars in that folder to be searched, then there is very little to do, since a DFDL schema can be just a set of jar files containing related resources and the classes for Daffodil's own UDFs and layers, which are Java code extensions of Daffodil's own kind.
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
paul-rogers commented on code in PR #2875: URL: https://github.com/apache/drill/pull/2875#discussion_r1463921977

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/AbstractHashBinaryRecordBatch.java:
@@ -1312,7 +1312,9 @@ private void cleanup() {
 }
 // clean (and deallocate) each partition, and delete its spill file
 for (HashPartition partn : partitions) {
-  partn.close();
+  if (partn != null) {
+    partn.close();
+  }

Review Comment: The above is OK as a work-around. I wonder, however, where the code added a null pointer to the partition list. That should never happen. If it does, it should be fixed at the point where the null pointer is added to the list. Fixing it here is incomplete: there are other places where we loop through the list, and those will also fail.

## exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java:
@@ -157,11 +162,11 @@ public HashPartition(FragmentContext context, BufferAllocator allocator, Chained
     .build(logger);
 } catch (SchemaChangeException sce) {
   throw new IllegalStateException("Unexpected Schema Change while creating a hash table",sce);
-}
-this.hjHelper = semiJoin ? null : new HashJoinHelper(context, allocator);
-tmpBatchesList = new ArrayList<>();
-if (numPartitions > 1) {
-  allocateNewCurrentBatchAndHV();
+} catch (OutOfMemoryException oom) {
+  close();

Review Comment: This call is _probably_ fine. However, the design is that if any operator fails, the entire operator stack is closed. So, `close()` should be called by the fragment executor. There is probably no harm in calling `close()` here, as long as the `close()` method is safe to call twice. If the fragment executor _does not_ call close when the failure occurs during setup, then there is a bug, since failing to call `close()` results in just this kind of error.
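A minimal sketch of the double-close safety the second comment asks for, using a simple guard flag; the field and helper names are illustrative, not the actual HashPartition members:

```java
// Hypothetical idempotent close(): the guard makes a second call a no-op, so
// both the constructor's error path and the fragment executor may safely call it.
private boolean closed = false;

@Override
public void close() {
  if (closed) {
    return;
  }
  closed = true;
  releaseBuffersAndSpillFiles();  // illustrative helper freeing whatever was allocated
}
```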
Re: [PR] DRILL-8474: Add Daffodil Format Plugin (drill)
cgivre commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1906689793

> > > @cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?
> > > I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree, because, well, since they are part of Drill, and so are their dependencies, this problem doesn't exist.
> >
> > @mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.
>
> If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

I believe that is correct.

> Drill then appears to scan jars for class files for those packages. Not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the drill distributed nodes. But it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed drill nodes? Or is the classpath augmented in some way on the drill nodes so that they see a jar that contains all these classes?
>
> I have two questions:
>
> (1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

UDFs are a bit of a special case, but if they do have dependencies, you also have to include those JAR files in the UDF directory, or in Drill's 3rd party JAR folder. I'm not that good with Maven, but I've often wondered about making a so-called fat JAR which includes the dependencies as part of the UDF JAR file.

> (2) what about non-class files, e.g., things under src/main/resources of the project that go into the jar, but aren't "class" files? How do those things also get moved? How would code running in the drill node access these? The usual method is to call getResource() with a path that gives the location of the resource within a jar file.

Take a look at this UDF: https://github.com/datadistillr/drill-geoip-functions. It has a few external resources, including a CSV file and the MaxMind databases.

> Thanks for any info.
Re: [PR] DRILL-8474: Add Daffodil Format Plugin (drill)
mbeckerle commented on PR #2836: URL: https://github.com/apache/drill/pull/2836#issuecomment-1906561549

> > @cgivre @paul-rogers is there an example of a Drill UDF that is not part of the drill repository tree?
> > I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that drill uses. I can't find any such in the quasi-UDFs that are in the Drill tree, because, well, since they are part of Drill, and so are their dependencies, this problem doesn't exist.
>
> @mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.

If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then Drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages.

Drill then appears to scan jars for class files for those packages. I'm not sure what it is doing with the class files. I imagine it is repackaging them somehow so Drill can use them on the distributed Drill nodes, but it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed Drill nodes? Or is the classpath augmented in some way on the Drill nodes so that they see a jar that contains all these classes?

I have two questions:

(1) What about dependencies? The UDF may depend on libraries which depend on other libraries, etc.

(2) What about non-class files, e.g., things under src/main/resources of the project that go into the jar but aren't class files? How do those things also get moved? How would code running on the Drill node access these? The usual method is to call getResource() with a path that gives the location of the resource within a jar file.

Thanks for any info.
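On question (2), a minimal sketch of the usual pattern for reading a bundled resource. The class name and resource path are hypothetical, and this is plain JVM classpath behavior rather than a Drill-specific mechanism:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical: read a file packaged under src/main/resources of a UDF jar.
// getResourceAsStream(String) resolves the path inside whichever jar the class
// loader sees on that node, so no separate file-distribution step is needed.
public static InputStream openSchema() throws IOException {
  InputStream in = MyDaffodilUdf.class.getClassLoader()
      .getResourceAsStream("schemas/example.dfdl.xsd");
  if (in == null) {
    throw new FileNotFoundException("schemas/example.dfdl.xsd not on classpath");
  }
  return in;
}
```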
Re: Parquet files with microsecond columns
A quick test with authentication enabled, each statement issued separately in the web UI.

set `window.enable` = false; -- disable window functions

select row_number() over (order by 1);

org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION ERROR: Window functions are disabled See Apache Drill JIRA: DRILL-2559

On 2024/01/23 17:56, James Turton wrote: My experience has been that if you've switched on authentication in Drill then the web UI /does/ sustain a session. If not then it doesn't.

On 2024/01/23 09:48, Peter Franzen wrote: Hi Paul, Thanks for your comments. I wasn’t aware that the Web UI doesn’t have sessions; when setting the option at the system level the Web UI behaves as expected. I’ll go ahead and create a pull request within the next few days. /Peter

On 22 Jan 2024, at 21:40, Paul Rogers wrote: [...]
Re: Parquet files with microsecond columns
My experience has been that if you've switched on authentication in Drill then the web UI /does/ sustain a session. If not then it doesn't.

On 2024/01/23 09:48, Peter Franzen wrote: Hi Paul, Thanks for your comments. I wasn’t aware that the Web UI doesn’t have sessions; when setting the option at the system level the Web UI behaves as expected. I’ll go ahead and create a pull request within the next few days. /Peter

On 22 Jan 2024, at 21:40, Paul Rogers wrote: [...]
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
jnturton merged PR #2875: URL: https://github.com/apache/drill/pull/2875
Re: Parquet files with microsecond columns
Hi Paul,

Thanks for your comments. I wasn’t aware that the Web UI doesn’t have sessions; when setting the option at the system level the Web UI behaves as expected.

I’ll go ahead and create a pull request within the next few days.

/Peter

> On 22 Jan 2024, at 21:40, Paul Rogers wrote:
>
> Hi Peter,
>
> It sounds like you are on the right track: the new option is the quick short-term solution. The best long-term solution is to generalize Drill's date/time type, but that would take much more work. (Drill also has a bug where the treatment of timezones is incorrect, which forces Drill to run in the UTC time zone -- something that will also require difficult work.)
>
> Given that JDBC works, the problem must be in the web interface, not in your Parquet implementation. You've solved the problem with a new session option. The web interface, however, has no sessions: if you set an option in one call, and do your query in another, Drill will have "forgotten" your option. Instead, there is a way to attach options to each query. Are you using that feature?
>
> As I recall, the JSON message to submit a query has an additional field to hold session options. I do not recall, however, if the web UI added that feature. Does anyone else know? Two workarounds. First, use your favorite JSON request tool to submit a query with the option set. Second, set your option as a system option so it is available to all sessions: ALTER SYSTEM SET...
>
> Thanks,
>
> - Paul
>
> On Mon, Jan 22, 2024 at 1:38 AM Peter Franzen wrote:
>
>> Hi,
>>
>> I am using Drill to query Parquet files that have fields of type timestamp_micros. By default, Drill truncates those microsecond values to milliseconds when reading the Parquet files in order to convert them to SQL timestamps.
>>
>> In some of my use cases I need to read the original microsecond values (as 64-bit values, not SQL timestamps) through Drill, but this doesn’t seem to be possible (unless I’ve missed something).
>>
>> I have explored a possible solution to this, and would like to run it by some developers more experienced with the Drill code base before I create a pull request.
>>
>> My idea is to add two options similar to “store.parquet.reader.int96_as_timestamp" to control whether or not microsecond times and timestamps are truncated to milliseconds. These options would be added to “org.apache.drill.exec.ExecConstants" and "org.apache.drill.exec.server.options.SystemOptionManager", and to drill-module.conf:
>>
>> store.parquet.reader.time_micros_as_int64: false,
>> store.parquet.reader.timestamp_micros_as_int64: false,
>>
>> These options would then be used in the same places as “store.parquet.reader.int96_as_timestamp”:
>>
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
>> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>>
>> to create an int64 reader instead of a time/timestamp reader when the corresponding option is set to true.
>>
>> In addition to this, “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be altered to _not_ truncate the min and max values for time_micros/timestamp_micros if the corresponding option is true. This class doesn’t have a reference to an OptionManager, so my guess is that the two new options must be extracted from the OptionManager when the ParquetReaderConfig instance is created.
>>
>> Filtering on microsecond columns would be done using 64-bit values rather than TIME/TIMESTAMP values, e.g.
>>
>> select * from where = 1705914906694751;
>>
>> I’ve tested the solution outlined above, and it seems to work when using sqlline and with the JDBC driver, but not with the web based interface. Any pointers to the relevant code for that would be appreciated.
>>
>> An alternative solution to the above could be to intercept all reading of the Parquet schemas and modifying the schema to report the microsecond columns as int64 columns, i.e. to completely discard the information that the columns contain time/timestamp values. This could potentially make parts of the code where it is not obvious that the time/timestamp properties of columns are used behave as expected. However, this variant would not align with how INT96 timestamps are handled.
>>
>> Any thoughts on this idea for how to access microsecond values would be highly appreciated.
>>
>> Thanks,
>>
>> /Peter
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
shfshihuafeng commented on PR #2875: URL: https://github.com/apache/drill/pull/2875#issuecomment-1905599592

> [An unused import crept in](https://github.com/apache/drill/actions/runs/7622586264/job/20762475705#step:6:1277), could you remove it please?

Removed it.
Re: [PR] DRILL-8478. HashPartition memory leak when it fails to allocate memory with OutOfMemoryException (#2874) (drill)
jnturton commented on PR #2875: URL: https://github.com/apache/drill/pull/2875#issuecomment-1905598192

[An unused import crept in](https://github.com/apache/drill/actions/runs/7622586264/job/20762475705#step:6:1277), could you remove it please?