Re: Running cartesian joins on Drill

2017-05-11 Thread Aman Sinha
Muhammad,
The join condition ‘a = b or (a is null && b is null)’ works.  Internally,
this is converted to ‘a is not distinct from b’, which Drill can process.
For some reason, if the second form is supplied directly in the user query, it
is not recognized and ends up being treated as a Cartesian join condition.
Drill leverages Calcite for this (see CALCITE-1200 for some background).
Can you file a JIRA for this?

-Aman

From: "Aman Sinha (asi...@mapr.com)" 
Date: Thursday, May 11, 2017 at 4:29 PM
To: dev , user 
Cc: Shadi Khalifa 
Subject: Re: Running cartesian joins on Drill


I think Muhammad may be trying to run his original query with IS NOT DISTINCT
FROM.  That discussion got side-tracked into Cartesian joins because his query
was not getting planned and the error was about a Cartesian join.

Muhammad, can you try the equivalent version below?  You mentioned the
rewrite, but did you try the rewritten version?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong"  wrote:



I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:



0: jdbc:drill:zk=local> alter session set `planner.enable_nljoin_for_scalar_only` = false;
+-------+-------------------------------------------------+
|  ok   |                     summary                     |
+-------+-------------------------------------------------+
| true  | planner.enable_nljoin_for_scalar_only updated.  |
+-------+-------------------------------------------------+
1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      ProjectAllowDup(*=[$0], *0=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          Project(T2¦¦*=[$0])
00-06            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])
00-03          Project(T3¦¦*=[$0])
00-05            Scan(groupscan=[EasyGroupScan [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana"  wrote:



But the query I provided failed to be planned because it's a Cartesian join,
although I've set the option you mentioned to false. Is there a reason why
Drill's rules wouldn't physically implement the logical join in my query as a
nested loop join?



*-*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana



On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:



> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>

> On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:

>

> Hi,
>
> I have one question here.. if we have to use a Cartesian join in Drill,
> do we have to follow some workaround like Shadi mentioned: adding a
> dummy column on the fly that has the value 1 in both tables and then
> joining on that column, leading to a match of every row of the first
> table with every row of the second table, hence a Cartesian product?
> OR
> If we just don't specify a join condition, like
> select a.*, b.* from tt1 as a, tt2 b; will it internally treat this
> query as a Cartesian join?
>
> Regards,
> *Anup Tiwari*
>

> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  wrote:
>
> > Cartesian joins in Drill are implemented as nested loop joins, and I
> > think you should see that reflected in the resultant query plan when
> > you run explain plan on the query.
> >
> > Yes, Cartesian joins/nested loop joins are expensive because you’re
> > effectively doing an MxN read of your tables.  There are more
> > efficient ways of processing a nested loop join, e.g., by creating an
> > index on the


[jira] [Created] (DRILL-5506) Apache Drill Querying data from compressed .zip file

2017-05-11 Thread john li (JIRA)
john li created DRILL-5506:
--

 Summary: Apache Drill Querying data from compressed .zip file
 Key: DRILL-5506
 URL: https://issues.apache.org/jira/browse/DRILL-5506
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Drill
Affects Versions: 1.10.0
Reporter: john li


Referring to the previous issue:
https://issues.apache.org/jira/browse/DRILL-2806

According to the remark Steven Phillips added in a comment (16/Apr/15 21:50):
"The only compression codecs that work with Drill out of the box are gz, and
bz2. Additional codecs can be added by including the relevant libraries in the
Drill classpath."

I would like to learn how to use Apache Drill to query data from a compressed
.zip file. The only default compression codecs that work with Apache Drill are
gz and bz2, but additional codecs can apparently be added by including the
relevant libraries in the Drill classpath.

Could you please provide step-by-step instructions so that I can understand
exactly how to add a "zip" codec and how to include the relevant libraries in
the Drill classpath?




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115860398
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-                                  AllocatedBufferPtr* allocatedBuffer,
+/*
+ *  Decode the length of the message from bufWithLen and then read entire message from the socket.
+ *  Parameters:
+ *  bufWithLen          - in param  - buffer containing the bytes which has length of the RPC message/encrypted chunk
+ *  bufferWithLenBytes  - out param - buffer pointer which points to memory allocated in this function and has the
+ *                                    entire one RPC message / encrypted chunk along with the length of the message
+ *  lengthBytesLength   - out param - bytes of bufWithLen which contains the length of the entire RPC message or
+ *                                    encrypted chunk
+ *  lengthDecodeHandler - in param  - function pointer with length decoder to use. For encrypted chunk we use
+ *                                    lengthDecode and for plain RPC message we use rpcLengthDecode.
+ *  Return:
+ *  status_t    - QRY_SUCCESS    - In case of success.
+ *              - QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM - In cases of error.
+ */
+status_t DrillClientImpl::readLenBytesFromSocket(ByteBuf_t bufWithLen, AllocatedBufferPtr* bufferWithLenBytes,
+        uint32_t& lengthBytesLength, lengthDecoder lengthDecodeHandler) {
+
+    uint32_t rmsgLen = 0;
+    size_t bytes_read = 0;
+    size_t leftover = 0;
+    boost::system::error_code error;
+    *bufferWithLenBytes = NULL;
+    size_t bufferWithLenBytesSize = 0;
+
+    bytes_read = (this->*lengthDecodeHandler)(bufWithLen, rmsgLen);
+    lengthBytesLength = bytes_read;
+
+    DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Length bytes = " << bytes_read << std::endl;)
+    DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Msg Length = " << rmsgLen << std::endl;)
+
+    if(rmsgLen>0){
+        leftover = LEN_PREFIX_BUFLEN - bytes_read;
+
+        // Allocate a buffer for reading all the bytes in bufWithLen and length number of bytes
+        bufferWithLenBytesSize = rmsgLen + bytes_read;
+        *bufferWithLenBytes = new AllocatedBuffer(bufferWithLenBytesSize);
+
+        if(*bufferWithLenBytes == NULL){
+            return handleQryError(QRY_CLIENT_OUTOFMEM, getMessage(ERR_QRY_OUTOFMEM), NULL);
+        }
+
+        DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "DrillClientImpl::readLenBytesFromSocket: Allocated and locked buffer: [ "
+                                          << *bufferWithLenBytes << ", size = " << bufferWithLenBytesSize << " ]\n";)
+
+        // Copy the memory of bufWithLen into bufferWithLenBytes
+        memcpy((*bufferWithLenBytes)->m_pBuffer, bufWithLen, LEN_PREFIX_BUFLEN);
+
+        DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Copied bufWithLen into bufferWithLenBytes. "
+                                          << "Now reading data (rmsgLen - leftover) : " << (rmsgLen - leftover)
+                                          << std::endl;)
+
+        // Read the entire data left from socket and copy to currentBuffer.
+        ByteBuf_t b = (*bufferWithLenBytes)->m_pBuffer + LEN_PREFIX_BUFLEN;
+        size_t bytesToRead = rmsgLen - leftover;
+
+        while(1){
+            bytes_read = this->m_socket.read_some(boost::asio::buffer(b, bytesToRead), error);
+            if(error) break;
+            DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Data Message: actual bytes read = " << bytes_read << std::endl;)
+            if(bytes_read == bytesToRead) break;
+            bytesToRead -= bytes_read;
+            b += bytes_read;
+        }
+    } else {
+        return handleQryError(QRY_INTERNAL_ERROR, getMessage(ERR_QRY_INVREADLEN), NULL);
+    }
+
+    return error ? handleQryError(QRY_COMM_ERROR, getMessage(ERR_QRY_COMMERR, error.message().c_str()), NULL)
+                 : QRY_SUCCESS;
+}
+
+
+/*
+ *  Function to read entire RPC message from socket and decode it to InboundRpcMessage
+ *  Parameters:
+ *  _buf            - in param  - Buffer containing the length bytes.
+ *  allocatedBuffer - out param - Buffer containing the length bytes and entire RPC message bytes.
+ *  msg             - out param - Decoded InBoundRpcMessage from the bytes in allocatedBuffer
+ *  Return:
+ *  status_t    - QRY_SUCCESS   - In case of success.
+ *              - QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM -

[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115614489
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---

[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115877434
  
--- Diff: contrib/native/client/src/clientlib/utils.cpp ---
@@ -111,4 +111,52 @@ AllocatedBuffer::~AllocatedBuffer(){
 m_bufSize = 0;
 }
 
+EncryptionContext::EncryptionContext(const bool& encryptionReqd, const int& wrapChunkSize, const int& rawSendSize) {
+    this->m_bEncryptionReqd = encryptionReqd;
+    this->m_maxWrapChunkSize = wrapChunkSize;
+    this->m_rawWrapSendSize = rawSendSize;
+}
+
+EncryptionContext::EncryptionContext() {
+    this->m_bEncryptionReqd = false;
+    // SASL Framework only allows 3 octet length field during negotiation so maximum wrap message
+    // length can be 16MB i.e. 0XFF
--- End diff --

Nice catch! I am changing it to 65536 now based on recent findings.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115612998
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-AllocatedBufferPtr* allocatedBuffer,
+/*
+ *  Decode the length of the message from bufWithLen and then read entire 
message from the socket.
+ *  Parameters:
+ *  bufWithLen  - in param  - buffer containing the bytes 
which has length of the RPC message/encrypted chunk
+ *  bufferWithLenBytes  - out param - buffer pointer which points to 
memory allocated in this function and has the
+ *entire one RPC message / 
encrypted chunk along with the length of the message
+ *  lengthBytesLength   - out param - bytes of bufWithLen which 
contains the length of the entire RPC message or
+ *encrypted chunk
+ *  lengthDecodeHandler - in param  - function pointer with length 
decoder to use. For encrypted chunk we use
+ *lengthDecode and for plain RPC 
message we use rpcLengthDecode.
+ *  Return:
+ *  status_t- QRY_SUCCESS- In case of success.
+ *  - 
QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM - In cases of error.
+ */
+status_t DrillClientImpl::readLenBytesFromSocket(ByteBuf_t bufWithLen, 
AllocatedBufferPtr* bufferWithLenBytes,
+   uint32_t& lengthBytesLength, lengthDecoder lengthDecodeHandler) 
{
+
+uint32_t rmsgLen = 0;
+size_t bytes_read = 0;
+size_t leftover = 0;
+boost::system::error_code error;
+*bufferWithLenBytes = NULL;
+size_t bufferWithLenBytesSize = 0;
+
+bytes_read = (this->*lengthDecodeHandler)(bufWithLen, rmsgLen);
+lengthBytesLength = bytes_read;
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Length bytes = " << bytes_read 
<< std::endl;)
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Msg Length = " << rmsgLen << 
std::endl;)
+
+if(rmsgLen>0){
+leftover = LEN_PREFIX_BUFLEN - bytes_read;
+
+// Allocate a buffer for reading all the bytes in bufWithLen and 
length number of bytes
+   bufferWithLenBytesSize = rmsgLen + bytes_read;
+*bufferWithLenBytes = new AllocatedBuffer(bufferWithLenBytesSize);
+
+if(*bufferWithLenBytes == NULL){
+return handleQryError(QRY_CLIENT_OUTOFMEM, 
getMessage(ERR_QRY_OUTOFMEM), NULL);
+}
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << 
"DrillClientImpl::readLenBytesFromSocket: Allocated and locked buffer: [ "
+  << *bufferWithLenBytes << ", 
size = " << bufferWithLenBytesSize << " ]\n";)
+
+// Copy the memory of bufWithLen into bufferWithLenBytesSize
+memcpy((*bufferWithLenBytes)->m_pBuffer, bufWithLen, 
LEN_PREFIX_BUFLEN);
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Copied bufWithLen into 
bufferWithLenBytes. "
+  << "Now reading data (rmsgLen - 
leftover) : " << (rmsgLen - leftover)
+  << std::endl;)
+
+// Read the entire data left from socket and copy to currentBuffer.
+ByteBuf_t b = (*bufferWithLenBytes)->m_pBuffer + LEN_PREFIX_BUFLEN;
+size_t bytesToRead = rmsgLen - leftover;
+
+while(1){
+bytes_read = this->m_socket.read_some(boost::asio::buffer(b, 
bytesToRead), error);
+if(error) break;
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Data Message: actual 
bytes read = " << bytes_read << std::endl;)
+if(bytes_read == bytesToRead) break;
+bytesToRead -= bytes_read;
+b += bytes_read;
+}
+} else {
+return handleQryError(QRY_INTERNAL_ERROR, 
getMessage(ERR_QRY_INVREADLEN), NULL);
+}
+
+return error ? handleQryError(QRY_COMM_ERROR, 
getMessage(ERR_QRY_COMMERR, error.message().c_str()), NULL)
+ : QRY_SUCCESS;
+}
+
+
+/*
+ *  Function to read entire RPC message from socket and decode it to 
InboundRpcMessage
+ *  Parameters:
+ *  _buf- in param  - Buffer containing the length bytes.
+ *  allocatedBuffer - out param - Buffer containing the length bytes 
and entire RPC message bytes.
+ *  msg - out param - Decoded InBoundRpcMessage from the 
bytes in allocatedBuffer
+ *  Return:
+ *  status_t- QRY_SUCCESS   - In case of success.
+ *  - 
QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM - 

[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115557153
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-                                  AllocatedBufferPtr* allocatedBuffer,
+status_t DrillClientImpl::readLenBytesFromSocket(ByteBuf_t bufWithLen, AllocatedBufferPtr* bufferWithLenBytes,
+        uint32_t& lengthBytesLength, lengthDecoder lengthDecodeHandler) {
+
+    uint32_t rmsgLen = 0;
+    size_t bytes_read = 0;
+    size_t leftover = 0;
+    boost::system::error_code error;
+    *bufferWithLenBytes = NULL;
+    size_t bufferWithLenBytesSize = 0;
+
+    bytes_read = (this->*lengthDecodeHandler)(bufWithLen, rmsgLen);
--- End diff --

_lengthDecoder_ is a function-pointer type that accepts a member function of
the DrillClientImpl class with that signature.




[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115532000
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-AllocatedBufferPtr* allocatedBuffer,
+/*
+ *  Decode the length of the message from bufWithLen and then read entire message from the socket.
+ *  Parameters:
+ *  bufWithLen          - in param  - buffer containing the bytes which has length of the RPC message/encrypted chunk
+ *  bufferWithLenBytes  - out param - buffer pointer which points to memory allocated in this function and has the
+ *                                    entire one RPC message / encrypted chunk along with the length of the message
+ *  lengthBytesLength   - out param - bytes of bufWithLen which contains the length of the entire RPC message or
--- End diff --

Changed to "_lengthFieldLength_" to match the Java side, which borrows it from
LengthFieldBasedFrameDecoder.




[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115614854
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---

[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115527833
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -370,6 +453,33 @@ void DrillClientImpl::handleHShakeReadTimeout(const boost::system::error_code &
     return;
 }
 
+/*
+ * Checks if the client has explicitly expressed interest in encrypted connections only. It looks for the
+ * USERPROP_ENCRYPTION connection string property. If set to true then returns true, else returns false.
+ */
+bool DrillClientImpl::clientNeedsEncryption(const DrillUserProperties* userProperties) {
+    bool needsEncryption = false;
+    // check if userProperties is null
+    if(!userProperties) {
+        return needsEncryption;
+    }
+
+    // Loop through the properties to find USERPROP_ENCRYPTION and its value
+    for (size_t i = 0; i < userProperties->size(); i++) {
--- End diff --

DrillUserProperties holds both a _map_ and a _vector_. The _vector_ stores the
actual property key/value pairs, while the _map_ stores each property key with
bits indicating _USERPROP_FLAGS_SERVERPROP|USERPROP_FLAGS_STRING_. The map is
later used during the handshake to separate client-only properties from the
ones the server needs to receive along with the handshake message.




[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115612782
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-AllocatedBufferPtr* allocatedBuffer,
+    return error ? handleQryError(QRY_COMM_ERROR, getMessage(ERR_QRY_COMMERR, error.message().c_str()), NULL)
--- End diff --

Memory for bufferWithLenBytes is deallocated by the caller. Added a comment to 
the function header.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r11119
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-AllocatedBufferPtr* allocatedBuffer,
+/*
+ *  Decode the length of the message from bufWithLen and then read entire message from the socket.
+ *  Parameters:
+ *  bufWithLen  - in param  - buffer containing the bytes which has length of the RPC message/encrypted chunk
+ *  bufferWithLenBytes  - out param - buffer pointer which points to memory allocated in this function and has the
--- End diff --

bufferWithLenBytes may not be a complete RPC Msg. Changed to _bufWithLen 
--> bufWithLenField_ and _bufferWithLenBytes --> bufWithDataAndLenBytes_


---


[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115530271
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -495,26 +612,45 @@ connectionStatus_t DrillClientImpl::handleAuthentication(const DrillUserProperti
 }
 }
 
+std::stringstream errorMsg;
--- End diff --

Fixed the name and a bug where, in the success case with auth only, the wrong 
error message was printed.


---


[GitHub] drill pull request #809: Drill-4335: C++ client changes for supporting encry...

2017-05-11 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/809#discussion_r115861013
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -854,75 +990,328 @@ void DrillClientImpl::waitForResults(){
 }
 }
 
-status_t DrillClientImpl::readMsg(ByteBuf_t _buf,
-AllocatedBufferPtr* allocatedBuffer,
+/*
+ *  Decode the length of the message from bufWithLen and then read entire message from the socket.
+ *  Parameters:
+ *  bufWithLen  - in param  - buffer containing the bytes which has length of the RPC message/encrypted chunk
+ *  bufferWithLenBytes  - out param - buffer pointer which points to memory allocated in this function and has the
+ *entire one RPC message / encrypted chunk along with the length of the message
+ *  lengthBytesLength   - out param - bytes of bufWithLen which contains the length of the entire RPC message or
+ *encrypted chunk
+ *  lengthDecodeHandler - in param  - function pointer with length decoder to use. For encrypted chunk we use
+ *lengthDecode and for plain RPC message we use rpcLengthDecode.
+ *  Return:
+ *  status_t- QRY_SUCCESS- In case of success.
+ *  - QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM - In cases of error.
+ */
+status_t DrillClientImpl::readLenBytesFromSocket(ByteBuf_t bufWithLen, AllocatedBufferPtr* bufferWithLenBytes,
+   uint32_t& lengthBytesLength, lengthDecoder lengthDecodeHandler) {
+
+uint32_t rmsgLen = 0;
+size_t bytes_read = 0;
+size_t leftover = 0;
+boost::system::error_code error;
+*bufferWithLenBytes = NULL;
+size_t bufferWithLenBytesSize = 0;
+
+bytes_read = (this->*lengthDecodeHandler)(bufWithLen, rmsgLen);
+lengthBytesLength = bytes_read;
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Length bytes = " << bytes_read << std::endl;)
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Msg Length = " << rmsgLen << std::endl;)
+
+if(rmsgLen>0){
+leftover = LEN_PREFIX_BUFLEN - bytes_read;
+
+// Allocate a buffer for reading all the bytes in bufWithLen and length number of bytes
+bufferWithLenBytesSize = rmsgLen + bytes_read;
+*bufferWithLenBytes = new AllocatedBuffer(bufferWithLenBytesSize);
+
+if(*bufferWithLenBytes == NULL){
+return handleQryError(QRY_CLIENT_OUTOFMEM, getMessage(ERR_QRY_OUTOFMEM), NULL);
+}
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "DrillClientImpl::readLenBytesFromSocket: Allocated and locked buffer: [ "
+  << *bufferWithLenBytes << ", size = " << bufferWithLenBytesSize << " ]\n";)
+
+// Copy the memory of bufWithLen into bufferWithLenBytes
+memcpy((*bufferWithLenBytes)->m_pBuffer, bufWithLen, LEN_PREFIX_BUFLEN);
+
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Copied bufWithLen into bufferWithLenBytes. "
+  << "Now reading data (rmsgLen - leftover) : " << (rmsgLen - leftover)
+  << std::endl;)
+
+// Read the entire data left from socket and copy to currentBuffer.
+ByteBuf_t b = (*bufferWithLenBytes)->m_pBuffer + LEN_PREFIX_BUFLEN;
+size_t bytesToRead = rmsgLen - leftover;
+
+while(1){
+bytes_read = this->m_socket.read_some(boost::asio::buffer(b, bytesToRead), error);
+if(error) break;
+DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Data Message: actual bytes read = " << bytes_read << std::endl;)
+if(bytes_read == bytesToRead) break;
+bytesToRead -= bytes_read;
+b += bytes_read;
+}
+} else {
+return handleQryError(QRY_INTERNAL_ERROR, getMessage(ERR_QRY_INVREADLEN), NULL);
+}
+
+return error ? handleQryError(QRY_COMM_ERROR, getMessage(ERR_QRY_COMMERR, error.message().c_str()), NULL)
+ : QRY_SUCCESS;
+}
+
+
+/*
+ *  Function to read entire RPC message from socket and decode it to InboundRpcMessage
+ *  Parameters:
+ *  _buf- in param  - Buffer containing the length bytes.
+ *  allocatedBuffer - out param - Buffer containing the length bytes and entire RPC message bytes.
+ *  msg - out param - Decoded InBoundRpcMessage from the bytes in allocatedBuffer
+ *  Return:
+ *  status_t- QRY_SUCCESS   - In case of success.
+ *  - QRY_COMM_ERROR/QRY_INTERNAL_ERROR/QRY_CLIENT_OUTOFMEM - 

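The quoted `readLenBytesFromSocket` decodes a length prefix and then loops on a `read_some`-style call until the whole framed message has arrived, since a socket read may return fewer bytes than requested. A minimal, self-contained sketch of that read-until-complete pattern (the `Source` type and `readFully` name are illustrative stand-ins, not Drill's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a socket: like boost::asio's read_some, a single call may
// return fewer bytes than requested (here capped at 3 to force short reads).
struct Source {
    std::vector<uint8_t> data;
    size_t pos = 0;
    size_t read_some(uint8_t* dst, size_t want) {
        size_t n = std::min({want, data.size() - pos, static_cast<size_t>(3)});
        std::memcpy(dst, data.data() + pos, n);
        pos += n;
        return n;
    }
};

// Read exactly `want` bytes, tolerating short reads, mirroring the
// while(1) loop in the diff above.
bool readFully(Source& src, uint8_t* dst, size_t want) {
    while (want > 0) {
        size_t n = src.read_some(dst, want);
        if (n == 0) return false;  // source exhausted: treat as an error
        dst += n;
        want -= n;
    }
    return true;
}
```

The key point the diff's loop captures is the same: advance the destination pointer and decrement the remaining count after every partial read, and only stop once the full message length (decoded from the prefix) has been consumed.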
Re: Running cartesian joins on Drill

2017-05-11 Thread Aman Sinha
I think Muhammad may be trying to run his original query with IS NOT DISTINCT 
FROM.   That discussion got side-tracked into Cartesian joins because his query 
was not getting planned and the error was about Cartesian join.

Muhammad,  can you try with the equivalent version below ?  You mentioned the 
rewrite but did you try the rewritten version ?



SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc

LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM

`dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (


`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID` IS 
NULL) )



On 5/11/17, 3:23 PM, "Zelaine Fong"  wrote:



I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my 
output:



0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;

+---+-+

|  ok   | summary |

+---+-+

| true  | planner.enable_nljoin_for_scalar_only updated.  |

+---+-+

1 row selected (0.137 seconds)

0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;

+--+--+

| text | json |

+--+--+

| 00-00Screen

00-01  ProjectAllowDup(*=[$0], *0=[$1])

00-02NestedLoopJoin(condition=[true], joinType=[inner])

00-04  Project(T2¦¦*=[$0])

00-06Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

00-03  Project(T3¦¦*=[$0])

00-05Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])



-- Zelaine



On 5/11/17, 3:17 PM, "Muhammad Gelbana"  wrote:



But the query I provided failed to be planned because it's a cartesian

join, although I've set the option you mentioned to false. Is there a

reason why wouldn't Drill rules physically implement the logical join 
in my

query to a nested loop join ?



*-*

*Muhammad Gelbana*

http://www.linkedin.com/in/mgelbana



On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:



> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even

> without an explicit join condition, the query should use the Cartesian

> join/nested loop join.

>

> -- Zelaine

>

> On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:

>

> Hi,

>

> I have one question here.. so if we have to use Cartesian join in 
Drill

> then do we have to follow some workaround like Shadi mention : 
adding a

> dummy column on the fly that has the value 1 in both tables and 
then

> join

> on that column leading to having a match of every row of the first

> table

> with every row of the second table, hence do a Cartesian product?

> OR

> If we just don't specify join condition like :

> select a.*, b.* from tt1 as a, tt2 b; then will it internally 
treat

> this

> query as Cartesian join.

>

> Regards,

> *Anup Tiwari*

>

> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  
wrote:

>

> > Cartesian joins in Drill are implemented as nested loop joins, 
and I

> think

> > you should see that reflected in the resultant query plan when 
you

> run

> > explain plan on the query.

> >

> > Yes, Cartesian joins/nested loop joins are expensive because 
you’re

> > effectively doing an MxN read of your tables.  There are more

> efficient

> > ways of processing a nested loop join, e.g., by creating an 
index on

> the

> > larger table in the join and then using that index to do lookups

> into that

> > table.  That way, the nested loop join cost is the cost of 
creating

> the

> > index + M, where M is the number of rows in the smaller table 
and

> assuming

> > the lookup cost into the index does minimize the amount of data 
read

> of the

> > second table.  Drill currently doesn’t do this.

> >

> > -- Zelaine

> >

> > On 5/8/17, 9:09 AM, "Muhammad Gelbana"  
wrote:

> >

> > I believe 

[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116864
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/PersistentStoreConfig.java ---
@@ -106,9 +113,14 @@ protected StoreConfigBuilder(InstanceSerializer serializer) {
   return this;
 }
 
+public StoreConfigBuilder setMaxCapacity(int maxCapacity){
--- End diff --

`) {`


---


[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116935
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/store/InMemoryPersistentStore.java ---
@@ -15,28 +15,40 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.drill.exec.testing.store;
+package org.apache.drill.exec.store.sys.store;
 
 import java.util.Iterator;
 import java.util.Map;
 import java.util.concurrent.ConcurrentMap;
+import java.util.concurrent.ConcurrentSkipListMap;
+import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.locks.ReadWriteLock;
 import java.util.concurrent.locks.ReentrantReadWriteLock;
 
-import com.google.common.collect.Iterables;
-import com.google.common.collect.Maps;
 import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.exception.StoreException;
 import org.apache.drill.exec.exception.VersionMismatchException;
 import org.apache.drill.exec.store.sys.BasePersistentStore;
 import org.apache.drill.exec.store.sys.PersistentStoreMode;
-import org.apache.drill.exec.store.sys.store.DataChangeVersion;
 
-public class NoWriteLocalStore extends BasePersistentStore {
+import com.google.common.collect.Iterables;
+
+public class InMemoryPersistentStore extends BasePersistentStore {
+  // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(InMemoryPersistentStore.class);
+
   private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
   private final AutoCloseableLock readLock = new AutoCloseableLock(readWriteLock.readLock());
   private final AutoCloseableLock writeLock = new AutoCloseableLock(readWriteLock.writeLock());
-  private final ConcurrentMap store = Maps.newConcurrentMap();
+  private ConcurrentMap store;
   private int version = -1;
+  private int maxCapacity;
--- End diff --

final, here and below


---


[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116220
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/store/InMemoryPersistentStore.java ---
@@ -15,28 +15,40 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.drill.exec.testing.store;
+package org.apache.drill.exec.store.sys.store;
 
 import java.util.Iterator;
 import java.util.Map;
 import java.util.concurrent.ConcurrentMap;
+import java.util.concurrent.ConcurrentSkipListMap;
+import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.locks.ReadWriteLock;
 import java.util.concurrent.locks.ReentrantReadWriteLock;
 
-import com.google.common.collect.Iterables;
-import com.google.common.collect.Maps;
 import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.exception.StoreException;
 import org.apache.drill.exec.exception.VersionMismatchException;
 import org.apache.drill.exec.store.sys.BasePersistentStore;
 import org.apache.drill.exec.store.sys.PersistentStoreMode;
-import org.apache.drill.exec.store.sys.store.DataChangeVersion;
 
-public class NoWriteLocalStore extends BasePersistentStore {
+import com.google.common.collect.Iterables;
+
+public class InMemoryPersistentStore extends BasePersistentStore {
+  // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(InMemoryPersistentStore.class);
+
   private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
   private final AutoCloseableLock readLock = new AutoCloseableLock(readWriteLock.readLock());
   private final AutoCloseableLock writeLock = new AutoCloseableLock(readWriteLock.writeLock());
-  private final ConcurrentMap store = Maps.newConcurrentMap();
+  private ConcurrentMap store;
   private int version = -1;
+  private int maxCapacity;
+  private AtomicInteger currentSize = new AtomicInteger();
+
+  public InMemoryPersistentStore(int maximumCapacity) throws StoreException {
--- End diff --

Does not throw?


---


[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116899
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/store/InMemoryPersistentStore.java ---
@@ -15,28 +15,40 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.drill.exec.testing.store;
+package org.apache.drill.exec.store.sys.store;
 
 import java.util.Iterator;
 import java.util.Map;
 import java.util.concurrent.ConcurrentMap;
+import java.util.concurrent.ConcurrentSkipListMap;
+import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.locks.ReadWriteLock;
 import java.util.concurrent.locks.ReentrantReadWriteLock;
 
-import com.google.common.collect.Iterables;
-import com.google.common.collect.Maps;
 import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.exception.StoreException;
 import org.apache.drill.exec.exception.VersionMismatchException;
 import org.apache.drill.exec.store.sys.BasePersistentStore;
 import org.apache.drill.exec.store.sys.PersistentStoreMode;
-import org.apache.drill.exec.store.sys.store.DataChangeVersion;
 
-public class NoWriteLocalStore extends BasePersistentStore {
+import com.google.common.collect.Iterables;
+
+public class InMemoryPersistentStore extends BasePersistentStore {
+  // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(InMemoryPersistentStore.class);
+
   private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
   private final AutoCloseableLock readLock = new AutoCloseableLock(readWriteLock.readLock());
   private final AutoCloseableLock writeLock = new AutoCloseableLock(readWriteLock.writeLock());
-  private final ConcurrentMap store = Maps.newConcurrentMap();
+  private ConcurrentMap store;
--- End diff --

final


---


[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116596
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -138,7 +138,10 @@ drill.exec: {
 class: 
"org.apache.drill.exec.store.sys.store.provider.ZookeeperPersistentStoreProvider",
 local: {
   path: "/tmp/drill",
-  write: true
+  inmemory: {
+write: false,
+capacity: 1000
--- End diff --

Remove this line, otherwise `if (config.hasPath(ExecConstants.SYS_STORE_CAPACITY_PROFILES))` is always true.


---


[GitHub] drill pull request #827: DRILL-5481: Persist profiles in-memory only with a ...

2017-05-11 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/827#discussion_r116116310
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/store/InMemoryPersistentStore.java ---
@@ -93,6 +105,14 @@ public void put(final String key, final V value, final DataChangeVersion dataCha
 throw new VersionMismatchException("Version mismatch detected", dataChangeVersion.getVersion());
   }
   store.put(key, value);
+  currentSize.incrementAndGet();
+
+  if (currentSize.get() > maxCapacity) {
+//Pop Out Oldest
+((ConcurrentSkipListMap) store).pollLastEntry();
--- End diff --

Rather than cast, declare store as this type?
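The `put` logic under review bounds the store by evicting via `pollLastEntry()` once the configured capacity is exceeded. A single-threaded C++ sketch of the same idea, using `std::map` (ordered by key) in place of `ConcurrentSkipListMap`; `BoundedStore` is a hypothetical name, and the real Java class additionally guards access with a read/write lock:

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// A key-ordered store that, once over capacity, drops the entry with the
// greatest key -- the analogue of ConcurrentSkipListMap.pollLastEntry().
class BoundedStore {
    std::map<std::string, std::string> store_;
    size_t maxCapacity_;
public:
    explicit BoundedStore(size_t maxCapacity) : maxCapacity_(maxCapacity) {}

    void put(const std::string& key, const std::string& value) {
        store_[key] = value;
        if (store_.size() > maxCapacity_) {
            // Evict the greatest key; with keys ordered so that older
            // entries sort last, this pops the oldest entry.
            store_.erase(std::prev(store_.end()));
        }
    }

    bool contains(const std::string& key) const { return store_.count(key) != 0; }
    size_t size() const { return store_.size(); }
};
```

Note that which entry is "oldest" depends entirely on the key ordering, which is presumably why the review asks for the field to be declared with the concrete skip-list type rather than the plain `ConcurrentMap` interface that lacks `pollLastEntry()`.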


---


Re: Running cartesian joins on Drill

2017-05-11 Thread Zelaine Fong
I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s my output:

0: jdbc:drill:zk=local> alter session set 
`planner.enable_nljoin_for_scalar_only` = false;
+---+-+
|  ok   | summary |
+---+-+
| true  | planner.enable_nljoin_for_scalar_only updated.  |
+---+-+
1 row selected (0.137 seconds)
0: jdbc:drill:zk=local> explain plan for select * from 
dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
+--+--+
| text | json |
+--+--+
| 00-00Screen
00-01  ProjectAllowDup(*=[$0], *0=[$1])
00-02NestedLoopJoin(condition=[true], joinType=[inner])
00-04  Project(T2¦¦*=[$0])
00-06Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])
00-03  Project(T3¦¦*=[$0])
00-05Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`], 
files=[file:/Users/zfong/foo.csv]]])

-- Zelaine

On 5/11/17, 3:17 PM, "Muhammad Gelbana"  wrote:

But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:
>
> Hi,
>
> I have one question here.. so if we have to use Cartesian join in 
Drill
> then do we have to follow some workaround like Shadi mention : adding 
a
> dummy column on the fly that has the value 1 in both tables and then
> join
> on that column leading to having a match of every row of the first
> table
> with every row of the second table, hence do a Cartesian product?
> OR
> If we just don't specify join condition like :
> select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
> this
> query as Cartesian join.
>
> Regards,
> *Anup Tiwari*
>
> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  wrote:
>
> > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
> > you should see that reflected in the resultant query plan when you
> run
> > explain plan on the query.
> >
> > Yes, Cartesian joins/nested loop joins are expensive because you’re
> > effectively doing an MxN read of your tables.  There are more
> efficient
> > ways of processing a nested loop join, e.g., by creating an index on
> the
> > larger table in the join and then using that index to do lookups
> into that
> > table.  That way, the nested loop join cost is the cost of creating
> the
> > index + M, where M is the number of rows in the smaller table and
> assuming
> > the lookup cost into the index does minimize the amount of data read
> of the
> > second table.  Drill currently doesn’t do this.
> >
> > -- Zelaine
> >
> > On 5/8/17, 9:09 AM, "Muhammad Gelbana"  wrote:
> >
> > I believe clhubert is referring to this discussion
> >  > cartesian-product-in-apache-drill#post1>
> > .
> >
> > So why Drill doesn't transform this query into a nested join
> query ?
> > Simply
> > because there is no Calcite rule to transform it into a nested
> loop
> > join ?
> > Is it not technically possible to write such Rule or is it
> feasible so
> > I
> > may take on this challenge ?
> >
> > Also pardon me for repeating my question but I fail to find an
> answer
> > in
> > your replies, why doesn't Drill just run a cartesian join ?
> Because
> > it's
> > expensive regarding resources (i.e. CPU\Network\RAM) ?
> >
> > Thanks a lot Shadi for the query, it works for me.
> >
> > *-*
> > *Muhammad Gelbana*
> > http://www.linkedin.com/in/mgelbana
> >
> > On Mon, May 8, 2017 at 

Re: Running cartesian joins on Drill

2017-05-11 Thread Muhammad Gelbana
But the query I provided failed to be planned because it's a cartesian
join, although I've set the option you mentioned to false. Is there a
reason why wouldn't Drill rules physically implement the logical join in my
query to a nested loop join ?

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:

> Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> without an explicit join condition, the query should use the Cartesian
> join/nested loop join.
>
> -- Zelaine
>
> On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:
>
> Hi,
>
> I have one question here.. so if we have to use Cartesian join in Drill
> then do we have to follow some workaround like Shadi mention : adding a
> dummy column on the fly that has the value 1 in both tables and then
> join
> on that column leading to having a match of every row of the first
> table
> with every row of the second table, hence do a Cartesian product?
> OR
> If we just don't specify join condition like :
> select a.*, b.* from tt1 as a, tt2 b; then will it internally treat
> this
> query as Cartesian join.
>
> Regards,
> *Anup Tiwari*
>
> On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  wrote:
>
> > Cartesian joins in Drill are implemented as nested loop joins, and I
> think
> > you should see that reflected in the resultant query plan when you
> run
> > explain plan on the query.
> >
> > Yes, Cartesian joins/nested loop joins are expensive because you’re
> > effectively doing an MxN read of your tables.  There are more
> efficient
> > ways of processing a nested loop join, e.g., by creating an index on
> the
> > larger table in the join and then using that index to do lookups
> into that
> > table.  That way, the nested loop join cost is the cost of creating
> the
> > index + M, where M is the number of rows in the smaller table and
> assuming
> > the lookup cost into the index does minimize the amount of data read
> of the
> > second table.  Drill currently doesn’t do this.
> >
> > -- Zelaine
> >
> > On 5/8/17, 9:09 AM, "Muhammad Gelbana"  wrote:
> >
> > I believe clhubert is referring to this discussion
> >  > cartesian-product-in-apache-drill#post1>
> > .
> >
> > So why Drill doesn't transform this query into a nested join
> query ?
> > Simply
> > because there is no Calcite rule to transform it into a nested
> loop
> > join ?
> > Is it not technically possible to write such Rule or is it
> feasible so
> > I
> > may take on this challenge ?
> >
> > Also pardon me for repeating my question but I fail to find an
> answer
> > in
> > your replies, why doesn't Drill just run a cartesian join ?
> Because
> > it's
> > expensive regarding resources (i.e. CPU\Network\RAM) ?
> >
> > Thanks a lot Shadi for the query, it works for me.
> >
> > *-*
> > *Muhammad Gelbana*
> > http://www.linkedin.com/in/mgelbana
> >
> > On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <
> khal...@cs.queensu.ca>
> > wrote:
> >
> > > Hi Muhammad,
> > >
> > > I did the following as a workaround to have Cartesian product.
> The
> > basic
> > > idea is to create a dummy column on the fly that has the value
> 1 in
> > both
> > > tables and then join on that column leading to having a match
> of
> > every row
> > > of the first table with every row of the second table, hence
> do a
> > Cartesian
> > > product. This might not be the most efficient way but it will
> do the
> > job.
> > >
> > > *Original Query:*
> > > SELECT * FROM
> > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
> > > 2147483647) `t0`
> > > INNER JOIN
> > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT
> > > 2147483647) `t1`
> > > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> > > LIMIT 2147483647
> > >
> > > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to
> tables
> > one
> > > and two, respectively. Names don't really matter, just need to
> be
> > unique):*
> > > SELECT * FROM
> > > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> > > INNER JOIN
> > > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> > > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> > > WHERE 
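The ON clause in the original query above, `a = b OR (a IS NULL AND b IS NULL)`, is exactly SQL's null-safe equality, `IS NOT DISTINCT FROM`: two NULLs compare equal, a NULL and a value do not. A compact C++ model of that semantics, using `std::optional` as the nullable type (the function name is illustrative):

```cpp
#include <cassert>
#include <optional>

// Null-safe equality in the style of SQL's IS NOT DISTINCT FROM:
// an empty optional plays the role of NULL.
template <typename T>
bool isNotDistinctFrom(const std::optional<T>& a, const std::optional<T>& b) {
    if (!a && !b) return true;   // both NULL: considered equal
    if (!a || !b) return false;  // exactly one NULL: not equal
    return *a == *b;             // both present: ordinary equality
}
```

Incidentally, `std::optional`'s built-in `operator==` already implements exactly this rule, which makes it a convenient mental model for why `IS NOT DISTINCT FROM` is a true equivalence (and hence usable as an equi-join condition) while plain `=` is not when NULLs are involved.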

[jira] [Created] (DRILL-5505) Enabling exchanges increased the external sorts spill count by 2 times

2017-05-11 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5505:


 Summary: Enabling exchanges increased the external sorts spill count by 2 times
 Key: DRILL-5505
 URL: https://issues.apache.org/jira/browse/DRILL-5505
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Rahul Challapalli
Assignee: Paul Rogers


git.commit.id.abbrev=1e0a14c

Based on the profile, the below query spilled 32 times
{code}
ALTER SESSION SET `exec.sort.disable_managed` = false;
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.disable_exchanges` = true;
alter session set `planner.memory.max_query_memory_per_node` = 62914560;
select count(*) from (select * from dfs.`/drill/testdata/resource-manager/250wide-small.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
{code}

Now if I enable exchanges, with everything else the same, the same query spills 66 
times. I have attached the two profiles and the log file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[GitHub] drill issue #827: DRILL-5481: Persist profiles in-memory only with a max cap...

2017-05-11 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/827
  
@sudheeshkatkam & @ppadma ... please review and bless the PR if everything looks fine!


---


[GitHub] drill pull request #832: DRILL-5504: Vector validator to diagnose offset vec...

2017-05-11 Thread paul-rogers
GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/832

DRILL-5504: Vector validator to diagnose offset vector issues

Validates offset vectors in VarChar and repeated vectors. Validates the
special case of repeated VarChar vectors (two layers of offsets.)

Provides two new session variables to turn on validation. One enables
the existing operator (iterator) validation, the other adds vector
validation. This allows validation to occur in a “production” Drill
(without restarting Drill with assertions, as previously required.)

Unit tests validate the validator. Another test validates the
integration, but requires manual steps, so is ignored by default.

This version is first-cut: all work is done within a single class.
Allows back-porting to an earlier version to solve a specific issue. A
revision should move some of the work into generated code (or refactor
vectors to allow outside access), since offset vectors appear for each
subclass; not on a base class that would allow generic operations.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/paul-rogers/drill DRILL-5504

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #832


commit 175e592419ca6bda1fd0259cc42b033616facc3d
Author: Paul Rogers 
Date:   2017-05-11T19:46:15Z

DRILL-5504: Vector validator to diagnose offset vector issues

Validates offset vectors in VarChar and repeated vectors. Validates the
special case of repeated VarChar vectors (two layers of offsets.)

Provides two new session variables to turn on validation. One enables
the existing operator (iterator) validation, the other adds vector
validation. This allows validation to occur in a “production” Drill
(without restarting Drill with assertions, as previously required.)

Unit tests validate the validator. Another test validates the
integration, but requires manual steps, so is ignored by default.

This version is first-cut: all work is done within a single class.
Allows back-porting to an earlier version to solve a specific issue. A
revision should move some of the work into generated code (or refactor
vectors to allow outside access), since offset vectors appear for each
subclass; not on a base class that would allow generic operations.






[GitHub] drill issue #819: DRILL-5419: Calculate return string length for literals & ...

2017-05-11 Thread jinfengni
Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/819
  
+1

The revised commit looks good to me. Thanks for taking the effort to 
separate the return type inference logic from the function scope logic. You 
may consider adding a brief description of that change, plus the refactoring 
of part of the function code-gen logic (ValueReference etc.).

  




Re: Running cartesian joins on Drill

2017-05-11 Thread Zelaine Fong
Provided `planner.enable_nljoin_for_scalar_only` is set to false, a query 
without an explicit join condition should be planned as a Cartesian 
join/nested loop join.
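For reference, a minimal sketch of that (the CSV path is hypothetical; this
mirrors the explain-plan example earlier in the thread):

```sql
-- Allow nested loop joins for non-scalar subqueries, so the planner can
-- produce a Cartesian (nested loop) join even without a join condition.
alter session set `planner.enable_nljoin_for_scalar_only` = false;

-- Implicit cross join: every row of t1 is paired with every row of t2.
-- The resulting plan should contain NestedLoopJoin(condition=[true]).
explain plan for
select * from dfs.`/path/to/foo.csv` t1, dfs.`/path/to/foo.csv` t2;
```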

-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:

Hi,

I have one question here: if we have to use a Cartesian join in Drill,
do we have to follow a workaround like the one Shadi mentioned (adding a
dummy column on the fly that has the value 1 in both tables and then
joining on that column, so that every row of the first table matches
every row of the second table, hence a Cartesian product)?
OR
If we just don't specify a join condition, e.g.
select a.*, b.* from tt1 as a, tt2 b;
will Drill internally treat this query as a Cartesian join?
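Separately, for the null-safe join discussed earlier in this thread:
`IS NOT DISTINCT FROM` supplied directly may fail to plan, but the
equivalent equality-or-both-null form works (table and column names here
are placeholders):

```sql
-- Internally converted to `t0.UserID IS NOT DISTINCT FROM t1.UserID`,
-- which Drill can plan as a regular (non-Cartesian) join.
select *
from `t0`
inner join `t1`
  on (`t0`.`UserID` = `t1`.`UserID`
      or (`t0`.`UserID` is null and `t1`.`UserID` is null));
```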

Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana"  wrote:
>
> ​I believe ​clhubert is referring to this discussion
>  cartesian-product-in-apache-drill#post1>
> .
>
> So why Drill doesn't transform this query into a nested join query ?
> Simply
> because there is no Calcite rule to transform it into a nested loop
> join ?
> Is it not technically possible to write such Rule or is it feasible so
> I
> may take on this challenge ?
>
> Also pardon me for repeating my question but I fail to find an answer
> in
> your replies, why doesn't Drill just run a cartesian join ? Because
> it's
> expensive regarding resources (i.e. CPU\Network\RAM) ?
>
> Thanks a lot Shadi for the query, it works for me.
>
> *-*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa 
> wrote:
>
> > Hi Muhammad,
> >
> > I did the following as a workaround to have Cartesian product. The
> basic
> > idea is to create a dummy column on the fly that has the value 1 in
> both
> > tables and then join on that column leading to having a match of
> every row
> > of the first table with every row of the second table, hence do a
> Cartesian
> > product. This might not be the most efficient way but it will do the
> job.
> >
> > *Original Query:*
> > SELECT * FROM
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t0`
> > INNER JOIN
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t1`
> > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> > LIMIT 2147483647
> >
> > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
> one
> > and two, respectively. Names don't really matter, just need to be
> unique):*
> > SELECT * FROM
> > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> > INNER JOIN
> > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> > LIMIT 2147483647
> >
> > Regards
> >
> >
> > *Shadi Khalifa, PhD*
> > Postdoctoral Fellow
> > Cognitive Analytics Development Hub
> > Centre for Advanced Computing
> > Queen’s University
> > (613) 533-6000 x78347
> > http://cac.queensu.ca
> >
> > I'm just a neuron in the society collective brain
> >
> > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
> > *
> >

Re: Running cartesian joins on Drill

2017-05-11 Thread Anup Tiwari
Hi,

I have one question here: if we have to use a Cartesian join in Drill,
do we have to follow a workaround like the one Shadi mentioned (adding a
dummy column on the fly that has the value 1 in both tables and then
joining on that column, so that every row of the first table matches
every row of the second table, hence a Cartesian product)?
OR
If we just don't specify a join condition, e.g.
select a.*, b.* from tt1 as a, tt2 b;
will Drill internally treat this query as a Cartesian join?

Regards,
*Anup Tiwari*

On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong  wrote:

> Cartesian joins in Drill are implemented as nested loop joins, and I think
> you should see that reflected in the resultant query plan when you run
> explain plan on the query.
>
> Yes, Cartesian joins/nested loop joins are expensive because you’re
> effectively doing an MxN read of your tables.  There are more efficient
> ways of processing a nested loop join, e.g., by creating an index on the
> larger table in the join and then using that index to do lookups into that
> table.  That way, the nested loop join cost is the cost of creating the
> index + M, where M is the number of rows in the smaller table and assuming
> the lookup cost into the index does minimize the amount of data read of the
> second table.  Drill currently doesn’t do this.
>
> -- Zelaine
>
> On 5/8/17, 9:09 AM, "Muhammad Gelbana"  wrote:
>
> ​I believe ​clhubert is referring to this discussion
>  cartesian-product-in-apache-drill#post1>
> .
>
> So why Drill doesn't transform this query into a nested join query ?
> Simply
> because there is no Calcite rule to transform it into a nested loop
> join ?
> Is it not technically possible to write such Rule or is it feasible so
> I
> may take on this challenge ?
>
> Also pardon me for repeating my question but I fail to find an answer
> in
> your replies, why doesn't Drill just run a cartesian join ? Because
> it's
> expensive regarding resources (i.e. CPU\Network\RAM) ?
>
> Thanks a lot Shadi for the query, it works for me.
>
> *-*
> *Muhammad Gelbana*
> http://www.linkedin.com/in/mgelbana
>
> On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa 
> wrote:
>
> > Hi Muhammad,
> >
> > I did the following as a workaround to have Cartesian product. The
> basic
> > idea is to create a dummy column on the fly that has the value 1 in
> both
> > tables and then join on that column leading to having a match of
> every row
> > of the first table with every row of the second table, hence do a
> Cartesian
> > product. This might not be the most efficient way but it will do the
> job.
> >
> > *Original Query:*
> > SELECT * FROM
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t0`
> > INNER JOIN
> > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
> > 2147483647) `t1`
> > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
> > LIMIT 2147483647
> >
> > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
> one
> > and two, respectively. Names don't really matter, just need to be
> unique):*
> > SELECT * FROM
> > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
> > INNER JOIN
> > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
> > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
> > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
> > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
> > LIMIT 2147483647
> >
> > Regards
> >
> >
> > *Shadi Khalifa, PhD*
> > Postdoctoral Fellow
> > Cognitive Analytics Development Hub
> > Centre for Advanced Computing
> > Queen’s University
> > (613) 533-6000 x78347
> > http://cac.queensu.ca
> >
> > I'm just a neuron in the society collective brain
> >
> > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
> > *
> >
> > Please consider your environmental responsibility before printing this
> > e-mail
> >
> >
> >
> >
> > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
>