[jira] [Created] (DRILL-4976) Querying Parquet files on S3 pulls

2016-10-28 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created DRILL-4976:
--

 Summary: Querying Parquet files on S3 pulls 
 Key: DRILL-4976
 URL: https://issues.apache.org/jira/browse/DRILL-4976
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Affects Versions: 1.8.0
Reporter: Uwe L. Korn


Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored 
in S3, the underlying s3a implementation requests orders of magnitude too much 
data. Given sufficiently large seeks, the following HTTP pattern is observed:

* GET bytes=8k-100M
* GET bytes=2M-100M
* GET bytes=4M-100M

Although the HTTP requests were normally aborted before all the data was
sent by the server, the amount that went over the network was still about
10-15x the size of the input files, i.e. for a file of 100M, sometimes 1G
of data was transferred over the network.

A fix for this is the {{fs.s3a.experimental.input.fadvise=random}} mode, which 
will be introduced with Hadoop 3.
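A back-of-the-envelope model of the over-fetch (plain Java, illustrative numbers only; this is not the actual s3a code):

```java
public class S3aReadAmplification {
    static final long FILE_SIZE = 100L * 1024 * 1024; // a 100M object

    // Sequential fadvise: each seek reopens an open-ended GET (bytes=offset-EOF),
    // so the server starts streaming everything after the offset.
    static long sequentialFadviseBytes(long offset, long len) {
        return FILE_SIZE - offset;
    }

    // Random fadvise: a bounded range request fetches only what is needed.
    static long randomFadviseBytes(long offset, long len) {
        return len;
    }

    public static void main(String[] args) {
        // three seeks like the GET pattern above, each reading a 64k column chunk
        long[][] reads = { {8L * 1024, 64L * 1024},
                           {2L * 1024 * 1024, 64L * 1024},
                           {4L * 1024 * 1024, 64L * 1024} };
        long seq = 0, rnd = 0;
        for (long[] r : reads) {
            seq += sequentialFadviseBytes(r[0], r[1]);
            rnd += randomFadviseBytes(r[0], r[1]);
        }
        System.out.printf("sequential: %d bytes, random: %d bytes (%dx less)%n",
                seq, rnd, seq / rnd);
    }
}
```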




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4977) Reading parquet metadata cache from S3 with fadvise=random and Hadoop 3 generates a large number of requests

2016-10-28 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created DRILL-4977:
--

 Summary: Reading parquet metadata cache from S3 with 
fadvise=random and Hadoop 3 generates a large number of requests
 Key: DRILL-4977
 URL: https://issues.apache.org/jira/browse/DRILL-4977
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Affects Versions: 1.8.0
 Environment: Hadoop 3.0
Reporter: Uwe L. Korn


When using the new {{fs.s3a.experimental.input.fadvise=random}} mode for 
accessing Parquet files stored in S3, we see a significant improvement in 
query performance but a slowdown in query planning. This is due to the way the 
metadata file is read: each chunk of 8000 bytes generates a new GET request to 
S3. By indicating with {{FSDataInputStream.setReadahead(metadata-filesize)}} 
that we will read the whole file, this behaviour can be circumvented.
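The request count is just a ceiling division; a small sketch with an assumed 4 MB metadata file (illustrative size, not from the ticket):

```java
public class MetadataReadRequests {
    // GETs needed to read fileSize bytes when the client round-trips in
    // readahead-sized chunks (ceiling division).
    static long requestCount(long fileSize, long readahead) {
        return (fileSize + readahead - 1) / readahead;
    }

    public static void main(String[] args) {
        long metadataSize = 4_000_000L; // hypothetical 4 MB metadata cache file
        System.out.println(requestCount(metadataSize, 8_000L));       // 500 GETs
        System.out.println(requestCount(metadataSize, metadataSize)); // 1 GET
    }
}
```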





[jira] [Created] (DRILL-4978) Parquet metadata cache on S3 is always renewed

2016-10-28 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created DRILL-4978:
--

 Summary: Parquet metadata cache on S3 is always renewed
 Key: DRILL-4978
 URL: https://issues.apache.org/jira/browse/DRILL-4978
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.8.0
 Environment: Hadoop s3a storage
Reporter: Uwe L. Korn


As directory modification times are not tracked by S3 (see 
https://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-aws/tools/hadoop-aws/index.html#Warning_2:_Because_Object_stores_dont_track_modification_times_of_directories
 ) the Parquet metadata is always renewed during query planning.

This could be addressed by either:
 * for the case of s3a, checking the modification times of all Parquet files 
in the directory
 * deactivating the metadata cache for s3a
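The first option can be sketched as deriving a directory mtime from the newest file under the prefix (plain Java with java.nio.file standing in for the S3 listing; not Drill code):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.stream.Stream;

public class DirFreshness {
    // S3 exposes no directory mtime, so derive one from the newest object
    // under the prefix; the metadata cache is stale only if this exceeds
    // the cache file's own mtime.
    static FileTime newestFileTime(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .map(DirFreshness::mtime)
                        .max(FileTime::compareTo)
                        .orElse(FileTime.fromMillis(0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static FileTime mtime(Path p) {
        try {
            return Files.getLastModifiedTime(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean cacheIsStale(FileTime cacheMtime, Path dir) {
        return newestFileTime(dir).compareTo(cacheMtime) > 0;
    }

    public static void main(String[] args) {
        System.out.println("newest file mtime: " + newestFileTime(Paths.get(".")));
    }
}
```

Note this trades one directory mtime check for a LIST plus per-file mtime lookups, which is itself extra S3 traffic.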





[jira] [Created] (DRILL-4979) Make dataport configurable

2016-10-28 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created DRILL-4979:
--

 Summary: Make dataport configurable
 Key: DRILL-4979
 URL: https://issues.apache.org/jira/browse/DRILL-4979
 Project: Apache Drill
  Issue Type: New Feature
  Components:  Server
Affects Versions: 1.8.0
 Environment: Scheduling drillbits with Apache Mesos+Aurora
Reporter: Uwe L. Korn


Currently the dataport of a Drillbit is fixed at control port + 1. In a 
dynamic execution environment like Apache Mesos+Aurora, each port is allocated 
by the scheduler and then passed on to the application process. There is no 
way to guarantee the allocation of two consecutive ports. Therefore, to run 
Drill in this environment, the dataport of the drillbit also needs to be 
configurable by the scheduler.
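The requested behaviour amounts to a simple fallback rule (hypothetical helper, not Drill code; the scheduler would inject the explicit value through configuration or the environment):

```java
public class DrillbitPorts {
    // Resolve the data port: prefer an explicit setting (e.g. allocated by the
    // Mesos/Aurora scheduler), falling back to the legacy control-port+1
    // convention when nothing is configured.
    static int dataPort(Integer configured, int controlPort) {
        return configured != null ? configured : controlPort + 1;
    }

    public static void main(String[] args) {
        System.out.println(dataPort(null, 31011));  // legacy behaviour: 31012
        System.out.println(dataPort(31999, 31011)); // scheduler-allocated: 31999
    }
}
```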





[GitHub] drill pull request #600: DRILL-4373: Drill and Hive have incompatible timest...

2016-10-28 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/600#discussion_r85522267
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java
 ---
@@ -739,30 +741,76 @@ public void runTestAndValidate(String selection, 
String validationSelection, Str
   }
 
   /*
-  Test the reading of an int96 field. Impala encodes timestamps as int96 
fields
+Impala encodes timestamp values as int96 fields. Test the reading of 
an int96 field with two converters:
+the first one converts parquet INT96 into drill VARBINARY and the 
second one (works while
+store.parquet.reader.int96_as_timestamp option is enabled) converts 
parquet INT96 into drill TIMESTAMP.
*/
   @Test
   public void testImpalaParquetInt96() throws Exception {
 compareParquetReadersColumnar("field_impala_ts", 
"cp.`parquet/int96_impala_1.parquet`");
+try {
+  test("alter session set %s = true", 
ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP);
+  compareParquetReadersColumnar("field_impala_ts", 
"cp.`parquet/int96_impala_1.parquet`");
+} finally {
+  test("alter session reset %s", 
ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP);
+}
   }
 
   /*
-  Test the reading of a binary field where data is in dicationary _and_ 
non-dictionary encoded pages
+  Test the reading of a binary field as drill varbinary where data is in 
dicationary _and_ non-dictionary encoded pages
*/
   @Test
-  public void testImpalaParquetVarBinary_DictChange() throws Exception {
+  public void testImpalaParquetBinaryAsVarBinary_DictChange() throws 
Exception {
 compareParquetReadersColumnar("field_impala_ts", 
"cp.`parquet/int96_dict_change.parquet`");
   }
 
   /*
+  Test the reading of a binary field as drill timestamp where data is in 
dicationary _and_ non-dictionary encoded pages
+   */
+  @Test
+  public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws 
Exception {
+final String WORKING_PATH = TestTools.getWorkingPath();
+final String TEST_RES_PATH = WORKING_PATH + "/src/test/resources";
+try {
+  testBuilder()
+  .sqlQuery("select int96_ts from 
dfs_test.`%s/parquet/int96_dict_change`", TEST_RES_PATH)
+  .optionSettingQueriesForTestQuery(
+  "alter session set `%s` = true", 
ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP)
+  .ordered()
+  
.csvBaselineFile("testframework/testParquetReader/testInt96DictChange/q1.tsv")
+  .baselineTypes(TypeProtos.MinorType.TIMESTAMP)
+  .baselineColumns("int96_ts")
+  .build().run();
+} finally {
+  test("alter system reset `%s`", 
ExecConstants.PARQUET_READER_INT96_AS_TIMESTAMP);
+}
+  }
+
+  /*
  Test the conversion from int96 to impala timestamp
*/
   @Test
-  public void testImpalaParquetTimestampAsInt96() throws Exception {
+  public void testTimestampImpalaConvertFrom() throws Exception {
 compareParquetReadersColumnar("convert_from(field_impala_ts, 
'TIMESTAMP_IMPALA')", "cp.`parquet/int96_impala_1.parquet`");
   }
 
   /*
+ Test reading parquet Int96 as TimeStamp and comparing obtained values 
with the
+ old results (reading the same values as VarBinary and 
convert_fromTIMESTAMP_IMPALA function using)
+   */
+  @Test
+  public void testImpalaParquetTimestampInt96AsTimeStamp() throws 
Exception {
--- End diff --

This test compares the results between the new converter (Int96 to TimeStamp) 
and the old one (Int96 to VarBinary combined with the 
`convert_fromTIMESTAMP_IMPALA` function). 
The issue was in the `ConvertFromImpalaTimestamp` [link to the 
code](https://github.com/apache/drill/pull/600/commits/a45490af2dd663168220cc3bda62a2d79170db62#diff-5d8360c5e3cf7d2f6ac7bfe58b6d319aL57), 
because changing the timezone shouldn't affect the resulting timestamp values.
I removed the timezone consideration there, so now all tests pass successfully 
even across different timezones.
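For context, the Impala-style INT96 timestamp layout is 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day, little-endian. A standalone sketch of a timezone-free decode (assumed layout; plain Java, not the Drill implementation):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Timestamp {
    static final long JULIAN_DAY_OF_EPOCH = 2440588L; // Julian day of 1970-01-01

    // Decode an Impala-style INT96 timestamp (little-endian: 8 bytes
    // nanos-of-day, then 4 bytes Julian day) to UTC epoch millis with no
    // timezone adjustment, which is the behaviour argued for above.
    static long toEpochMillis(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt() & 0xFFFFFFFFL;
        return (julianDay - JULIAN_DAY_OF_EPOCH) * 86_400_000L + nanosOfDay / 1_000_000L;
    }

    public static void main(String[] args) {
        byte[] epoch = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(0L).putInt((int) JULIAN_DAY_OF_EPOCH).array();
        System.out.println(toEpochMillis(epoch)); // 0 == 1970-01-01T00:00:00Z
    }
}
```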


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: TO_TIMESTAMP function returns incorrect results

2016-10-28 Thread Khurram Faraaz
Thanks Serhii.

Can you please give me a working example of the usage of "s" for second
of minute and "S" for fraction of second?

I tried both of those symbols, however Drill 1.9.0 (commit: a29f1e29)
does not honor them when used from within the to_date function.

On Thu, Oct 27, 2016 at 6:31 PM, Serhii Harnyk 
wrote:

> Hello, Khurram
>
> http://joda-time.sourceforge.net/apidocs/org/joda/time/
> format/DateTimeFormat.html
>
> s   second of minute    number    55
> S   fraction of second  number    978
>
>
>
> 2016-10-27 13:54 GMT+03:00 Khurram Faraaz :
>
> > All,
> >
> > I am on Drill 1.9.0 git commit ID : a29f1e29 on CentOS
> >
> > TO_TIMESTAMP function does not return correct results, note that the
> > minutes, seconds and milliseconds parts of timestamp are incorrect in the
> > results
> >
> > {noformat}
> > 0: jdbc:drill:schema=dfs.tmp> VALUES(TO_TIMESTAMP('2015-03-30
> > 20:49:59.10 UTC', 'yyyy-MM-dd HH:mm:ss.s z'));
> > +------------------------+
> > |         EXPR$0         |
> > +------------------------+
> > | 2015-03-30 20:49:10.0  |
> > +------------------------+
> > 1 row selected (0.228 seconds)
> > {noformat}
> >
> > {noformat}
> > 0: jdbc:drill:schema=dfs.tmp> VALUES(CAST(TO_TIMESTAMP('2015-03-30
> > 20:49:59.10 UTC', 'yyyy-MM-dd HH:mm:ss.s z') AS TIMESTAMP));
> > +------------------------+
> > |         EXPR$0         |
> > +------------------------+
> > | 2015-03-30 20:49:10.0  |
> > +------------------------+
> > 1 row selected (0.265 seconds)
> > {noformat}
> >
> > This case returns correct results when the same string used above is
> > given as input to the CAST function; note that the minutes mm, seconds ss,
> > and millisecond s parts are honored
> >
> > {noformat}
> > 0: jdbc:drill:schema=dfs.tmp> VALUES(CAST('2015-03-30 20:49:59.10 UTC' AS
> > TIMESTAMP));
> > +------------------------+
> > |         EXPR$0         |
> > +------------------------+
> > | 2015-03-30 20:49:59.1  |
> > +------------------------+
> > 1 row selected (0.304 seconds)
> > {noformat}
> >
> > Thanks,
> > Khurram
> >
>


to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Khurram Faraaz
All,

The question is: why does it work for a parquet column but fail when a CSV
column is used?

Drill 1.9.0 commit : a29f1e29

This is a simple project of column from a csv file, works.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv` t1
limit 5;
+-------------+
|   EXPR$0    |
+-------------+
| 2011-11-04  |
| 1986-10-22  |
| 1992-09-10  |
| 2016-08-07  |
| 1986-01-25  |
+-------------+
5 rows selected (0.26 seconds)
{noformat}

Using TO_DATE function with columns[x] as first input fails, with an
IllegalArgumentException
{noformat}
0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd') FROM
`typeall_l.csv` t1 limit 5;
Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""

Fragment 0:0

[Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on centos-01.qa.lab:31010]
(state=,code=0)
{noformat}

However, interestingly same query over parquet column returns correct
results, on same data.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd') FROM
typeall_l limit 5;
+-------------+
|   EXPR$0    |
+-------------+
| 2011-01-04  |
| 1986-01-22  |
| 1992-01-10  |
| 2016-01-07  |
| 1986-01-25  |
+-------------+
5 rows selected (0.286 seconds)
{noformat}

When the date string is passed as first input, to_date function returns
correct results.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-dd')
from (values(1));
+-------------+
|   EXPR$0    |
+-------------+
| 2011-01-04  |
+-------------+
1 row selected (0.235 seconds)
{noformat}

Thanks,
Khurram
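A side note on the pattern itself: in Joda-Time (which Drill uses), lowercase 'mm' means minute-of-hour, not month, which is why the parquet results above all come back with month 01. A small java.time sketch of the same pattern letters (illustrative, not Drill code):

```java
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;

public class DatePatternPitfall {
    public static void main(String[] args) {
        // 'mm' is minute-of-hour, so "2011-11-04" parses as year 2011,
        // minute 11, day 4 -- no month is ever set, which is why a
        // month-defaulting parser returns January for every row.
        TemporalAccessor wrong =
                DateTimeFormatter.ofPattern("yyyy-mm-dd").parse("2011-11-04");
        System.out.println(wrong.isSupported(ChronoField.MONTH_OF_YEAR)); // false
        System.out.println(wrong.get(ChronoField.MINUTE_OF_HOUR));        // 11

        // 'MM' is month-of-year, the pattern letter to_date actually needs
        TemporalAccessor right =
                DateTimeFormatter.ofPattern("yyyy-MM-dd").parse("2011-11-04");
        System.out.println(right.get(ChronoField.MONTH_OF_YEAR));         // 11
    }
}
```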


Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Andries Engelbrecht
You should use yyyy-MM-dd.

I have not noticed the issue before with CSV.

--Andries




Re: TO_TIMESTAMP function returns incorrect results

2016-10-28 Thread Serhii Harnyk
Example of the usage with "s" for second of minute and "S" for fraction of
second:

VALUES(TO_TIMESTAMP('2015-03-30 20:49:59.10 UTC', 'yyyy-MM-dd HH:mm:ss.SSS
z'))
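The 's'-vs-'S' distinction can be demonstrated outside Drill; the letters mean the same in Joda-Time (which Drill uses) and in java.time (used here for the demo):

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class SecondVsFraction {
    public static void main(String[] args) {
        // 's' = second-of-minute, 'S' = fraction-of-second. With 'ss.SS'
        // the "59" binds to seconds and the "10" to the fraction, instead
        // of the fraction overwriting the seconds field.
        LocalTime t = LocalTime.parse("20:49:59.10",
                DateTimeFormatter.ofPattern("HH:mm:ss.SS"));
        System.out.println(t.getSecond()); // 59
        System.out.println(t.getNano());   // 100000000, i.e. .10 s
    }
}
```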



[jira] [Created] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-10-28 Thread Vitalii Diravka (JIRA)
Vitalii Diravka created DRILL-4980:
--

 Summary: Upgrading of the approach of parquet date correctness 
status detection
 Key: DRILL-4980
 URL: https://issues.apache.org/jira/browse/DRILL-4980
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.8.0
Reporter: Vitalii Diravka
Assignee: Vitalii Diravka
 Fix For: 1.9.0


This jira is an addition to 
[DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
The date correctness label for newly generated parquet files should be 
upgraded.





[GitHub] drill pull request #633: DRILL-4972: Set WorkManager.StatusThread's daemon f...

2016-10-28 Thread sudheeshkatkam
Github user sudheeshkatkam commented on a diff in the pull request:

https://github.com/apache/drill/pull/633#discussion_r85552294
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/work/WorkManager.java ---
@@ -295,7 +295,7 @@ public FragmentExecutor getFragmentRunner(final 
FragmentHandle handle) {
*/
   private class StatusThread extends Thread {
 public StatusThread() {
-  setDaemon(true);
+  setDaemon(false);
--- End diff --

.. but this will make sure that is the case.




Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I agree that it would be good if the approach to parquet date correctness
detection were improved, so I created DRILL-4980 for it.

But now we have two ideas:
1. To additionally check the Drill version, so that later we can delete
the isDateCorrect label from the parquet metadata.
2. To add a parquet writer version to the parquet metadata and check this
value instead of isDateCorrect and drillVersion.

So which way should we prefer now?

Kind regards
Vitalii
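Idea 2 (and Paul's writer-version proposal quoted below) boils down to a small decision rule. A sketch with hypothetical names and a hypothetical cutoff of 2, purely to make the logic concrete:

```java
public class ParquetDateCorrectness {
    // Hypothetical cutoff: first writer version that always writes standard dates
    static final int WRITER_VERSION_STD_DATES = 2;

    // Trust a numeric writer version in the file metadata instead of the
    // isDateCorrect/drillVersion pair. A null version models files written
    // before the version field existed.
    static boolean datesAreCorrect(Integer writerVersion, boolean writtenByDrill) {
        if (!writtenByDrill) return true;         // Hive/Impala etc. never had the bug
        if (writerVersion == null) return false;  // old Drill: needs the sanity check
        return writerVersion >= WRITER_VERSION_STD_DATES;
    }

    public static void main(String[] args) {
        System.out.println(datesAreCorrect(2, true));     // true
        System.out.println(datesAreCorrect(null, true));  // false
        System.out.println(datesAreCorrect(null, false)); // true
    }
}
```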

2016-10-27 23:54 GMT+00:00 Paul Rogers :

> FWIW: back on the magic flag issue…
>
> I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too
> coarse-grained for our needs.
>
> A typical solution is include the version of the Parquet writer in
> addition to that of Drill. Each time we change something in the writer,
> increment the version number. If we number changes, we can easily handle
> two changes in the same Drill release, or differentiate between the “early
> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>
> Since we have no version now, start it at some arbitrary point (2?).
>
> Now, if the Parquet file has a Drill Writer version in the header, and
> that version is 2 or greater, the date is in the “correct” format. Anything
> written by Drill before writer version 2, the date is wrong. The “check the
> data to see if it is sane” approach is needed only for files where we can’t
> tell if an older Drill wrote it.
>
> Do other tools label the data? Does Hive say that it wrote the file? If
> so, we don’t need to do the sanity check if we can tell the data comes from
> Hive (or Impala, or anything other than old Drill.)
>
> - Paul
>
> > On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
> >
> > Vitalii -- are you still planning to open a ticket and pull request for
> the
> > fix you've noted below?
> >
> > -- Zelaine
> >
> > On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> vitalii.dira...@gmail.com>
> > wrote:
> >
> >> @Paul Rogers
> >> It may be the undefined case when the file is generated with
> >> drill.version = 1.9-SNAPSHOT.
> >> It is easier to determine a corrupted date with this flag, and there is
> >> no need to wait for the end of the release to merge these changes.
> >>
> >> @Jinfeng NI
> >> It looks like you are right.
> >> With consistent mode (isDateCorrect = true) all tests are passed. So I
> am
> >> going to open a jira ticket for it with next changes
> >> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> >> 3410cac5d3
> >> Thanks.
> >>
> >> Kind regards
> >> Vitalii
> >>
> >> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
> >>
> >>> I'm not sure if I fully understand your answers. The bottom line is
> >>> quite simple: given a set of parquet files, the ParquetTableMeta
> >>> instance constructed in Drill should have identical value for
> >>> "isDateCorrect", whether it comes from parquet footer, or parquet
> >>> metadata cache, or whether there is partition pruning or not. However,
> >>> the code shows that this flag is not in consistent mode across
> >>> different cases.
> >>>
> >>>
> >>>
> >>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> >>>  wrote:
>  Hi Jinfeng,
> 
>  1.If the parquet files are generated with Drill after Drill-4203 these
>  files have "isDateCorrect = true" property.
>  Drill serializes this property from metadata now. When we set this
> >>> property
>  in the first constructor we will hide the value from metadata.
>  IsDateCorrect will be false only if this value equals false (no case
>  for that now) or is absent from the parquet metadata footer.
> 
> 
>  2. I'm not sure of the reason to change the isDateCorrect metadata
>  property when the user disables date correction.
>  If you have some use case it would be great if you provide it.
> 
>  3. Maybe you are right regarding when Parquet metadata is cloned.
>  Here I added the property in the same manner as Jason's new property
>  "drillVersion". So does it need a separate unit test?
> 
> 
>  Kind regards
>  Vitalii
> 
>  2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
> 
> > Forgot to copy the link to the code.
> >
> > [1] https://github.com/apache/drill/blob/master/exec/java-
> > exec/src/main/java/org/apache/drill/exec/store/parquet/
> > Metadata.java#L950-L955
> >
> > On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni  wrote:
> >> @Jason, @Vitalli,
> >>
> >> Any thoughts on this question, since both you worked on fix of
> > DRILL-4203?
> >>
> >> Looking through the code, there is a third case [1], where this flag
> >> is set to false when Parquet metadata is cloned (after partition
> >> pruning, etc).  That means, for the 2nd case where the flag is set
> >> to
> >> true, if there is pruning happening, the new parquet metadata will
> >> see
> >> the flag is flipped to false. This does not make sense.

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Veera Naranammalpuram
Do you have zero-length strings in your data? I have seen cases where the
system option to cast empty strings to NULL doesn't work as advertised. You
should re-open DRILL-3214.

When I run into this problem, I usually use a regex as a workaround. The
PROJECT takes a performance hit when you do this for larger data sets, but
it works.

$cat nulls.psv
date_col|string_col
|test
2016-10-28|test2
$ sqlline
apache drill 1.8.0
"a little sql for your nosql"
0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
+-------------+-------------+
|  date_col   | string_col  |
+-------------+-------------+
|             | test        |
| 2016-10-28  | test2       |
+-------------+-------------+
2 rows selected (0.303 seconds)
0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""

Fragment 0:0

[Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on se-node10.se.lab:31010]
(state=,code=0)
0: jdbc:drill:> select case when date_col similar to '[0-9]+%' then
to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted from
`nulls.psv`;
+---------------------+
| date_col_converted  |
+---------------------+
| null                |
| 2016-10-28          |
+---------------------+
2 rows selected (0.521 seconds)
0: jdbc:drill:> alter system set
`drill.exec.functions.cast_empty_string_to_null` = true;
+-------+----------------------------------------------------------+
|  ok   | summary                                                  |
+-------+----------------------------------------------------------+
| true  | drill.exec.functions.cast_empty_string_to_null updated.  |
+-------+----------------------------------------------------------+
1 row selected (0.304 seconds)
0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""

Fragment 0:0

[Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on se-node10.se.lab:31010]
(state=,code=0)
0: jdbc:drill:>

-Veera

On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz 
wrote:

> All,
>
> Question is - why does it work for a parquet column and fails when CSV
> column is used ?
>
> Drill 1.9.0 commit : a29f1e29
>
> This is a simple project of column from a csv file, works.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv` t1
> limit 5;
> +-+
> |   EXPR$0|
> +-+
> | 2011-11-04  |
> | 1986-10-22  |
> | 1992-09-10  |
> | 2016-08-07  |
> | 1986-01-25  |
> +-+
> 5 rows selected (0.26 seconds)
> {noformat}
>
> Using TO_DATE function with columns[x] as first input fails, with an
> IllegalArgumentException
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'-mm-dd') FROM
> `typeall_l.csv` t1 limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
>
> Fragment 0:0
>
> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on centos-01.qa.lab:31010]
> (state=,code=0)
> {noformat}
>
> However, interestingly same query over parquet column returns correct
> results, on same data.
>
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'-mm-dd') FROM
> typeall_l limit 5;
> +-+
> |   EXPR$0|
> +-+
> | 2011-01-04  |
> | 1986-01-22  |
> | 1992-01-10  |
> | 2016-01-07  |
> | 1986-01-25  |
> +-+
> 5 rows selected (0.286 seconds)
> {noformat}
>
> When the date string is passed as first input, to_date function returns
> correct results.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','-mm-dd')
> from (values(1));
> +-+
> |   EXPR$0|
> +-+
> | 2011-01-04  |
> +-+
> 1 row selected (0.235 seconds)
> {noformat}
>
> Thanks,
> Khurram
>



-- 
Veera Naranammalpuram
Product Specialist - SQL on Hadoop
*MapR Technologies (www.mapr.com )*
*(Email) vnaranammalpu...@maprtech.com *
*(Mobile) 917 683 8116 - can text *
*Timezone: ET (UTC -5:00 / -4:00)*


Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Andries Engelbrecht
Good catch on empty string Veera!

Wouldn't it be cheaper to check for an empty string?
case when columns[] ='' then null else to_date(columns[],'yyyy-MM-dd') end

I don't think the option to read empty csv columns (or an empty string in any
text reader) as null is in the reader yet, so we can't check whether columns[]
is null.


--Andries
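The same empty-string guard, sketched in plain Java (hypothetical helper, not Drill code) to show why it is cheaper than the regex: it is a single length check per row rather than a pattern match:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class EmptyStringGuard {
    static final DateTimeFormatter ISO_DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Same idea as the CASE WHEN above: treat '' as NULL before parsing,
    // instead of a per-row regex match.
    static LocalDate toDateOrNull(String s) {
        return (s == null || s.isEmpty()) ? null : LocalDate.parse(s, ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(toDateOrNull(""));           // null
        System.out.println(toDateOrNull("2016-10-28")); // 2016-10-28
    }
}
```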





Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Veera Naranammalpuram
That should work and be a lot faster too. Thanks for the pointer.

-Veera

On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> Good catch on empty string Veera!
>
> Wouldn't it be cheaper to check for an empty string?
> case when columns[] ='' then null else to_date(columns[],'-MM-dd') end
>
> I don't think the option to read csv empty columns (or empty string in any
> text reader) as null is in the reader yet. So we can't check with columns[]
> is null.
>
>
> --Andries
>
>
> > On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> vnaranammalpu...@maprtech.com> wrote:
> >
> > Do you have zero length strings in your data? I have seen cases where the
> > system option to cast empty strings to NULL doesn't work as advertised.
> You
> > should re-open DRILL-3214.
> >
> > When I run into this problem, I usually use a regex to workaround. The
> > PROJECT takes a performance hit when you do this for larger data sets but
> > it works.
> >
> > $cat nulls.psv
> > date_col|string_col
> > |test
> > 2016-10-28|test2
> > $ sqlline
> > apache drill 1.8.0
> > "a little sql for your nosql"
> > 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> > +-+-+
> > |  date_col   | string_col  |
> > +-+-+
> > | | test|
> > | 2016-10-28  | test2   |
> > +-+-+
> > 2 rows selected (0.303 seconds)
> > 0: jdbc:drill:> select to_date(date_col,'-mm-dd') from `nulls.psv`;
> > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >
> > Fragment 0:0
> >
> > [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> se-node10.se.lab:31010]
> > (state=,code=0)
> > 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> > to_date(date_col,'-MM-dd') else null end as date_col_converted from
> > `nulls.psv`;
> > +---------------------+
> > | date_col_converted  |
> > +---------------------+
> > | null                |
> > | 2016-10-28          |
> > +---------------------+
> > 2 rows selected (0.521 seconds)
> > 0: jdbc:drill:> alter system set
> > `drill.exec.functions.cast_empty_string_to_null` = true;
> > +---+--+
> > |  ok   | summary  |
> > +---+--+
> > | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> > +---+--+
> > 1 row selected (0.304 seconds)
> > 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
> > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >
> > Fragment 0:0
> >
> > [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> se-node10.se.lab:31010]
> > (state=,code=0)
> > 0: jdbc:drill:>
> >
> > -Veera
> >
> > On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz 
> > wrote:
> >
> >> All,
> >>
> >> The question is: why does it work for a parquet column but fail when a CSV
> >> column is used?
> >>
> >> Drill 1.9.0 commit : a29f1e29
> >>
> >> This is a simple project of column from a csv file, works.
> >> {noformat}
> >> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv` t1
> >> limit 5;
> >> +-------------+
> >> |   EXPR$0    |
> >> +-------------+
> >> | 2011-11-04  |
> >> | 1986-10-22  |
> >> | 1992-09-10  |
> >> | 2016-08-07  |
> >> | 1986-01-25  |
> >> +-------------+
> >> 5 rows selected (0.26 seconds)
> >> {noformat}
> >>
> >> Using TO_DATE function with columns[x] as first input fails, with an
> >> IllegalArgumentException
> >> {noformat}
> >> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd')
> FROM
> >> `typeall_l.csv` t1 limit 5;
> >> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> >>
> >> Fragment 0:0
> >>
> >> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> centos-01.qa.lab:31010]
> >> (state=,code=0)
> >> {noformat}
> >>
> >> However, interestingly, the same query over a parquet column returns correct
> >> results on the same data.
> >>
> >> {noformat}
> >> 0: jdbc:drill:schema=dfs.tmp> select to_date(col_dt,'yyyy-mm-dd') FROM
> >> typeall_l limit 5;
> >> +-------------+
> >> |   EXPR$0    |
> >> +-------------+
> >> | 2011-01-04  |
> >> | 1986-01-22  |
> >> | 1992-01-10  |
> >> | 2016-01-07  |
> >> | 1986-01-25  |
> >> +-------------+
> >> 5 rows selected (0.286 seconds)
> >> {noformat}
> >>
> >> When the date string is passed as the first input, the to_date function
> >> returns correct results.
> >> {noformat}
> >> 0: jdbc:drill:schema=dfs.tmp> select to_date('2011-01-04','yyyy-mm-dd')
> >> from (values(1));
> >> +-------------+
> >> |   EXPR$0    |
> >> +-------------+
> >> | 2011-01-04  |
> >> +-------------+
> >> 1 row selected (0.235 seconds)
> >> {noformat}
> >>
> >> Thanks,
> >> Khurram
> >>
> >
> >
> >
> > --
> > Veera Naranammalpuram
> > Product Specialist - SQL on Hadoop
> > *MapR Technologies (www.mapr.com 

[GitHub] drill pull request #633: DRILL-4972: Set WorkManager.StatusThread's daemon f...

2016-10-28 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/633#discussion_r85557510
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/work/WorkManager.java ---
@@ -295,7 +295,7 @@ public FragmentExecutor getFragmentRunner(final 
FragmentHandle handle) {
*/
   private class StatusThread extends Thread {
 public StatusThread() {
-  setDaemon(true);
+  setDaemon(false);
--- End diff --

Not sure why simply removing the call would not achieve the same result. I think 
it's redundant, like using a setter to override a default that the constructor 
already sets to the same value.
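
The redundancy question above turns on how Java threads pick up their daemon flag: a `Thread` inherits it from the thread that creates it, so an explicit `setDaemon(false)` is only a no-op when the creating thread is itself non-daemon. A minimal, self-contained sketch (illustrative, not Drill code):

```java
public class DaemonDefaultDemo {
    public static void main(String[] args) throws Exception {
        // A new Thread inherits its daemon flag from the creating thread.
        // main() runs on a non-daemon thread, so this is false by default:
        Thread plain = new Thread(() -> {});
        System.out.println(plain.isDaemon());  // false

        // ...but a thread created *from* a daemon thread starts as a daemon,
        // which is the only case where setDaemon(false) actually changes anything.
        Thread[] child = new Thread[1];
        Thread daemonParent = new Thread(() -> child[0] = new Thread(() -> {}));
        daemonParent.setDaemon(true);
        daemonParent.start();
        daemonParent.join();
        System.out.println(child[0].isDaemon());  // true
    }
}
```

If the StatusThread is always created from a non-daemon thread, the explicit call is indeed redundant, which appears to be the point being made.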


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Khurram Faraaz
Thanks Andries and Veera.

1. Yes, my CSV file does have empty strings in some rows in columns[4].
2. It worked for parquet because I had used the case expression to cast
empty strings to NULL.
3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and TO_DATE returned results
with both representations.

Question - Shouldn't Drill handle such empty strings within rows
in CSV files? Why should the user have to take care of such cases?

Regards,
Khurram
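
Until the text reader can surface empty CSV columns as NULL, the guard has to live in query or application logic. A small illustrative sketch of the same CASE-style guard, written in plain Java with `java.time` (the class and method names here are mine, not Drill's):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ToDateGuard {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Mirrors the SQL workaround:
    //   CASE WHEN col = '' THEN NULL ELSE TO_DATE(col, 'yyyy-MM-dd') END
    static LocalDate toDateOrNull(String col) {
        return (col == null || col.isEmpty()) ? null : LocalDate.parse(col, FMT);
    }

    public static void main(String[] args) {
        System.out.println(toDateOrNull(""));            // null
        System.out.println(toDateOrNull("2016-10-28"));  // 2016-10-28
    }
}
```

The empty-string check must run before the parse, exactly as in the CASE expression; parsing first is what raises the "Invalid format: \"\"" error.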

On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
vnaranammalpu...@maprtech.com> wrote:

> That should work and a lot faster too. Thanks for the pointer.
>
> -Veera
>
> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> aengelbre...@maprtech.com> wrote:
>
> > Good catch on empty string Veera!
> >
> > Wouldn't it be cheaper to check for an empty string?
> > case when columns[] = '' then null else to_date(columns[],'yyyy-MM-dd')
> end
> >
> > I don't think the option to read csv empty columns (or empty string in
> any
> > text reader) as null is in the reader yet. So we can't check with
> columns[]
> > is null.
> >
> >
> > --Andries
> >
> >
> > > On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> > vnaranammalpu...@maprtech.com> wrote:
> > >
> > > Do you have zero length strings in your data? I have seen cases where
> the
> > > system option to cast empty strings to NULL doesn't work as advertised.
> > You
> > > should re-open DRILL-3214.
> > >
> > > When I run into this problem, I usually use a regex to workaround. The
> > > PROJECT takes a performance hit when you do this for larger data sets
> but
> > > it works.
> > >
> > > $cat nulls.psv
> > > date_col|string_col
> > > |test
> > > 2016-10-28|test2
> > > $ sqlline
> > > apache drill 1.8.0
> > > "a little sql for your nosql"
> > > 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
> > > +-+-+
> > > |  date_col   | string_col  |
> > > +-+-+
> > > | | test|
> > > | 2016-10-28  | test2   |
> > > +-+-+
> > > 2 rows selected (0.303 seconds)
> > > 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> > se-node10.se.lab:31010]
> > > (state=,code=0)
> > > 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
> > > to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted from
> > > `nulls.psv`;
> > > +-+
> > > | date_col_converted  |
> > > +-+
> > > | null|
> > > | 2016-10-28  |
> > > +-+
> > > 2 rows selected (0.521 seconds)
> > > 0: jdbc:drill:> alter system set
> > > `drill.exec.functions.cast_empty_string_to_null` = true;
> > > +---+--+
> > > |  ok   | summary  |
> > > +---+--+
> > > | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
> > > +---+--+
> > > 1 row selected (0.304 seconds)
> > > 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> `nulls.psv`;
> > > Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> > se-node10.se.lab:31010]
> > > (state=,code=0)
> > > 0: jdbc:drill:>
> > >
> > > -Veera
> > >
> > > On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz 
> > > wrote:
> > >
> > >> All,
> > >>
> > >> Question is - why does it work for a parquet column and fails when CSV
> > >> column is used ?
> > >>
> > >> Drill 1.9.0 commit : a29f1e29
> > >>
> > >> This is a simple project of column from a csv file, works.
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv`
> t1
> > >> limit 5;
> > >> +-+
> > >> |   EXPR$0|
> > >> +-+
> > >> | 2011-11-04  |
> > >> | 1986-10-22  |
> > >> | 1992-09-10  |
> > >> | 2016-08-07  |
> > >> | 1986-01-25  |
> > >> +-+
> > >> 5 rows selected (0.26 seconds)
> > >> {noformat}
> > >>
> > >> Using TO_DATE function with columns[x] as first input fails, with an
> > >> IllegalArgumentException
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd')
> > FROM
> > >> `typeall_l.csv` t1 limit 5;
> > >> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> > >>
> > >> Fragment 0:0
> > >>
> > >> [Error Id: 9cff3eb9-4045-4d9a-a6a1-1eadaa597f30 on
> > centos-01.qa.lab:31010]
> > >> (state=,code=0)
> > >> {noformat}
> > >>
> > >> However, interestingly same query over parquet column returns correct
> > >> results, on same data.
> > >>
> > >> {noformat}
> > >> 0: jdbc:drill:schema=dfs.tmp> select to_

Re: TO_TIMESTAMP function returns in-correct results

2016-10-28 Thread Khurram Faraaz
Works, thanks!

0: jdbc:drill:schema=dfs.tmp> VALUES(TO_TIMESTAMP('2015-03-30 20:49:59.10
UTC', 'yyyy-MM-dd HH:mm:ss.SSS z'));
+------------------------+
|         EXPR$0         |
+------------------------+
| 2015-03-30 20:49:59.1  |
+------------------------+
1 row selected (0.245 seconds)

On Fri, Oct 28, 2016 at 7:36 PM, Serhii Harnyk 
wrote:

> Example of the usage with "s" for second of minute and "S" for fraction of
> second:
>
> VALUES(TO_TIMESTAMP('2015-03-30 20:49:59.10 UTC', 'yyyy-MM-dd HH:mm:ss.SSS
> z'))
>
> 2016-10-28 16:17 GMT+03:00 Khurram Faraaz :
>
> > Thanks Serhii.
> >
> > Can you please give me a working example of the usage with "s" for second
> > of minute and "S" for fraction of second.
> >
> > I tried with both those symbols, however Drill 1.9.0 (commit: a29f1e29)
> > does not honor those symbols when used from within the to_date function.
> >
> > On Thu, Oct 27, 2016 at 6:31 PM, Serhii Harnyk 
> > wrote:
> >
> > > Hello, Khurram
> > >
> > > http://joda-time.sourceforge.net/apidocs/org/joda/time/
> > > format/DateTimeFormat.html
> > >
> > > s   second of minute    number    55
> > > S   fraction of second  number    978
> > >
> > >
> > >
> > > 2016-10-27 13:54 GMT+03:00 Khurram Faraaz :
> > >
> > > > All,
> > > >
> > > > I am on Drill 1.9.0 git commit ID : a29f1e29 on CentOS
> > > >
> > > > TO_TIMESTAMP function does not return correct results, note that the
> > > > minutes, seconds and milliseconds parts of timestamp are incorrect in
> > the
> > > > results
> > > >
> > > > {noformat}
> > > > 0: jdbc:drill:schema=dfs.tmp> VALUES(TO_TIMESTAMP('2015-03-30
> > > 20:49:59.10
> > > > UTC', 'yyyy-MM-dd HH:mm:ss.s z'));
> > > > ++
> > > > | EXPR$0 |
> > > > ++
> > > > | 2015-03-30 20:49:10.0  |
> > > > ++
> > > > 1 row selected (0.228 seconds)
> > > > {noformat}
> > > >
> > > > {noformat}
> > > > 0: jdbc:drill:schema=dfs.tmp> VALUES(CAST(TO_TIMESTAMP('2015-03-30
> > > > 20:49:59.10 UTC', 'yyyy-MM-dd HH:mm:ss.s z') AS TIMESTAMP));
> > > > ++
> > > > | EXPR$0 |
> > > > ++
> > > > | 2015-03-30 20:49:10.0  |
> > > > ++
> > > > 1 row selected (0.265 seconds)
> > > > {noformat}
> > > >
> > > > This case returns correct results when the same string used above is
> > > > given as input to the CAST function; note that the minutes (mm), seconds
> > > > (ss), and millisecond (s) parts are honored.
> > > >
> > > > {noformat}
> > > > 0: jdbc:drill:schema=dfs.tmp> VALUES(CAST('2015-03-30 20:49:59.10
> UTC'
> > AS
> > > > TIMESTAMP));
> > > > ++
> > > > | EXPR$0 |
> > > > ++
> > > > | 2015-03-30 20:49:59.1  |
> > > > ++
> > > > 1 row selected (0.304 seconds)
> > > > {noformat}
> > > >
> > > > Thanks,
> > > > Khurram
> > > >
> > >
> >
>
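
For reference, the distinction between the two pattern symbols can be reproduced with the JDK's `java.time` formatter, which uses the same `s` (second-of-minute) vs `S` (fraction-of-second) convention as the Joda-Time library linked above; a small sketch (zone handling omitted to keep it self-contained):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SecondVsFraction {
    // 'SS' = fraction-of-second, so the ".10" in the input survives as 100 ms.
    static String parse(String ts) {
        return LocalDateTime.parse(ts,
                DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SS")).toString();
    }

    public static void main(String[] args) {
        String parsed = parse("2015-03-30 20:49:59.10");
        System.out.println(parsed);  // 2015-03-30T20:49:59.100

        // Formatting the same value makes the two symbols' fields explicit:
        LocalDateTime t = LocalDateTime.parse(parsed);
        System.out.println(t.format(DateTimeFormatter.ofPattern("ss")));   // 59  (second-of-minute)
        System.out.println(t.format(DateTimeFormatter.ofPattern("SSS")));  // 100 (fraction-of-second)
    }
}
```

Using `s` in the fraction position tells the parser that digit run is a second-of-minute value, which is why the original query collapsed `59.10` into seconds `:10`.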


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
Thanks Vitalii.

The Parquet Writer solution “just works”. As soon as someone upgrades the 
writer, files are labeled as having that new version. No fuzziness during a 
release as in 1.9.

It is fine to also include the Drill version. But, format decisions should be 
keyed off of the writer version.

By the way, do other tools happen to already do this? It would be rather 
surprising if they didn’t.

- Paul
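
The writer-version gate described above could look roughly like this; the method name, the `createdBy` handling, and the version cut-off of 2 are all illustrative assumptions for the sketch, not Drill's actual metadata API:

```java
// Hypothetical sketch of gating date interpretation on a Drill parquet-writer
// version stamped into the file footer, instead of a boolean isDateCorrect flag.
public class WriterVersionGate {
    enum DateStatus { CORRECT, CORRUPT, CHECK_DATA }

    static DateStatus classify(Integer drillWriterVersion, String createdBy) {
        if (createdBy == null) {
            return DateStatus.CHECK_DATA;       // unknown writer: fall back to sanity-checking values
        }
        if (!createdBy.startsWith("drill")) {
            return DateStatus.CORRECT;          // e.g. Hive/Impala files: no Drill date bug
        }
        if (drillWriterVersion != null && drillWriterVersion >= 2) {
            return DateStatus.CORRECT;          // stamped by a fixed writer
        }
        return DateStatus.CORRUPT;              // Drill file predating the writer-version stamp
    }

    public static void main(String[] args) {
        System.out.println(classify(2, "drill"));     // CORRECT
        System.out.println(classify(null, "drill"));  // CORRUPT
        System.out.println(classify(null, null));     // CHECK_DATA
    }
}
```

The appeal of this shape is that the expensive "check the data" path is confined to files whose writer cannot be identified at all.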

> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka  
> wrote:
> 
> I agree that it would be good if the approach to parquet date correctness
> detection were improved, so I created DRILL-4980 for it.
> 
> But now we have two ideas:
> 1. To add checking of the drill version additionally, so later we can
> delete isDateCorrect label from parquet metadata.
> 2. To add parquet writer version to the parquet metadata and check this
> value instead of isDateCorrect and drillVersion.
> 
> So which way should we prefer now?
> 
> Kind regards
> Vitalii
> 
> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
> 
>> FWIW: back on the magic flag issue…
>> 
>> I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too
>> coarse-grained for our needs.
>> 
>> A typical solution is to include the version of the Parquet writer in
>> addition to that of Drill. Each time we change something in the writer,
>> increment the version number. If we number changes, we can easily handle
>> two changes in the same Drill release, or differentiate between the “early
>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>> 
>> Since we have no version now, start it at some arbitrary point (2?).
>> 
>> Now, if the Parquet file has a Drill Writer version in the header, and
>> that version is 2 or greater, the date is in the “correct” format. Anything
>> written by Drill before writer version 2, the date is wrong. The “check the
>> data to see if it is sane” approach is needed only for files where we can’t
>> tell if an older Drill wrote it.
>> 
>> Do other tools label the data? Does Hive say that it wrote the file? If
>> so, we don’t need to do the sanity check if we can tell the data comes from
>> Hive (or Impala, or anything other than old Drill.)
>> 
>> - Paul
>> 
>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:
>>> 
>>> Vitalii -- are you still planning to open a ticket and pull request for
>> the
>>> fix you've noted below?
>>> 
>>> -- Zelaine
>>> 
>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>> vitalii.dira...@gmail.com>
>>> wrote:
>>> 
 @Paul Rogers
 It may be the undefined case when the file is generated with
>> drill.version
 = 1.9-SNAPSHOT.
 It is easier to determine a corrupted date with this flag, and there is no
 need to wait for the end of the release to merge these changes.
 
 @Jinfeng NI
 It looks like you are right.
 With consistent mode (isDateCorrect = true) all tests pass. So I
>> am
 going to open a JIRA ticket for it with the following changes:
 https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
 3410cac5d3
 Thanks.
 
 Kind regards
 Vitalii
 
 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
 
> I'm not sure if I fully understand your answers. The bottom line is
> quite simple: given a set of parquet files, the ParquetTableMeta
> instance constructed in Drill should have identical value for
> "isDateCorrect", whether it comes from parquet footer, or parquet
> metadata cache, or whether there is partition pruning or not. However,
> the code shows that this flag is not in consistent mode across
> different cases.
> 
> 
> 
> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>  wrote:
>> Hi Jinfeng,
>> 
>> 1. If the parquet files are generated with Drill after DRILL-4203, these
>> files have the "isDateCorrect = true" property.
>> Drill serializes this property from metadata now. When we set this
> property
>> in the first constructor we will hide the value from metadata.
>> isDateCorrect will be false only if this value equals false (no
>> case for it now) or is absent from the parquet metadata footer.
>> 
>> 
>> 2. I'm not sure of the reason to change the isDateCorrect metadata property
>> when the user disables date correction.
>> If you have a use case, it would be great if you could provide it.
>> 
>> 3. Maybe you are right regarding when Parquet metadata is cloned.
>> Here I added the property in the same manner as Jason's new property
>> "drillVersion". Does it need a separate unit test?
>> 
>> 
>> Kind regards
>> Vitalii
>> 
>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
>> 
>>> Forgot to copy the link to the code.
>>> 
>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>> Metadata.java#L950-L955
>>> 
>>> On T

[GitHub] drill issue #628: DRILL-4964: Drill fails to connect to hive metastore after...

2016-10-28 Thread jinfengni
Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/628
  
+1 

LGTM.





[GitHub] drill pull request #628: DRILL-4964: Drill fails to connect to hive metastor...

2016-10-28 Thread sohami
Github user sohami closed the pull request at:

https://github.com/apache/drill/pull/628




[GitHub] drill issue #628: DRILL-4964: Drill fails to connect to hive metastore after...

2016-10-28 Thread sohami
Github user sohami commented on the issue:

https://github.com/apache/drill/pull/628
  
@jinfengni - Thanks for review.




Time for a 1.9 Release?

2016-10-28 Thread Sudheesh Katkam
Hi Drillers,

We have a reasonable number of fixes and features since the last release
[1]. Releasing itself takes a while; so I propose we start the 1.9 release
process.

I volunteer as the release manager, unless there are objections.

We should also discuss what the release version number should be after 1.9.

Thank you,
Sudheesh

[1] https://issues.apache.org/jira/browse/DRILL/fixforversion/12337861


Re: ZK lost connectivity issue on large cluster

2016-10-28 Thread Padma Penumarthy
Hi Francois,

Thank you for the picture and the info you provided.
We will keep you updated and let you know when we make changes in future 
release.

Thanks,
Padma


> On Oct 26, 2016, at 6:06 PM, François Méthot  wrote:
> 
> Hi,
> 
> Sorry it took so long, lost the origin picture, had to go to a ms paint
> training
> here we go:
> 
> https://github.com/fmethot/imagine/blob/master/affinity_factor.png
> 
> 
> 
> On Thu, Oct 20, 2016 at 12:57 PM, Sudheesh Katkam 
> wrote:
> 
>> The mailing list does not seem to allow for images. Can you put the image
>> elsewhere (Github or Dropbox), and reply with a link to it?
>> 
>> - Sudheesh
>> 
>>> On Oct 19, 2016, at 5:37 PM, François Méthot 
>> wrote:
>>> 
>>> We had problem on the 220 nodes cluster. No problem on the 12 nodes
>> cluster.
>>> 
>>> I agree that the data may not be distributed evenly. It would be a long
>> and tedious process for me to produce a report.
>>> 
>>> Here is a drawing of the fragments overview before and after the
>> change of the affinity factor on a sample query run on the 220-node
>> cluster. max_width_per_node=8 on both, but it turned out to be irrelevant
>> to the issue.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Before: SYSTEM ERROR: ForemanException: One more more nodes lost
>> connectivity during query.  Identified nodes were [server121:31010].
>>> 
>>> After: error is gone
>>> 
>>> Before: low disk I/O, high network I/O on the bottom part of the graph
>>> After: high disk I/O, low network I/O on the bottom part of the graph
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, Oct 18, 2016 at 12:58 AM, Padma Penumarthy <
>> ppenumar...@maprtech.com > wrote:
>>> Hi Francois,
>>> 
>>> It would be good to understand how increasing affinity_factor helped in
>> your case
>>> so we can better document and also use that knowledge to improve things
>> in future release.
>>> 
>>> If you have two clusters,  it is not clear whether you had the problem
>> on 12 node cluster
>>> or 220 node cluster or both. Is the dataset same on both ? Is
>> max_width_per_node=8 in both clusters ?
>>> 
>>> Increasing affinity factor will lower remote reads  by scheduling more
>> fragments/doing more work
>>> on nodes which have data available locally.  So, there seem to be some
>> kind of non uniform
>>> data distribution for sure. It would be good if you can provide more
>> details i.e. how the data is
>>> distributed in the cluster and how the load on the nodes changed when
>> affinity factor was increased.
>>> 
>>> Thanks,
>>> Padma
>>> 
>>> 
 On Oct 14, 2016, at 6:45 PM, François Méthot > > wrote:
 
 We have a 12-node cluster and a 220-node cluster, but they do not
>> talk
 to each other. So Padma's analysis does not apply, but thanks for your
 comments. Our goal had been to run Drill on the 220-node cluster
>> after it
 proved worthy on the small cluster.
 
 planner.width.max_per_node was eventually reduced to 2 when we were
>> trying
 to figure this out, it would still fail. After we figured out the
 affinity_factor, we put it back to its original value and it would work
 fine.
 
 
 
 Sudheesh: Indeed, The Zk/drill services use the same network on our
>> bigger
 cluster.
 
 potential improvements:
 - planner.affinity_factor should be better documented.
 - When ZK disconnected, the running queries systematically failed.
>> When we
 disabled the ForemanException thrown in the QueryManager.drillbitUnregistered
 method, most of our queries started to run
>> successfully,
 though we would sometimes get a Drillbit Disconnected error within the RPC work
>> bus.
 It did confirm that we still had something going on in our network,
>> but it
 also showed that the RPC bus between drillbits was more resilient to
 network hiccups. I could not prove it, but I think that under certain
>> conditions,
 the ZK session gets recreated, which causes a QueryManager unregister
 (query failure) and a register call right after, while the RPC
 bus would remain connected.
 
 
 We really appreciate your feedback and we hope to contribute to this
>> great
 project in the future.
 Thanks
 Francois
 
 
 
 
 
 
 On Fri, Oct 14, 2016 at 3:00 PM, Padma Penumarthy <
>> ppenumar...@maprtech.com >
 wrote:
 
> 
> Seems like you have 215 nodes, but the data for your query is there on
> only 12 nodes.
> Drill tries to distribute the scan fragments across the cluster more
> uniformly (trying to utilize all CPU resources).
> That is why you have a lot of remote reads going on, and increasing
>> affinity
> factor eliminates running scan
> fragments on the other (215-12) nodes.
> 
> you also mentioned planner.width.max_per_node is set to 8.
> So, with increased affinity factor,  you have 8 scan fragments doing
>> a lot
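
Abstractly, the effect described in this thread can be modeled as a weight that the affinity factor puts on data-local nodes during fragment assignment. A toy sketch follows; it is not Drill's planner, and the greedy weight-decay scheme and all names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AffinitySketch {
    // Toy model: each node starts with weight 1 (pure load-balancing), plus
    // affinityFactor times the share of the scan's data it holds locally.
    // Raising affinityFactor pushes scan fragments onto data-local nodes.
    static List<String> assignFragments(Map<String, Double> localShare,
                                        List<String> allNodes,
                                        double affinityFactor,
                                        int fragments) {
        Map<String, Double> weight = new HashMap<>();
        for (String n : allNodes) {
            weight.put(n, 1.0 + affinityFactor * localShare.getOrDefault(n, 0.0));
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < fragments; i++) {
            // greedy: pick the heaviest node, then decay it so work spreads
            String best = Collections.max(weight.entrySet(), Map.Entry.comparingByValue()).getKey();
            out.add(best);
            weight.put(best, weight.get(best) / 2);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> share = Map.of("node1", 0.5, "node2", 0.5); // data on 2 of 4 nodes
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // low affinity: fragments spill onto data-remote nodes (remote reads)
        System.out.println(assignFragments(share, nodes, 1.0, 4));
        // high affinity: fragments stay pinned to the data-local nodes
        System.out.println(assignFragments(share, nodes, 100.0, 4));
    }
}
```

In this toy model a large affinity factor keeps all four fragments on the two data-holding nodes, mirroring the before/after disk-vs-network I/O shift Francois observed.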

[GitHub] drill issue #633: DRILL-4972: Remove setDaemon(true) call in WorkManager.Sta...

2016-10-28 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/633
  
+1. LGTM




[GitHub] drill issue #629: DRILL-4967: Adding template_name to source code generated ...

2016-10-28 Thread amansinha100
Github user amansinha100 commented on the issue:

https://github.com/apache/drill/pull/629
  
+1




Re: Time for a 1.9 Release?

2016-10-28 Thread Aman Sinha
+1

On Fri, Oct 28, 2016 at 10:34 AM, Sudheesh Katkam 
wrote:

> Hi Drillers,
>
> We have a reasonable number of fixes and features since the last release
> [1]. Releasing itself takes a while; so I propose we start the 1.9 release
> process.
>
> I volunteer as the release manager, unless there are objections.
>
> We should also discuss what the release version number should be after 1.9.
>
> Thank you,
> Sudheesh
>
> [1] https://issues.apache.org/jira/browse/DRILL/fixforversion/12337861
>


[GitHub] drill pull request #635: DRILL-4927 (part 2): Add support for Null Equality ...

2016-10-28 Thread KulykRoman
GitHub user KulykRoman opened a pull request:

https://github.com/apache/drill/pull/635

DRILL-4927 (part 2): Add support for Null Equality Joins (mixed compa…

…rators)

These changes are a subset of the original pull request from DRILL-4539 
(PR-462).
- Added changes to support mixed comparators;
- Added tests for it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/KulykRoman/drill DRILL-4927

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/635.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #635


commit 47abbd9aa21443d76bb58b001d3f8a2e36511a7f
Author: Roman Kulyk 
Date:   2016-10-28T13:26:53Z

DRILL-4927 (part 2): Add support for Null Equality Joins (mixed comparators)

These changes are a subset of the original pull request from DRILL-4539 
(PR-462).
- Added changes to support mixed comparators;
- Added tests for it.






[jira] [Resolved] (DRILL-4968) Add column size information to ColumnMetadata

2016-10-28 Thread Sudheesh Katkam (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheesh Katkam resolved DRILL-4968.

   Resolution: Fixed
Fix Version/s: 1.9.0

Fixed in 
[c6dbe6a|https://github.com/apache/drill/commit/c6dbe6a2f7033114d6239a4850a9b5092e684589].

> Add column size information to ColumnMetadata
> -
>
> Key: DRILL-4968
> URL: https://issues.apache.org/jira/browse/DRILL-4968
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Laurent Goujon
>Assignee: Laurent Goujon
> Fix For: 1.9.0
>
>
> Both ODBC and JDBC need column size information for the column metadata. 
> Instead of duplicating the logic between C++ and Java (and having to keep 
> them in sync), column size should be computed on the server so that the value is 
> kept consistent across clients.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Time for a 1.9 Release?

2016-10-28 Thread Jinfeng Ni
+1

I'm working on DRILL-1950 to support parquet row group level filter
pruning. I plan to submit a pull request for code review in 1-2 days,
hopefully.



On Fri, Oct 28, 2016 at 11:04 AM, Aman Sinha  wrote:
> +1
>
> On Fri, Oct 28, 2016 at 10:34 AM, Sudheesh Katkam 
> wrote:
>
>> Hi Drillers,
>>
>> We have a reasonable number of fixes and features since the last release
>> [1]. Releasing itself takes a while; so I propose we start the 1.9 release
>> process.
>>
>> I volunteer as the release manager, unless there are objections.
>>
>> We should also discuss what the release version number should be after 1.9.
>>
>> Thank you,
>> Sudheesh
>>
>> [1] https://issues.apache.org/jira/browse/DRILL/fixforversion/12337861
>>


Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Andries Engelbrecht
You want to use MM for month and not mm for minute, as mm can produce the wrong 
result.

Probably best to file an enhancement JIRA to have the function handle empty 
fields and produce a null value. Then the wider audience can review the merits 
of implementing it.

--Andries


> On Oct 28, 2016, at 9:09 AM, Khurram Faraaz  wrote:
> 
> Thanks Andries and Veera.
> 
> 1. Yes, my CSV file does have empty strings in some rows in columns[4].
> 2. It worked for parquet because I had used the case expression to cast
> empty strings to NULL.
> 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and TO_DATE returned results
> with both representations.
> 
> Question - Shouldn't Drill handle such empty strings within rows
> in CSV files? Why should the user have to take care of such cases?
> 
> Regards,
> Khurram
> 
> On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
> vnaranammalpu...@maprtech.com> wrote:
> 
>> That should work and a lot faster too. Thanks for the pointer.
>> 
>> -Veera
>> 
>> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
>> aengelbre...@maprtech.com> wrote:
>> 
>>> Good catch on empty string Veera!
>>> 
>>> Wouldn't it be cheaper to check for an empty string?
>>> case when columns[] = '' then null else to_date(columns[],'yyyy-MM-dd')
>> end
>>> 
>>> I don't think the option to read csv empty columns (or empty string in
>> any
>>> text reader) as null is in the reader yet. So we can't check with
>> columns[]
>>> is null.
>>> 
>>> 
>>> --Andries
>>> 
>>> 
 On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
>>> vnaranammalpu...@maprtech.com> wrote:
 
 Do you have zero length strings in your data? I have seen cases where
>> the
 system option to cast empty strings to NULL doesn't work as advertised.
>>> You
 should re-open DRILL-3214.
 
 When I run into this problem, I usually use a regex to workaround. The
 PROJECT takes a performance hit when you do this for larger data sets
>> but
 it works.
 
 $cat nulls.psv
 date_col|string_col
 |test
 2016-10-28|test2
 $ sqlline
 apache drill 1.8.0
 "a little sql for your nosql"
 0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
 +-+-+
 |  date_col   | string_col  |
 +-+-+
 | | test|
 | 2016-10-28  | test2   |
 +-+-+
 2 rows selected (0.303 seconds)
 0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from `nulls.psv`;
 Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
 
 Fragment 0:0
 
 [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
>>> se-node10.se.lab:31010]
 (state=,code=0)
 0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
 to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted from
 `nulls.psv`;
 +-+
 | date_col_converted  |
 +-+
 | null|
 | 2016-10-28  |
 +-+
 2 rows selected (0.521 seconds)
 0: jdbc:drill:> alter system set
 `drill.exec.functions.cast_empty_string_to_null` = true;
 +---+--+
 |  ok   | summary  |
 +---+--+
 | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
 +---+--+
 1 row selected (0.304 seconds)
 0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
>> `nulls.psv`;
 Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
 
 Fragment 0:0
 
 [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
>>> se-node10.se.lab:31010]
 (state=,code=0)
 0: jdbc:drill:>
 
 -Veera
 
 On Fri, Oct 28, 2016 at 9:24 AM, Khurram Faraaz 
 wrote:
 
> All,
> 
> Question is - why does it work for a parquet column and fails when CSV
> column is used ?
> 
> Drill 1.9.0 commit : a29f1e29
> 
> This is a simple project of column from a csv file, works.
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select columns[4] FROM `typeall_l.csv`
>> t1
> limit 5;
> +-+
> |   EXPR$0|
> +-+
> | 2011-11-04  |
> | 1986-10-22  |
> | 1992-09-10  |
> | 2016-08-07  |
> | 1986-01-25  |
> +-+
> 5 rows selected (0.26 seconds)
> {noformat}
> 
> Using TO_DATE function with columns[x] as first input fails, with an
> IllegalArgumentException
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select to_date(columns[4],'yyyy-mm-dd')
>>> FROM
> `typeall_l.csv` t1 limit 5;
> Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> 
> Fragment 0:0
> 

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Hi Vitalli,

DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
META_SHOWS_NO_CORRUPTION, and META_UNCLEAR_TEST_VALUES. What value will
the isDateCorrect flag have for each possibility, especially for
META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect the
same thing, or different?

Thanks.

Jinfeng



On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers  wrote:
> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the 
> writer, files are labeled as having that new version. No fuzziness during a 
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should be 
> keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather 
> surprising if they didn’t.
>
> - Paul
>
>> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka  
>> wrote:
>>
>> I agree that it would be good if the approach of parquet date correctness
>> detection will be upgraded. So I created the jira for it DRILL-4980
>> .
>>
>> But now we have two ideas:
>> 1. To add checking of the drill version additionally, so later we can
>> delete isDateCorrect label from parquet metadata.
>> 2. To add parquet writer version to the parquet metadata and check this
>> value instead of isDateCorrect and drillVersion.
>>
>> So which way, we should prefer now?
>>
>> Kind regards
>> Vitalii
>>
>> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
>>
>>> FWIW: back on the magic flag issue…
>>>
>>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>>> grained for our needs.
>>>
>>> A typical solution is include the version of the Parquet writer in
>>> addition to that of Drill. Each time we change something in the writer,
>>> increment the version number. If we number changes, we can easily handle
>>> two changes in the same Drill release, or differentiate between the “early
>>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>>
>>> Since we have no version now, start it at some arbitrary point (2?).
>>>
>>> Now, if the Parquet file has a Drill Writer version in the header, and
>>> that version is 2 or greater, the date is in the “correct” format. Anything
>>> written by Drill before writer version 2, the date is wrong. The “check the
> >>> data to see if it is sane” approach is needed only for files where we can’t
>>> tell if an older Drill wrote it.
>>>
>>> Do other tools label the data? Does Hive say that it wrote the file? If
>>> so, we don’t need to do the sanity check if we can tell the data comes from
>>> Hive (or Impala, or anything other than old Drill.)
>>>
>>> - Paul
>>>
 On Oct 27, 2016, at 4:03 PM, Zelaine Fong  wrote:

 Vitalii -- are you still planning to open a ticket and pull request for
>>> the
 fix you've noted below?

 -- Zelaine

 On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>>> vitalii.dira...@gmail.com>
 wrote:

> @Paul Rogers
> It may be the undefined case when the file is generated with
>>> drill.version
> = 1.9-SNAPSHOT.
> It is more easy to determine corrupted date with this flag and there is
>>> no
> need to wait the end of release to merge these changes.
>
> @Jinfeng NI
> It looks like you are right.
> With consistent mode (isDateCorrect = true) all tests are passed. So I
>>> am
> going to open a jira ticket for it with next changes
> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
> 3410cac5d3
> Thanks.
>
> Kind regards
> Vitalii
>
> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
>
>> I'm not sure if I fully understand your answers. The bottom line is
>> quite simple: given a set of parquet files, the ParquetTableMeta
>> instance constructed in Drill should have identical value for
>> "isDateCorrect", whether it comes from parquet footer, or parquet
>> metadata cache, or whether there is partition pruning or not. However,
>> the code shows that this flag is not in consistent mode across
>> different cases.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>  wrote:
>>> Hi Jinfeng,
>>>
>>> 1.If the parquet files are generated with Drill after Drill-4203 these
>>> files have "isDateCorrect = true" property.
>>> Drill serializes this property from metadata now. When we set this
>> property
>>> in the first constructor we will hide the value from metadata.
>>> IsDateCorrect will be false only if this value equals to the false (no
>> case
>>> for it now) or absent in parquet metadata footer.
>>>
>>>
>>> 2. I'm not sure the reason to change isDateCorrect metadata property
> when
>>> the user disable dates correction.
>>> If you have some use case it would be great if you provide it.
>>>
>>> 3. Maybe you are right regarding to when Parquet metadata is c

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The isDateCorrect flag means that the values are known to be correct, and
there is no need to auto-detect corruption or correct anything.

META_SHOWS_CORRUPTION can be set either when we have a known old version of
Drill written in the metadata, or when we have older files that might have
been written by Drill, where we have checked the values in the statistics
and found corrupt-looking values. Really old files without any statistics
don't have information that allows us to identify them as Drill-produced,
so we have to test the values during actual page reads; this is where
META_UNCLEAR_TEST_VALUES is used.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer
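A hedged sketch of the decision described here, as a standalone function: the enum constants match the thread, but the footer keys, parameters, and helper shape are illustrative, not Drill's actual implementation:

```java
import java.util.Map;

public class DateCorruptionCheck {
    enum DateCorruptionStatus {
        META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES
    }

    // keyValueMeta: the Parquet footer's extra key/value pairs, e.g. {"drill.version": "1.8.0"}.
    // hasStatistics: whether the footer carries column statistics we can inspect up front.
    // statsLookCorrupt: result of inspecting those statistics for corrupt-looking date values.
    static DateCorruptionStatus detect(Map<String, String> keyValueMeta,
                                       boolean hasStatistics, boolean statsLookCorrupt) {
        if ("true".equals(keyValueMeta.get("is.date.correct"))) {
            return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;  // written after the date fix
        }
        if (keyValueMeta.containsKey("drill.version")) {
            return DateCorruptionStatus.META_SHOWS_CORRUPTION;     // known old Drill writer
        }
        if (hasStatistics) {
            // Possibly an older Drill file: trust the up-front statistics check.
            return statsLookCorrupt ? DateCorruptionStatus.META_SHOWS_CORRUPTION
                                    : DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
        }
        // No label and no statistics: defer to testing values during actual page reads.
        return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
    }

    public static void main(String[] args) {
        System.out.println(detect(Map.of("is.date.correct", "true"), true, false));
        System.out.println(detect(Map.of(), false, false));
    }
}
```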

On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni  wrote:

> Hi Vitalli,
>
> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
> this isDateCorrect flag have for each possiblity, especially for
> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
> same things, or different?
>
> Thanks.
>
> Jinfeng
>
>
>

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
I explored metadata of parquet files generated from different tools:

*Impala:*
creator: impala version 2.2.0-cdh5.4.5 (build
4a81c1d04c39961ef14ff6131d543dd96ef60e6e)

*Hive:*
creator:   parquet-mr version 1.6.0

*Pig:*
creator: parquet-mr version 1.5.1-SNAPSHOT
extra:   pig.schema = recipe: chararray,ingredients: {(name:
chararray)},inventor: (name: chararray,age: int)

*Spark:*
creator:  parquet-mr
extra:parquet.proto.descriptor = name:
"ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
".TestProtobuf.MyComplex" }
extra:parquet.proto.class =
parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray


*Drill (now):*
creator: parquet-mr version 1.8.1-drill-r0 (build
6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra:   drill.version = 1.9.0-SNAPSHOT
extra:   is.date.correct = true

So we can replace the second extra with "parquet-writer.version = 2.0.0".
Thoughts?


Kind regards
Vitalii

2016-10-28 16:26 GMT+00:00 Paul Rogers :

> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the
> writer, files are labeled as having that new version. No fuzziness during a
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should
> be keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather
> surprising if they didn’t.
>
> - Paul
>
Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Thanks for the explanation, Jason.

The three different values for DateCorruptionStatus make sense to me.

The isDateCorrect flag = true means that the values are known to be correct.
The isDateCorrect flag = false means that the values are known to be
incorrect, or unclear?


On Fri, Oct 28, 2016 at 12:59 PM, Jason Altekruse  wrote:
> The isDataCorrect flag means that the values are known to be correct, and
> there is no need to auto-detect corruption or correct anything.
>
> META_SHOWS_CORRUPTION can be set either when we have a known old version of
> Drill written in the metadata, or we have older files that might have been
> written by Drill that we have checked the values in the statistics and
> found corrupt looking values. Really old files without any statistics don't
> have information that allows us to identify them as Drill-produced, so we
> have to test the values during actual page reads, this is where
> META_UNCLEAR_TEST_VALUES is used.
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>

Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Paul Rogers
I like the proposal. The Parquet Writer version should just be 2 (no .0.0, since we 
won’t have major or minor versions). With things like writer versions (or RPC 
versions, etc.), the usual rule is to use increasing integers.

I am surprised that the other tools don’t include more detail about the 
version; if they ever change their writers, they’ll have the same vagueness 
problem that we’re trying to address for Drill…

Thanks,

- Paul
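Under the increasing-integer scheme above, the footer check collapses to a single comparison. The property name and threshold below follow the thread's proposal and are not shipped Drill code:

```java
import java.util.Map;

public class WriterVersionCheck {
    // Proposed starting point from the thread: everything below 2 predates the date fix.
    static final int FIRST_CORRECT_DATE_WRITER_VERSION = 2;

    // Returns true when the footer proves dates are written in the corrected format.
    static boolean datesKnownCorrect(Map<String, String> keyValueMeta) {
        String v = keyValueMeta.get("parquet-writer.version");  // proposed footer key
        if (v == null) {
            return false;  // no label: fall back to is.date.correct / value sniffing
        }
        return Integer.parseInt(v) >= FIRST_CORRECT_DATE_WRITER_VERSION;
    }

    public static void main(String[] args) {
        System.out.println(datesKnownCorrect(Map.of("parquet-writer.version", "2")));  // true
        System.out.println(datesKnownCorrect(Map.of("drill.version", "1.8.0")));       // false
    }
}
```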

> On Oct 28, 2016, at 1:03 PM, Vitalii Diravka  
> wrote:
> 
> I explored metadata of parquet files generated from different tools:
> 
> *Impala:*
> creator: impala version 2.2.0-cdh5.4.5 (build
> 4a81c1d04c39961ef14ff6131d543dd96ef60e6e)
> 
> *Hive:*
> creator:   parquet-mr version 1.6.0
> 
> *Pig:*
> creator: parquet-mr version 1.5.1-SNAPSHOT
> extra:   pig.schema = recipe: chararray,ingredients: {(name:
> chararray)},inventor: (name: chararray,age: int)
> 
> *Spark:*
> creator:  parquet-mr
> extra:parquet.proto.descriptor = name:
> "ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
> LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
> label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
> ".TestProtobuf.MyComplex" }
> extra:parquet.proto.class =
> parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray
> 
> 
> *Drill (now):*
> creator: parquet-mr version 1.8.1-drill-r0 (build
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra:   drill.version = 1.9.0-SNAPSHOT
> extra:   is.date.correct = true
> 
> So we can replace second extra with "parquet-writer.version = 2.0.0".
> Thoughts?
> 
> 
> Kind regards
> Vitalii
> 

Re: to_date(csv-columns[x],'yyyy-mm-dd') - IllegalArgumentException

2016-10-28 Thread Veera Naranammalpuram
I would expect it to work. You should just reopen DRILL-3214. You have
already created one for this.

-Veera

On Fri, Oct 28, 2016 at 3:08 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> You want to use MM for month and not mm for minute, as mm can produce the
> wrong result.
>
> Probably best to file an enhancement JIRA to have the function handle
> empty fields and produce a null value. Then the wider audience can review
> the merit for implementation.
>
> --Andries
>
>
> > On Oct 28, 2016, at 9:09 AM, Khurram Faraaz 
> wrote:
> >
> > Thanks Andries and Veera.
> >
> > 1. Yes, my CSV file does have empty strings in some rows in columns[4].
> > 2. it worked for parquet because I had used the case expression to cast
> > empty strings to NULL.
> > 3. I tried with 'yyyy-mm-dd' and 'yyyy-MM-dd' and to_date returned
> > results with both representations.
> >
> > Question - Shouldn't Drill handle such empty strings that are within
> > rows in CSV files ?
> > Why should the user have to take care of such cases ?
> >
> > Regards,
> > Khurram
> >
> > On Fri, Oct 28, 2016 at 9:17 PM, Veera Naranammalpuram <
> > vnaranammalpu...@maprtech.com> wrote:
> >
> >> That should work and a lot faster too. Thanks for the pointer.
> >>
> >> -Veera
> >>
> >> On Fri, Oct 28, 2016 at 11:43 AM, Andries Engelbrecht <
> >> aengelbre...@maprtech.com> wrote:
> >>
> >>> Good catch on empty string Veera!
> >>>
> >>> Wouldn't it be cheaper to check for an empty string?
> >>> case when columns[] = '' then null else to_date(columns[],'yyyy-MM-dd')
> >> end
> >>>
> >>> I don't think the option to read csv empty columns (or empty string in
> >> any
> >>> text reader) as null is in the reader yet. So we can't check with
> >> columns[]
> >>> is null.
> >>>
> >>>
> >>> --Andries
> >>>
> >>>
>  On Oct 28, 2016, at 8:21 AM, Veera Naranammalpuram <
> >>> vnaranammalpu...@maprtech.com> wrote:
> 
>  Do you have zero length strings in your data? I have seen cases where
> >> the
>  system option to cast empty strings to NULL doesn't work as
> advertised.
> >>> You
>  should re-open DRILL-3214.
> 
>  When I run into this problem, I usually use a regex to workaround. The
>  PROJECT takes a performance hit when you do this for larger data sets
> >> but
>  it works.
> 
>  $cat nulls.psv
>  date_col|string_col
>  |test
>  2016-10-28|test2
>  $ sqlline
>  apache drill 1.8.0
>  "a little sql for your nosql"
>  0: jdbc:drill:> select date_col, string_col from `nulls.psv`;
>  +-------------+-------------+
>  |  date_col   | string_col  |
>  +-------------+-------------+
>  |             | test        |
>  | 2016-10-28  | test2       |
>  +-------------+-------------+
>  2 rows selected (0.303 seconds)
>  0: jdbc:drill:> select to_date(date_col,'yyyy-mm-dd') from
> `nulls.psv`;
>  Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> 
>  Fragment 0:0
> 
>  [Error Id: c058acbe-f2bf-4c3b-a447-66bebdc4c642 on
> >>> se-node10.se.lab:31010]
>  (state=,code=0)
>  0: jdbc:drill:>  select case when date_col similar to '[0-9]+%' then
>  to_date(date_col,'yyyy-MM-dd') else null end as date_col_converted
> from
>  `nulls.psv`;
>  +---------------------+
>  | date_col_converted  |
>  +---------------------+
>  | null                |
>  | 2016-10-28          |
>  +---------------------+
>  2 rows selected (0.521 seconds)
>  0: jdbc:drill:> alter system set
>  `drill.exec.functions.cast_empty_string_to_null` = true;
>  +-------+----------------------------------------------------------+
>  |  ok   | summary                                                  |
>  +-------+----------------------------------------------------------+
>  | true  | drill.exec.functions.cast_empty_string_to_null updated.  |
>  +-------+----------------------------------------------------------+
>  1 row selected (0.304 seconds)
>  0: jdbc:drill:>  select to_date(date_col,'yyyy-mm-dd') from
> >> `nulls.psv`;
>  Error: SYSTEM ERROR: IllegalArgumentException: Invalid format: ""
> 
>  Fragment 0:0
> 
>  [Error Id: 92126a1b-1c03-4e90-bc3a-01c5c81bb013 on
> >>> se-node10.se.lab:31010]
>  (state=,code=0)
>  0: jdbc:drill:>
> 
>  -Veera
> 
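Andries's mm-versus-MM warning earlier in this thread is easy to demonstrate: in the pattern languages of both Joda-Time (which Drill uses) and java.time, lowercase mm means minute-of-hour, so 'yyyy-mm-dd' supplies no month at all. A small java.time illustration (Joda's failure mode differs in detail, but the pattern letters mean the same thing):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class MonthVsMinute {
    public static void main(String[] args) {
        // Correct: MM is month-of-year.
        LocalDate ok = LocalDate.parse("2016-10-28", DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        System.out.println(ok);  // 2016-10-28

        // Wrong: mm is minute-of-hour, so the pattern parses year + minute + day
        // and a LocalDate cannot be built without a month.
        try {
            LocalDate.parse("2016-10-28", DateTimeFormatter.ofPattern("yyyy-mm-dd"));
            System.out.println("parsed (unexpected)");
        } catch (DateTimeParseException e) {
            System.out.println("yyyy-mm-dd failed: no month field in the pattern");
        }
    }
}
```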

[GitHub] drill issue #602: Improve Drill C++ connector

2016-10-28 Thread laurentgo
Github user laurentgo commented on the issue:

https://github.com/apache/drill/pull/602
  
sounds good, I think all of the comments I received have been addressed.




Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Vitalii Diravka
Jinfeng,

isDateCorrect will be false in the code when isDateCorrect property is
absent in the parquet metadata.

Anyway I am going to implement the mentioned approach with the
parquet-writer.version instead of isDateCorrect property.


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jinfeng Ni
Vitalii,

Just to confirm, you will "remove" the isDateCorrect flag and use the
parquet-writer version instead, correct?




On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka
 wrote:
> Jinfeng,
>
> isDateCorrect will be false in the code when isDateCorrect property is
> absent in the parquet metadata.
>
> Anyway I am going to implement the mentioned approach with the
> parquet-writer.version instead of isDateCorrect property.


Re: isDateCorrect field in ParquetTableMetadata

2016-10-28 Thread Jason Altekruse
The only worry I have about declaring a writer version is possible
confusion with the Parquet format version itself. The format is already
defined through version 2.1 or something like that, but we are currently
only writing files based on the 1.x version of the format.

My preferred solution to this problem would be to just make point releases
for problems like this (in this case we could have made a 1.8.1 release;
then everything built from 1.8.0-SNAPSHOT would be known to be bad, and
everything after would be 1.8.1-SNAPSHOT and known to be correct).

I'm open to hearing other opinions on this; I just generally feel like
these bugs should be rare, and fixing them should be done with a lot of
care (and in this case I missed a few things). I don't think it would be
crazy to say that we should only merge these kinds of patches if we are
willing to say the fix is ready for a release.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka 
wrote:

> Jinfeng,
>
> isDateCorrect will be false in the code when isDateCorrect property is
> absent in the parquet metadata.
>
> Anyway I am going to implement the mentioned approach with the
> parquet-writer.version instead of isDateCorrect property.
>


[jira] [Created] (DRILL-4981) TPC-DS Query 75 fails on MapR-DB JSON Tables

2016-10-28 Thread Abhishek Girish (JIRA)
Abhishek Girish created DRILL-4981:
--

 Summary: TPC-DS Query 75 fails on MapR-DB JSON Tables
 Key: DRILL-4981
 URL: https://issues.apache.org/jira/browse/DRILL-4981
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - MapRDB
Affects Versions: 1.9.0
Reporter: Abhishek Girish
Assignee: Smidth Panchamia


TPC-DS Query 75 fails on MapR-DB JSON Tables, but succeeds on Text, Parquet & 
JSON File formats. 

I'll work on a simpler repro. Find the original query & error below:

{code}
WITH all_sales
     AS (SELECT d_year,
                i_brand_id,
                i_class_id,
                i_category_id,
                i_manufact_id,
                Sum(sales_cnt) AS sales_cnt,
                Sum(sales_amt) AS sales_amt
         FROM   (SELECT d_year,
                        i_brand_id,
                        i_class_id,
                        i_category_id,
                        i_manufact_id,
                        cs_quantity - COALESCE(cr_return_quantity, 0)        AS sales_cnt,
                        cs_ext_sales_price - COALESCE(cr_return_amount, 0.0) AS sales_amt
                 FROM   catalog_sales
                        JOIN item
                          ON i_item_sk = cs_item_sk
                        JOIN date_dim
                          ON d_date_sk = cs_sold_date_sk
                        LEFT JOIN catalog_returns
                               ON ( cs_order_number = cr_order_number
                                    AND cs_item_sk = cr_item_sk )
                 WHERE  i_category = 'Men'
                 UNION
                 SELECT d_year,
                        i_brand_id,
                        i_class_id,
                        i_category_id,
                        i_manufact_id,
                        ss_quantity - COALESCE(sr_return_quantity, 0)     AS sales_cnt,
                        ss_ext_sales_price - COALESCE(sr_return_amt, 0.0) AS sales_amt
                 FROM   store_sales
                        JOIN item
                          ON i_item_sk = ss_item_sk
                        JOIN date_dim
                          ON d_date_sk = ss_sold_date_sk
                        LEFT JOIN store_returns
                               ON ( ss_ticket_number = sr_ticket_number
                                    AND ss_item_sk = sr_item_sk )
                 WHERE  i_category = 'Men'
                 UNION
                 SELECT d_year,
                        i_brand_id,
                        i_class_id,
                        i_category_id,
                        i_manufact_id,
                        ws_quantity - COALESCE(wr_return_quantity, 0)     AS sales_cnt,
                        ws_ext_sales_price - COALESCE(wr_return_amt, 0.0) AS sales_amt
                 FROM   web_sales
                        JOIN item
                          ON i_item_sk = ws_item_sk
                        JOIN date_dim
                          ON d_date_sk = ws_sold_date_sk
                        LEFT JOIN web_returns
                               ON ( ws_order_number = wr_order_number
                                    AND ws_item_sk = wr_item_sk )
                 WHERE  i_category = 'Men') sales_detail
         GROUP  BY d_year,
                   i_brand_id,
                   i_class_id,
                   i_category_id,
                   i_manufact_id)
SELECT prev_yr.d_year                        AS prev_year,
       curr_yr.d_year                        AS year1,
       curr_yr.i_brand_id,
       curr_yr.i_class_id,
       curr_yr.i_category_id,
       curr_yr.i_manufact_id,
       prev_yr.sales_cnt                     AS prev_yr_cnt,
       curr_yr.sales_cnt                     AS curr_yr_cnt,
       curr_yr.sales_cnt - prev_yr.sales_cnt AS sales_cnt_diff,
       curr_yr.sales_amt - prev_yr.sales_amt AS sales_amt_diff
FROM   all_sales curr_yr,
       all_sales prev_yr
WHERE  curr_yr.i_brand_id = prev_yr.i_brand_id
       AND curr_yr.i_class_id = prev_yr.i_class_id
       AND curr_yr.i_category_id = prev_yr.i_category_id
       AND curr_yr.i_manufact_id = prev_yr.i_manufact_id
       AND curr_yr.d_year = 2002
       AND prev_yr.d_year = 2002 - 1
       AND Cast(curr_yr.sales_cnt AS DECIMAL(17, 2)) /
           Cast(prev_yr.sales_cnt AS DECIMAL(17, 2)) < 0.9
ORDER  BY sales_cnt_diff
LIMIT  100
{code}

Error:
{code}
Failed with exception
java.sql.SQLException: SYSTEM ERROR: NullPointerException


[Error Id: 128bb62b-a6af-4b8f-90d5-d9f516b9e3d4 on atsqa6c83
{code}

[GitHub] drill pull request #635: DRILL-4927 (part 2): Add support for Null Equality ...

2016-10-28 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/635#discussion_r85625856
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestJoinNullable.java ---
@@ -493,4 +496,81 @@ public void withNullEqualAdditionFilter() throws 
Exception {
 .go();
   }
 
+  @Test
+  public void withMixedEqualAndIsNotDistinctHashJoin() throws Exception {
--- End diff --

Could you add a test with the predicate of type: 
t1.key = t2.key AND ((t1.data=t2.data) OR (t1.data IS NULL AND t2.data 
IS NULL))
where the right side of the AND gets converted to IS NOT DISTINCT FROM.  
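
The equivalence behind that rewrite can be checked quickly outside Drill.
The sketch below uses Python's built-in sqlite3 module (SQLite spells the
null-safe comparison `IS` rather than `IS NOT DISTINCT FROM`); the tables
and data are made up for the demonstration:

```python
import sqlite3

# Two toy tables where `data` may be NULL, so plain `=` would drop
# the NULL/NULL pair from the join result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(key INTEGER, data INTEGER);
    CREATE TABLE t2(key INTEGER, data INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, NULL);
    INSERT INTO t2 VALUES (1, 10), (2, NULL);
""")

# The expanded predicate from the review comment.
expanded = conn.execute("""
    SELECT t1.key FROM t1 JOIN t2
      ON t1.key = t2.key
     AND ((t1.data = t2.data) OR (t1.data IS NULL AND t2.data IS NULL))
    ORDER BY t1.key
""").fetchall()

# The null-safe operator it should convert to (`IS` in SQLite).
null_safe = conn.execute("""
    SELECT t1.key FROM t1 JOIN t2
      ON t1.key = t2.key
     AND t1.data IS t2.data
    ORDER BY t1.key
""").fetchall()

print(expanded == null_safe)  # both forms match the same rows, NULLs included
```

Both queries return rows for key 1 (10 = 10) and key 2 (NULL matched with
NULL), which is exactly the behavior the IS NOT DISTINCT FROM conversion
must preserve.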


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #602: Improve Drill C++ connector

2016-10-28 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/602
  
+1. Everything looks good. I'm assuming that you will squash the commits 
which don't have an associated Jira.
Also, if you can add any notes needed to the Windows build instructions, 
that would be useful.
Thanks for patiently fixing all the issues!




Re: Time for a 1.9 Release?

2016-10-28 Thread Sudheesh Katkam
Let's aim for EOD next Friday (11/04/16) to get all changes in; I will try
to get RC0 out on Monday (11/07/16).

Current list of commits:

[Sudheesh]
+ DRILL-4280: pull request being reviewed
https://github.com/apache/drill/pull/578

[Jinfeng]
+ DRILL-1950: pull request pending

Any other pull requests that developers would like to get into the release?
Please post the status too.

Thank you,
Sudheesh

On Fri, Oct 28, 2016 at 11:35 AM, Jinfeng Ni  wrote:

> +1
>
> I'm working on DRILL-1950 to support parquet row group level filter
> pruning. I plan to submit a pull request for code review in 1-2 days,
> hopefully.
>
>
>
> On Fri, Oct 28, 2016 at 11:04 AM, Aman Sinha  wrote:
> > +1
> >
> > On Fri, Oct 28, 2016 at 10:34 AM, Sudheesh Katkam 
> > wrote:
> >
> >> Hi Drillers,
> >>
> >> We have a reasonable number of fixes and features since the last release
> >> [1]. Releasing itself takes a while; so I propose we start the 1.9
> release
> >> process.
> >>
> >> I volunteer as the release manager, unless there are objections.
> >>
> >> We should also discuss what the release version number should be after
> 1.9.
> >>
> >> Thank you,
> >> Sudheesh
> >>
> >> [1] https://issues.apache.org/jira/browse/DRILL/fixforversion/12337861
> >>
>


[GitHub] drill issue #602: Improve Drill C++ connector

2016-10-28 Thread laurentgo
Github user laurentgo commented on the issue:

https://github.com/apache/drill/pull/602
  
yes, let me clean my branch one last time by squashing the small commits 
with no JIRA. For the Windows build, I already added instructions regarding 
the need for PowerShell on the system and on how to get/install cppunit; 
can you have a look?




[GitHub] drill issue #602: Improve Drill C++ connector

2016-10-28 Thread laurentgo
Github user laurentgo commented on the issue:

https://github.com/apache/drill/pull/602
  
Branch updated with only commits associated with a JIRA (everything else 
has been merged into commit for DRILL-4420)




Re: Time for a 1.9 Release?

2016-10-28 Thread Parth Chandra
+1 on doing a release.

I'm hoping the following get a +1 :

DRILL-4800 - Improve parquet reader performance
DRILL-3423 - Add New HTTPD format plugin
DRILL-4858 - REPEATED_COUNT on JSON containing an array of maps

Specifically, what did you want to discuss about the release number after
1.9? Ordinarily you would just go to 2.0. The only reason for holding off
on 2.0 that I can think of is if you want to make breaking changes in the
2.0 release and those are not going to be ready for the next release cycle.
Are any devs planning such breaking changes? If so, we should discuss that
(or any other reason we might have for deferring 2.0) in a separate thread.
I'm +0 on any version number we choose.



On Fri, Oct 28, 2016 at 4:53 PM, Sudheesh Katkam 
wrote:

> Let's aim for EOD next Friday (11/04/16) to get all changes in; I will try
> to get RC0 out on Monday (11/07/16).
>
> Current list of commits:
>
> [Sudheesh]
> + DRILL-4280: pull request being reviewed
> https://github.com/apache/drill/pull/578
>
> [Jinfeng]
> + DRILL-1950: pull request pending
>
> Any other pull requests that developers would like to get into the release?
> Please post the status too.
>
> Thank you,
> Sudheesh
>
> On Fri, Oct 28, 2016 at 11:35 AM, Jinfeng Ni  wrote:
>
> > +1
> >
> > I'm working on DRILL-1950 to support parquet row group level filter
> > pruning. I plan to submit a pull request for code review in 1-2 days,
> > hopefully.
> >
> >
> >
> > On Fri, Oct 28, 2016 at 11:04 AM, Aman Sinha 
> wrote:
> > > +1
> > >
> > > On Fri, Oct 28, 2016 at 10:34 AM, Sudheesh Katkam  >
> > > wrote:
> > >
> > >> Hi Drillers,
> > >>
> > >> We have a reasonable number of fixes and features since the last
> release
> > >> [1]. Releasing itself takes a while; so I propose we start the 1.9
> > release
> > >> process.
> > >>
> > >> I volunteer as the release manager, unless there are objections.
> > >>
> > >> We should also discuss what the release version number should be after
> > 1.9.
> > >>
> > >> Thank you,
> > >> Sudheesh
> > >>
> > >> [1] https://issues.apache.org/jira/browse/DRILL/
> fixforversion/12337861
> > >>
> >
>