[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-10-21 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
Added fixes according to review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-11-01 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
@chunhui-shi what about second commit?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-11-09 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
@chunhui-shi 
Results of performance test

| % of nullable records | with fix | master | %(t-t0)/t0 |
| 0.001 | 13986,8 | 13752,1 | 1,71% |
| 0.1 | 13873,6 | 13081,3 | 6,06% |
| 0.5 | 11345,6 | 7552,44 | 50,22% |
| 0.7 | 12699,1 | 5753,44 | 120,72% |
| 0.99 | 14544,2 | 494 | 2844,17% |

So increase of performance depends on % of nullable fields in dataset


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-11-17 Thread jinma1978
Github user jinma1978 commented on the issue:

https://github.com/apache/drill/pull/594
  
@chunhui-shi I tried to play with different java structures for column 
path, but no real advance in performance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-12-09 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
@chunhui-shi, could you please review new changes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2016-12-16 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
@chunhui-shi, I have made some changes, could you take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-06 Thread chunhui-shi
Github user chunhui-shi commented on the issue:

https://github.com/apache/drill/pull/594
  
+1. LGTM. Need to address conflict before ready to commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-07 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
Squashed all changes into single commit, rebased on master and resolved 
conflicts.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-14 Thread Serhii-Harnyk
Github user Serhii-Harnyk commented on the issue:

https://github.com/apache/drill/pull/594
  
@amansinha100, could you please review new changes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-17 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/594
  
The bug here is fundamental to the way Drill works with JSON. We already 
had an extensive discussion around this area in another PR. The problem is that 
JSON supports a null type which is independent of all other types. In JSON, a 
null is not a "null int" or a "null string" -- it is just null.

Drill must infer a type for a field. This leads to all kinds of grief when 
a file contains a run of nulls before the real value:

{code}
{ id: 1, b: null }
...
{ id: 8, b: "gee, I'm a string!" }
{code}

Drill must do something with the leading values. "b" is a null... what? 
Int? String?

We've had many bugs in this area. The bugs are not just code bugs, they 
represent a basic incompatibility between Drill and JSON.

This fix is yet another attempt to work around the limitation, but cannot 
overcome the basic incompatibility.

What we are doing, it seems, is building a list of fields that have seen 
only null values, deferring action on those fields until later. That works fine 
if "later" occurs in the same record batch. It is not clear what happens if we 
get to the end of the batch (as in the example above), but have never seen the 
type of the field: what type of vector do we create?

There are several solutions.

One is to have a "null" type in Drill. When we see the initial run of 
nulls, we simply create a field of the "null" type. We have type conversion 
rules that say that a "null" vector can be coerced into any other type when we 
ultimately see the type. (And, if we don't see a type in one batch, we can pass 
the null vector along upstream for later reconciliation.) This is a big change; 
too big for a bug fix.

Another solution, used here, is to keep track of "null only" fields, to 
defer the decision for later. That has a performance impact.

A third solution is to go ahead and create a vector of any type, keep 
setting its values to null (as if we had already seen the field type), but be 
ready to discard that vector and convert it to the proper type once we see that 
type. In this way, we treat null fields just as any other up to the point where 
we realize we have a type conflict. Only then do we check the "null only" map 
and decide we can quietly convert the vector type to the proper type.

These are the initial thoughts. I'll add more nuanced comments as I review 
the code in more detail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-18 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/594
  
We have three cases for nulls:

1. See a typed field, followed by nulls. Here, we know the type, just mark 
the current field as null.
2. Run of nulls at start of file, followed by a non-null value within the 
same record batch. We can transform the null field into a typed field. That is 
what this fix tries to do.
3. Run of nulls in one record batch, followed by a non-null value in the 
next batch. This must trigger a schema change. The question is, does it trigger 
based on adding a new field, or assuming a type in the first batch, then 
changing the type in the second?

The code presumably handles the first case. Case 3 is beyond the scope of 
this fix. So, the question is, how best to handle the second case? Is the 
present fix sufficient?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-18 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/594
  
Note that Drill *does* have a vector that can possibly used to represent a 
run of nulls: the {{ZeroVector}}. Using this, we can:

* On the first record where we see a field, if that field is null, add a 
{{ZeroVector}} to the record batch.
* On subsequent records, if the value is still null, do nothing.
* If the value is non-null, and the current vector is a {{ZeroVector}}, 
replace it with the proper Nullable vector, with all (0..i-1) values set to 
null, and the ith value set to the current column.
* At end of batch, if any {{ZeroVector}}s remain, simply remove them so 
that the column does not appear in the batch output.

The result of this is that we need not do two map lookups for a null value, 
we just do one: the one to find the column value vector as we'd do for an int 
or string.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-20 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/594
  
Three general rules to keep in mind in the current JSON reader 
implementation:

* Drill can remember the past. (Once a type as been seen for a column, 
Drill will remember that type.)
* Drill cannot predict the future. (If a type has not been seen for a 
column by the end of a record batch, Drill cannot predict what type will appear 
in some later batch.)
* Drill can amend the past within a single record batch. (If a batch starts 
with nulls, but later a type is seen, the previous values are automatically 
filled with nulls.)

Actual implementation of the JSON reader, and the value writers that form 
the implementation, is complex. As we read JSON values, we ask a type-specific 
writer to set that value into the value vector. Each writer marks the column as 
non-null, then adds the value. Any values not so set will default to null.

Consider a file with five null "c1" values followed by a string value "foo" 
for that field. The five nulls are ignored. When we see the non-null c1, the 
code creates a VarChar vector and sets the 6th value to the string "foo". Doing 
so automatically marks the previous five column values as null.

Suppose we have a file with a single string value "foo" for column "c1", 
followed by five nulls. In this case, the first value creates and sets the 
VarChar vector as before. Later, at the end of reading the record batch, the 
reader sets the record count for the vectors. This action, on the VarChar 
vector, has the effect of setting the trailing five column values to null.

Since values default to null, we get this behavior, and the previous, for 
free. The result is that if a record batch contains even a single non-null 
value for a field, that column will be fully populated with nulls for all other 
records in the same batch.

This gets us back to the same old problem in Drill: if all we see are 
nulls, Drill needs to know, "null of what type" while in JSON the value is just 
null. The JIRA tickets linked to this ticket all related to that same 
underlying issue.

There is a long history of this issue: DRILL-5033, DRILL-1256, DRILL-4479, 
DRILL-3806 and more.

This fix affects only "all text mode." This means that, regardless of the 
JSON type, create a VarChar column. Doing so provides a very simple fix. Since 
all columns are VarChar, when we see a new column, with a null value, just 
create a VarChar column. (No need to set the column to null.)

That is, we can "predict the future" for nulls because *all* columns are 
VarChar -- so there is not much to predict.

Otherwise, we have to stick with Jacques' design decision in DRILL-1256: 
"Drill's perspective is a non-existent column and a column with no value are 
equivalent." A record batch of all nulls, followed by a record batch with a 
non-null value, will cause a schema change.

Again, Drill needs a "null" type that is compatible with all other types in 
order to support JSON semantics. (And, needs to differentiate between 
value-exists-and-is-null and value-does-not-exist.)

Yet another solution is to have the user tell us their intent. The [JSON 
Schema](http://jsonschema.net) project provides a way to express the expected 
schema so that Drill would know up front the type of each column (and whether 
the column is really nullable.)




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #594: DRILL-4842: SELECT * on JSON data results in NumberFormatE...

2017-02-20 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/594
  
Given all the above, there is very simple fix to the particular case that 
this bug covers.

{code}
  private void writeDataAllText(MapWriter map, FieldSelection selection,
...
  case VALUE_NULL:
// Here we do have a type. This is a null VarChar.
handleNullString(map, fieldName);
break;
...

  /**
   * Create a VarChar column. No need to explicitly set a
   * null value; nulls are the default.
   * 
   * Note: This only works for all-text mode because we can
   * predict that, if we ever see an actual value, it will be
   * treated as a VarChar. This trick will not work for the
   * general case because we cannot predict the actual column
   * type.
   * @param writer
   * @param fieldName
   */

  private void handleNullString(MapWriter writer, String fieldName) {
writer.varChar(fieldName);
  }
{code}

The above simply leverages the existing mechanism for mapping columns to 
types, and for filling in missing null values.

Output, when printing {{tooManyNulls.json}} to CSV:

{code}
4096 row(s):
c1
null
...
null
1 row(s):
c1
Hello World
Total rows returned : 4097.  Returned in 242ms.
{code}

Performance here will be slower than master because we now do a field 
lookup for each null column where in the past we did not. The performance of 
null columns, however, should be identical to non-null columns. And, 
performance of the above fix should be identical to the fix proposed in this 
PR: but the code here is simpler.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---