[jira] [Created] (HIVE-10161) LLAP: IO buffers seem to be hard-coded to 256kb

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10161:
--

 Summary: LLAP: IO buffers seem to be hard-coded to 256kb 
 Key: HIVE-10161
 URL: https://issues.apache.org/jira/browse/HIVE-10161
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Gopal V
Assignee: Sergey Shelukhin
 Fix For: llap


The EncodedReaderImpl dies when reading data from the cache that was written by 
the regular ORC writer:

{code}
Caused by: java.io.IOException: java.lang.IllegalArgumentException: Buffer size 
too small. size = 262144 needed = 3919246
at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.rethrowErrorIfAny(LlapInputFormat.java:249)
at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.nextCvb(LlapInputFormat.java:201)
at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:140)
at 
org.apache.hadoop.hive.llap.io.api.impl.LlapInputFormat$LlapRecordReader.next(LlapInputFormat.java:96)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
... 22 more
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 
262144 needed = 3919246
at 
org.apache.hadoop.hive.ql.io.orc.InStream.addOneCompressionBuffer(InStream.java:780)
at 
org.apache.hadoop.hive.ql.io.orc.InStream.uncompressStream(InStream.java:628)
at 
org.apache.hadoop.hive.ql.io.orc.EncodedReaderImpl.readEncodedColumns(EncodedReaderImpl.java:309)
at 
org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:278)
at 
org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.callInternal(OrcEncodedDataReader.java:48)
at 
org.apache.hadoop.hive.common.CallableWithNdc.call(CallableWithNdc.java:37)
... 4 more
]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex 
vertex_1424502260528_1945_1_00 [Map 1] killed/failed due to:null]
{code}

Turning off hive.llap.io.enabled makes the error go away.
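
Not Hive code, just a minimal standalone sketch reproducing the arithmetic behind the failure (class and method names are invented): a hard-coded 256 KB cache buffer cannot hold a compressed ORC chunk that declares it needs ~3.9 MB.

```java
// Invented names; only the numbers come from the stack trace above.
public class BufferSizeCheck {
    static final int CACHE_BUFFER_SIZE = 256 * 1024; // 262144, the hard-coded size

    static void ensureFits(int needed) {
        if (needed > CACHE_BUFFER_SIZE) {
            throw new IllegalArgumentException(
                "Buffer size too small. size = " + CACHE_BUFFER_SIZE
                    + " needed = " + needed);
        }
    }

    public static void main(String[] args) {
        try {
            ensureFits(3919246); // the compression-chunk size from the stack trace
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```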



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-10162) LLAP: Avoid deserializing the plan > 1 times in a single thread

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10162:
--

 Summary: LLAP: Avoid deserializing the plan > 1 times in a single 
thread
 Key: HIVE-10162
 URL: https://issues.apache.org/jira/browse/HIVE-10162
 Project: Hive
  Issue Type: Sub-task
Affects Versions: llap
Reporter: Gopal V
Assignee: Gunther Hagleitner
 Fix For: llap
 Attachments: deserialize-plan-1.png, deserialize-plan-2.png

Kryo shows up in the critical hot-path for LLAP when using a plan with a very 
large filter condition, due to the fact that the plan is deserialized more than 
once for each task.

!deserialize-plan-1.png!

!deserialize-plan-2.png!
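
A hedged sketch of one way to avoid the repeated work (all names are hypothetical, not Hive's actual fix): memoize the deserialized plan per thread so the expensive Kryo step runs at most once per plan per thread instead of once per use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a stand-in replaces the real Kryo deserialization.
public class PlanCache {
    static int deserializeCalls = 0; // instrumentation for this sketch only

    private static final ThreadLocal<Map<String, Object>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    // Stand-in for the expensive Kryo deserialization step.
    private static Object deserializePlan(String planPath) {
        deserializeCalls++;
        return "plan:" + planPath;
    }

    public static Object getPlan(String planPath) {
        return CACHE.get().computeIfAbsent(planPath, PlanCache::deserializePlan);
    }

    public static void main(String[] args) {
        Object first = getPlan("/tmp/map.plan");
        Object second = getPlan("/tmp/map.plan"); // served from the cache
        System.out.println((first == second) + " calls=" + deserializeCalls);
    }
}
```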





[jira] [Created] (HIVE-10163) CommonMergeJoinOperator calls WritableComparator.get() in the inner loop

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10163:
--

 Summary: CommonMergeJoinOperator calls WritableComparator.get() in 
the inner loop
 Key: HIVE-10163
 URL: https://issues.apache.org/jira/browse/HIVE-10163
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.0
Reporter: Gopal V


The CommonMergeJoinOperator wastes CPU looking up the correct comparator for 
each WritableComparable in each row.

{code}
@SuppressWarnings("rawtypes")
private int compareKeys(List<Object> k1, List<Object> k2) {
  int ret = 0;
  for (int i = 0; i < k1.size(); i++) {
    WritableComparable key_1 = (WritableComparable) k1.get(i);
    WritableComparable key_2 = (WritableComparable) k2.get(i);
    ret = WritableComparator.get(key_1.getClass()).compare(key_1, key_2);
    if (ret != 0) {
      return ret;
    }
  }
  return ret;
}
{code}

The slow part of that get() is deep within {{ReflectionUtils.setConf}}, where 
it tries to use reflection to set the Comparator config for each row being 
compared.
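
A hedged sketch of the obvious remedy, with an invented stand-in for {{WritableComparator.get()}}: resolve the comparator once per key class and cache it, so the per-row cost is a cheap map hit instead of the reflective lookup.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ComparatorCache {
    static int factoryCalls = 0;

    private static final Map<Class<?>, Comparator<Object>> CACHE =
        new ConcurrentHashMap<>();

    // Stand-in for WritableComparator.get(): pretend this lookup is expensive
    // (in Hive it ends up in ReflectionUtils.setConf via reflection).
    @SuppressWarnings("unchecked")
    static Comparator<Object> expensiveLookup(Class<?> keyClass) {
        factoryCalls++;
        return (a, b) -> ((Comparable<Object>) a).compareTo(b);
    }

    static Comparator<Object> comparatorFor(Class<?> keyClass) {
        // Memoized: the expensive lookup runs once per key class, not per row.
        return CACHE.computeIfAbsent(keyClass, ComparatorCache::expensiveLookup);
    }

    static int compareKeys(List<?> k1, List<?> k2) {
        for (int i = 0; i < Math.min(k1.size(), k2.size()); i++) {
            int ret = comparatorFor(k1.get(i).getClass())
                .compare(k1.get(i), k2.get(i));
            if (ret != 0) {
                return ret;
            }
        }
        return Integer.compare(k1.size(), k2.size());
    }

    public static void main(String[] args) {
        compareKeys(List.of("a", "b"), List.of("a", "c"));
        compareKeys(List.of("x"), List.of("x"));
        System.out.println("factory calls: " + factoryCalls); // cached after the first row
    }
}
```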





[jira] [Created] (HIVE-10164) LLAP: ORC BIGINT SARGs regressed after Parquet PPD fixes (HIVE-8122)

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10164:
--

 Summary: LLAP: ORC BIGINT SARGs regressed after Parquet PPD fixes 
(HIVE-8122)
 Key: HIVE-10164
 URL: https://issues.apache.org/jira/browse/HIVE-10164
 Project: Hive
  Issue Type: Sub-task
Reporter: Gopal V
Assignee: Prasanth Jayachandran


HIVE-8122 seems to have introduced a toString() to the ORC PPD codepath for 
BIGINT.

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L162

{code}
private List<Object> getOrcLiteralList() {
  // no need to cast
  ...
  List<Object> result = new ArrayList<Object>();
  for (Object o : literalList) {
    result.add(Long.valueOf(o.toString()));
  }
  return result;
}
{code}

!orc-sarg-tostring.png!
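
For contrast, a hedged sketch (not the actual Hive patch) of the cheaper path: when the literals are already boxed Longs, cast them directly instead of round-tripping through toString()/Long.valueOf(), which allocates a String per literal.

```java
import java.util.ArrayList;
import java.util.List;

// Invented helper; only the toString()/Long.valueOf() pattern comes from
// the snippet above.
public class SargLiterals {
    static List<Long> toLongList(List<?> literalList) {
        List<Long> result = new ArrayList<>(literalList.size());
        for (Object o : literalList) {
            if (o instanceof Long) {
                result.add((Long) o);                   // no String allocation
            } else {
                result.add(Long.valueOf(o.toString())); // fallback only
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toLongList(List.of(1L, 2L, 3L))); // [1, 2, 3]
    }
}
```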





[jira] [Created] (HIVE-10165) Improve hive-hcatalog-streaming extensibility and support updates and deletes.

2015-03-31 Thread Elliot West (JIRA)
Elliot West created HIVE-10165:
--

 Summary: Improve hive-hcatalog-streaming extensibility and support 
updates and deletes.
 Key: HIVE-10165
 URL: https://issues.apache.org/jira/browse/HIVE-10165
 Project: Hive
  Issue Type: Improvement
  Components: HCatalog
Reporter: Elliot West
Assignee: Alan Gates
 Fix For: 1.2.0


h3. Overview
I'd like to extend the 
[hive-hcatalog-streaming|https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest]
 API so that it also supports the writing of record updates and deletes in 
addition to the already supported inserts.

h3. Motivation
We have many Hadoop processes outside of Hive that merge changed facts into 
existing datasets. Traditionally we achieve this by: reading in a ground-truth 
dataset and a modified dataset, grouping by a key, sorting by a sequence and 
then applying a function to determine inserted, updated, and deleted rows. 
However, in our current scheme we must rewrite all partitions that may 
potentially contain changes. In practice the number of mutated records is very 
small when compared with the records contained in a partition. This approach 
results in a number of operational issues:
* Excessive amount of write activity required for small data changes.
* Downstream applications cannot robustly read these datasets while they are 
being updated.
* Due to the scale of the updates (hundreds of partitions), the scope for 
contention is high. 

I believe we can address this problem by instead writing only the changed 
records to a Hive transactional table. This should drastically reduce the 
amount of data that we need to write and also provide a means for managing 
concurrent access to the data. Our existing merge processes can read and retain 
each record's {{ROW_ID}}/{{RecordIdentifier}} and pass this through to an 
updated form of the hive-hcatalog-streaming API which will then have the 
required data to perform an update or insert in a transactional manner. 
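
As a purely illustrative sketch (every name here is hypothetical, not the proposed API), the shape of carrying a {{RecordIdentifier}} through a merge and emitting typed mutations might look like:

```java
// Purely illustrative: all names invented. It only shows the shape of
// pairing a record identifier with an insert/update/delete operation.
public class MutationSketch {
    enum OperationType { INSERT, UPDATE, DELETE }

    // Stand-in for Hive's ROW_ID / RecordIdentifier.
    record RecordIdentifier(long transactionId, int bucketId, long rowId) {}

    record Mutation(OperationType type, RecordIdentifier id, Object row) {}

    static Mutation insert(Object row) {
        return new Mutation(OperationType.INSERT, null, row);
    }

    static Mutation update(RecordIdentifier id, Object newRow) {
        return new Mutation(OperationType.UPDATE, id, newRow);
    }

    static Mutation delete(RecordIdentifier id) {
        return new Mutation(OperationType.DELETE, id, null);
    }

    public static void main(String[] args) {
        RecordIdentifier id = new RecordIdentifier(42L, 0, 7L);
        System.out.println(update(id, "new-row").type()); // UPDATE
        System.out.println(delete(id).type());            // DELETE
    }
}
```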

h3. Benefits
* Enables the creation of large-scale dataset merge processes  
* Opens up Hive transactional functionality in an accessible manner to 
processes that operate outside of Hive.

h3. Implementation
We've patched the API to provide visibility to the underlying 
{{OrcRecordUpdater}} and allow extension of the {{AbstractRecordWriter}} by 
third-parties outside of the package. We've also updated the user facing 
interfaces to provide update and delete functionality. I've provided the 
modifications as three incremental patches. Generally speaking, each patch 
makes the API less backwards compatible but more consistent with respect to 
offering updates, deletes as well as writes (inserts). Ideally I hope that all 
three patches have merit, but only the first patch is absolutely necessary to 
enable the features we need on the API, and it does so in a backwards 
compatible way. I'll summarise the contents of each patch:

h4. HIVE-.0.patch - Required
This patch contains what we consider to be the minimum set of changes 
required to allow users to create {{RecordWriter}} subclasses that can insert, 
update, and delete records. These changes also maintain backwards 
compatibility at the expense of confusing the API a little. Note that the row 
representation has been changed from {{byte[]}} to {{Object}}. Within our data 
processing jobs our records are often available in a strongly typed and decoded 
form such as a POJO or a Tuple object. Therefore it seems to make sense that we 
are able to pass these through to the {{OrcRecordUpdater}} without having to go 
through a {{byte[]}} encoding step. This of course still allows users to use 
{{byte[]}} if they wish.

h4. HIVE-.1.patch - Nice to have
This patch builds on the changes made in the *required* patch and aims to make 
the API cleaner and more consistent while accommodating updates and inserts. It 
also adds some logic to prevent the user from submitting multiple operation 
types to a single {{TransactionBatch}} as we found this creates data 
inconsistencies within the Hive table. This patch breaks backwards 
compatibility.

h4. HIVE-.2.patch - Nomenclature
This final patch simply renames some of the existing types to more accurately 
convey their increased responsibilities. The API is no longer writing just new 
records; it is now also responsible for writing operations that are applied to 
existing records. This patch breaks backwards compatibility.

h3. Example
I've attached a simple example of typical usage of the API. This is not a patch 
and is intended as an illustration only.

h3. Known issues
I have not yet provided any unit tests for the extended functionality. I fully 
expect that these are required and will work on these if these patches have 
merit.

*Note: Attachments to follow.*





[jira] [Created] (HIVE-10166) Merge Spark branch to trunk 3/31/2015

2015-03-31 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created HIVE-10166:
--

 Summary: Merge Spark branch to trunk 3/31/2015
 Key: HIVE-10166
 URL: https://issues.apache.org/jira/browse/HIVE-10166
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: 1.1.0
Reporter: Xuefu Zhang








Re: Hive 1.2

2015-03-31 Thread Xuefu Zhang
Hi Sushanth,

Thanks for starting this. Rapid releasing is good. However, could you
please provide a list of major changes or enhancements? I'd like to see
whether we are in a position to release it as 1.1.1 or 1.2.0.

Thanks,
Xuefu

On Mon, Mar 30, 2015 at 10:02 AM, Sushanth Sowmyan 
wrote:

> Hi Folks,
>
> Given that we landed a bunch of changes in trunk shooting for 1.2
> after 1.1 was forked off, including some metastore changes, I propose
> that we have a feature rollup release of hive that matches the state
> of trunk sooner rather than later. For the timeline, I was thinking
> that it'd be ideal to fork around Apr 18th(Friday), and try to get RCs
> going within a couple of days, and a release by the end of
> April/beginning of May.
>
> I would like to volunteer to perform the duties of a release manager
> for this if there is enough appeal for this.
>
> Thanks
>


Re: Review Request 32370: HIVE-10040

2015-03-31 Thread Jesús Camacho Rodríguez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32370/#review78367
---



ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveDefaultCostModel.java


We could do that, but I don't feel strongly about architecting it that way, 
as we would be mixing two different cost models in the same method: one based 
on cardinality, and one based on CPU+IO+cardinality. Thus, the costs coming from 
that method for "NONE" and any other algorithm, e.g. "COMMON", should not even 
be comparable, right?


- Jesús Camacho Rodríguez


On March 27, 2015, 4:15 p.m., Jesús Camacho Rodríguez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32370/
> ---
> 
> (Updated March 27, 2015, 4:15 p.m.)
> 
> 
> Review request for hive and John Pullokkaran.
> 
> 
> Bugs: HIVE-10040
> https://issues.apache.org/jira/browse/HIVE-10040
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> CBO (Calcite Return Path): Pluggable cost modules [CBO branch]
> 
> 
> Diffs
> -
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java
>  977313a5a632329fc963daf7ff276ccdd59ce7c5 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveCost.java 
> 41604cd0af68e7f90296fa271c42debc5aaf743a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveCostModel.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveDefaultCostModel.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveRelMdCost.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveAggregate.java
>  9a8a5da81b92c7c1f33d1af8072b1fb94e237290 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveFilter.java
>  3e45a3fbed3265b126a3ff9b6ffe44bee24453ef 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveJoin.java
>  e2b010b641d48ea1bf04750ddf5eb24fb3a7fcbe 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveLimit.java
>  5fc64f3e8c97fc8988bc35be39dbabf78dd7de24 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveProject.java
>  6c215c96190f0fcebe063b15c2763c49ebf1faaf 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveTableScan.java
>  f2c5408d913bfe2648c4e1e1e43b1bbc5f43a549 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCollation.java
>  4984683c3c8c6c0378a22e21fd6d961f3901f25c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistribution.java
>  f846dd19899af51194f3407ef913fcb9bcc24977 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java
>  dabbe280278dc80f00f0240a0c615fe6c7b8533a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java
>  95515b23e409d73d5c61e107931727add3f992a6 
> 
> Diff: https://reviews.apache.org/r/32370/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Jesús Camacho Rodríguez
> 
>



Review Request 32692: HIVE-10083 SMBJoin fails in case one table is uninitialized

2015-03-31 Thread Na Yang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32692/
---

Review request for hive, Brock Noland, Chao Sun, and Xuefu Zhang.


Bugs: 10083
https://issues.apache.org/jira/browse/10083


Repository: hive-git


Description
---

When one table is uninitialized, smallTblFileNames is an empty list, which 
causes an IndexOutOfBoundsException when smallTblFileNames.get(toAddSmallIndex) 
is called.
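
A hedged sketch of the guard (simplified and standalone; not the actual AbstractBucketJoinProc code): when one side of the join is uninitialized, the file list is empty, so index-based access must be protected.

```java
import java.util.Collections;
import java.util.List;

// Invented class/method names; only the empty-list scenario comes from
// the description above.
public class SmallTableFiles {
    static String fileAt(List<String> smallTblFileNames, int toAddSmallIndex) {
        if (smallTblFileNames.isEmpty()) {
            return null; // uninitialized table: nothing to add yet
        }
        return smallTblFileNames.get(toAddSmallIndex);
    }

    public static void main(String[] args) {
        // null instead of IndexOutOfBoundsException for the empty list
        System.out.println(fileAt(Collections.emptyList(), 0));
        System.out.println(fileAt(List.of("bucket_0"), 0)); // bucket_0
    }
}
```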


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java 
70c23a6 

Diff: https://reviews.apache.org/r/32692/diff/


Testing
---


Thanks,

Na Yang



Hive-0.14 - Build # 910 - Fixed

2015-03-31 Thread Apache Jenkins Server
Changes for Build #909

Changes for Build #910



No tests ran.

The Apache Jenkins build system has built Hive-0.14 (build #910)

Status: Fixed

Check console output at https://builds.apache.org/job/Hive-0.14/910/ to view 
the results.

Re: Review Request 32406: Add another level of explain for RDBMS audience

2015-03-31 Thread Ashutosh Chauhan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32406/#review78298
---


Overall looks good. Few minor code level comments.


common/src/java/org/apache/hadoop/hive/conf/HiveConf.java


nit : whitespace



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Attr.java


Can be overridden from o.a.h.h.common.ObjectPair with overriding equals?

Also, consider placing this class in o.a.h.h.common package since ql/ 
package is distributed to cluster and we want to minimize its size.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Op.java


will be good to add comment for this.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Op.java


Better to name it: connections. We may use it for other purposes later.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Stage.java


Add comments for this boolean.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Stage.java


Seems like this boolean is used to keep state of printer while visiting 
over plan.
If so, better design IMO is to keep this state with printer. This class 
should be state free.
Just a thought, you may know better.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Vertex.java


Can you add comments about this and next boolean.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Vertex.java


Similar comment about state being with printer.


- Ashutosh Chauhan


On March 26, 2015, 8:10 p.m., pengcheng xiong wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32406/
> ---
> 
> (Updated March 26, 2015, 8:10 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and John Pullokkaran.
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> Current Hive Explain (default) is targeted at MR Audience. We need a new 
> level of explain plan to be targeted at RDBMS audience. The explain requires 
> these:
> 1) The focus needs to be on what part of the query is being executed rather 
> than internals of the engines
> 2) There needs to be a clearly readable tree of operations
> 3) Examples - Table scan should mention the table being scanned, the Sarg, 
> the size of table and expected cardinality after the Sarg'ed read. The join 
> should mention the table being joined with and the join condition. The 
> aggregate should mention the columns in the group-by.
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java cf82e8b 
>   itests/src/test/resources/testconfiguration.properties 288270e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java 149f911 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Attr.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Connection.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Op.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Stage.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/TezJsonParser.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/explain/Vertex.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/merge/MergeFileWork.java e572338 
>   ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/stats/PartialScanWork.java 
> 095afd4 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/truncate/ColumnTruncateWork.java
>  092f627 
>   
> ql/src/java/org/apache/hadoop/hive/ql/parse/AlterTablePartMergeFilesDesc.java 
> eaf3dc4 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java 
> 38b6d96 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/AbstractOperatorDesc.java 
> 476dfd1 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/AlterDatabaseDesc.java e45bc26 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/AlterIndexDesc.java db2cf7f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/AlterTableDesc.java 24cf1da 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ArchiveWork.java 9fb5c8b 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 1737a34 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/BucketMapJoinContext.java 
> f436bc0 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/CollectDesc.java 588e14d 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java a44c8e8 
>   

Re: Review Request 32692: HIVE-10083 SMBJoin fails in case one table is uninitialized

2015-03-31 Thread Chao Sun

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32692/#review78380
---

Ship it!


Ship It!

- Chao Sun


On March 31, 2015, 5:01 p.m., Na Yang wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32692/
> ---
> 
> (Updated March 31, 2015, 5:01 p.m.)
> 
> 
> Review request for hive, Brock Noland, Chao Sun, and Xuefu Zhang.
> 
> 
> Bugs: 10083
> https://issues.apache.org/jira/browse/10083
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> When one table is uninitialized, smallTblFileNames is an empty list, which 
> causes an IndexOutOfBoundsException when 
> smallTblFileNames.get(toAddSmallIndex) is called.
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java 
> 70c23a6 
> 
> Diff: https://reviews.apache.org/r/32692/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Na Yang
> 
>



[jira] [Created] (HIVE-10167) HS2 logs the server started only before the server is shut down

2015-03-31 Thread Jimmy Xiang (JIRA)
Jimmy Xiang created HIVE-10167:
--

 Summary: HS2 logs the server started only before the server is 
shut down
 Key: HIVE-10167
 URL: https://issues.apache.org/jira/browse/HIVE-10167
 Project: Hive
  Issue Type: Bug
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial


TThreadPoolServer#serve() blocks till the server is down. We should log before 
that.
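
A minimal standalone sketch of the ordering fix (hypothetical names; an event list stands in for the log so the ordering is checkable): since serve() blocks until shutdown, the "started" message must be logged before calling it.

```java
import java.util.ArrayList;
import java.util.List;

public class ServerStartLogging {
    static final List<String> events = new ArrayList<>();

    interface BlockingServer { void serve(); } // stand-in for TThreadPoolServer

    static void start(BlockingServer server) {
        events.add("log-started"); // log BEFORE serve(), which blocks
        server.serve();
        events.add("log-stopped"); // only reached after shutdown
    }

    public static void main(String[] args) {
        start(() -> events.add("serving"));
        System.out.println(events); // [log-started, serving, log-stopped]
    }
}
```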





[jira] [Created] (HIVE-10168) make groupby3_map.q more stable

2015-03-31 Thread Alexander Pivovarov (JIRA)
Alexander Pivovarov created HIVE-10168:
--

 Summary: make groupby3_map.q more stable
 Key: HIVE-10168
 URL: https://issues.apache.org/jira/browse/HIVE-10168
 Project: Hive
  Issue Type: Bug
  Components: Tests
Reporter: Alexander Pivovarov
Assignee: Alexander Pivovarov


The test runs an aggregation query which produces several DOUBLE numbers.
The assertion framework compares output containing DOUBLE numbers without any 
delta. As a result, the test is not stable.

e.g. build 3219 failed with the following test result
{code}
groupby3_map.q.out
139c139
< 130091.0  260.182 256.10355987055016  98.00.0 
142.92680950752379  143.06995106518903  20428.0728759   
20469.010897795582
---
> 130091.0  260.182 256.10355987055016  98.00.0 
> 142.9268095075238   143.06995106518906  20428.072876
> 20469.01089779559
{code}

http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3219/testReport/junit/org.apache.hadoop.hive.cli/TestCliDriver/testCliDriver_groupby3_map/
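
A hedged sketch (plain Java, invented names) of a tolerance-based check that would accept both builds' outputs: the two values differ only in the last few digits, well within a small relative tolerance.

```java
public class DoubleDelta {
    // True when a and b agree to within a relative tolerance.
    static boolean nearlyEqual(double a, double b, double relTol) {
        return Math.abs(a - b) <= relTol * Math.max(Math.abs(a), Math.abs(b));
    }

    public static void main(String[] args) {
        double expected = 142.92680950752379; // value in the golden file
        double actual = 142.9268095075238;    // value from the failing build
        System.out.println(nearlyEqual(expected, actual, 1e-12)); // true
        System.out.println(nearlyEqual(1.0, 2.0, 1e-12));         // false
    }
}
```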





Re: Review Request 32637: HIVE-10128 LLAP: BytesBytesMultiHashMap does not allow concurrent read-only access

2015-03-31 Thread Sergey Shelukhin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32637/
---

(Updated March 31, 2015, 6:19 p.m.)


Review request for hive, Ashutosh Chauhan and Gopal V.


Repository: hive-git


Description
---

see jira


Diffs (updated)
-

  
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/BytesBytesMultiHashMap.java
 2312ccb 
  
ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java
 b323e8e 
  serde/src/java/org/apache/hadoop/hive/serde2/WriteBuffers.java f9ab964 

Diff: https://reviews.apache.org/r/32637/diff/


Testing
---


Thanks,

Sergey Shelukhin



Re: Hive 1.2

2015-03-31 Thread Thejas Nair
Xuefu,
Releases such as 1.1.1 are by convention bug fix releases, and only
include a selected set of important bug fixes that have been applied
to the release branch (branch-1.1 in this case).

In the 2 months since the creation of branch-1.0, we have around 211
jiras that have been committed in trunk (fix version 1.2.0). I would
expect around another 100 to go in, with the timeline proposed above.

https://issues.apache.org/jira/browse/HIVE-9727?jql=project%20%3D%20HIVE%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.2.0%20ORDER%20BY%20updatedDate%20DESC

Maintenance releases of branch-1.1 and earlier branches can continue
irrespective of a new release from trunk.



On Tue, Mar 31, 2015 at 7:05 AM, Xuefu Zhang  wrote:
> Hi Sushanth,
>
> Thanks for starting this. Rapid releasing is good. However, could you
> please provide a list of major changes or enhancement? I'd like to see
> whether we are in  position to release it as 1.1.1 or 1.2.0.
>
> Thanks,
> Xuefu
>
> On Mon, Mar 30, 2015 at 10:02 AM, Sushanth Sowmyan 
> wrote:
>
>> Hi Folks,
>>
>> Given that we landed a bunch of changes in trunk shooting for 1.2
>> after 1.1 was forked off, including some metastore changes, I propose
>> that we have a feature rollup release of hive that matches the state
>> of trunk sooner rather than later. For the timeline, I was thinking
>> that it'd be ideal to fork around Apr 18th(Friday), and try to get RCs
>> going within a couple of days, and a release by the end of
>> April/beginning of May.
>>
>> I would like to volunteer to perform the duties of a release manager
>> for this if there is enough appeal for this.
>>
>> Thanks
>>


[jira] [Created] (HIVE-10169) get metatool to work with hbase metastore

2015-03-31 Thread Thejas M Nair (JIRA)
Thejas M Nair created HIVE-10169:


 Summary: get metatool to work with hbase metastore
 Key: HIVE-10169
 URL: https://issues.apache.org/jira/browse/HIVE-10169
 Project: Hive
  Issue Type: Sub-task
Reporter: Thejas M Nair


The metatool is used for enabling namenode HA, and it uses ObjectStore 
directly.
There needs to be a way to support equivalent functionality with the hbase 
metastore.







Re: Hive 1.2

2015-03-31 Thread Xuefu Zhang
Okay. Makes sense.

Thanks,
Xuefu

On Tue, Mar 31, 2015 at 11:38 AM, Thejas Nair  wrote:

> Xuefu,
> Releases such as 1.1.1 are by convention bug fix releases, and only
> include a selected set of important bug fixes that have been applied
> to the release branch (branch-1.1 in this case).
>
> In the 2 months since the creation of branch-1.0, we have around 211
> jiras that have been committed in trunk (fix version 1.2.0). I would
> expect around another 100 to go in, with the timeline proposed above.
>
>
> https://issues.apache.org/jira/browse/HIVE-9727?jql=project%20%3D%20HIVE%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.2.0%20ORDER%20BY%20updatedDate%20DESC
>
> Maintenance releases of branch-1.1 and earlier branches can continue
> irrespective of a new release from trunk.
>
>
>
> On Tue, Mar 31, 2015 at 7:05 AM, Xuefu Zhang  wrote:
> > Hi Sushanth,
> >
> > Thanks for starting this. Rapid releasing is good. However, could you
> > please provide a list of major changes or enhancement? I'd like to see
> > whether we are in  position to release it as 1.1.1 or 1.2.0.
> >
> > Thanks,
> > Xuefu
> >
> > On Mon, Mar 30, 2015 at 10:02 AM, Sushanth Sowmyan 
> > wrote:
> >
> >> Hi Folks,
> >>
> >> Given that we landed a bunch of changes in trunk shooting for 1.2
> >> after 1.1 was forked off, including some metastore changes, I propose
> >> that we have a feature rollup release of hive that matches the state
> >> of trunk sooner rather than later. For the timeline, I was thinking
> >> that it'd be ideal to fork around Apr 18th(Friday), and try to get RCs
> >> going within a couple of days, and a release by the end of
> >> April/beginning of May.
> >>
> >> I would like to volunteer to perform the duties of a release manager
> >> for this if there is enough appeal for this.
> >>
> >> Thanks
> >>
>


Re: Review Request 32489: HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one

2015-03-31 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32489/
---

(Updated March 31, 2015, 8:08 p.m.)


Review request for hive and Jason Dere.


Changes
---

The function should support both the short date and the full timestamp string 
formats, and it should not skip the time part.
String length cannot be used to determine the format because the year might be 
less than 4 chars and the day and month can be just 1 char.

This is why I decided to use both Timestamp and Date converters to convert the 
input value to a java Date.
I also removed the fix I made before to GenericUDF which considered string 
length (str.length==10).

I added tests for dates without day, dates with partial time (no seconds) and 
dates with short year, month and day.

Now string Date parsing behaviour should be consistent with other UDFs (e.g. 
datediff).


Bugs: HIVE-9518
https://issues.apache.org/jira/browse/HIVE-9518


Repository: hive-git


Description
---

HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
2476e832b8b7101971ea2226368aa82633b7e7d1 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java 
ce981232382e993c7c9d640efe9b2d21f70a0ed4 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java 
PRE-CREATION 
  
ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFMonthsBetween.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_months_between.q PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 
22091d06241218a5c0ee21d6ee6be00a71706971 
  ql/src/test/results/clientpositive/udf_months_between.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/32489/diff/


Testing
---


Thanks,

Alexander Pivovarov



Re: Hive 1.2

2015-03-31 Thread Thejas Nair
+1
Thanks for volunteering Sushanth!


On Mon, Mar 30, 2015 at 10:02 AM, Sushanth Sowmyan  wrote:
> Hi Folks,
>
> Given that we landed a bunch of changes in trunk shooting for 1.2
> after 1.1 was forked off, including some metastore changes, I propose
> that we have a feature rollup release of hive that matches the state
> of trunk sooner rather than later. For the timeline, I was thinking
> that it'd be ideal to fork around Apr 18th(Friday), and try to get RCs
> going within a couple of days, and a release by the end of
> April/beginning of May.
>
> I would like to volunteer to perform the duties of a release manager
> for this if there is enough appeal for this.
>
> Thanks


Re: Review Request 32489: HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one

2015-03-31 Thread Mohit Sabharwal

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32489/#review78395
---



ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java


nit: cleaner to just say (31 * 24 * 60 * 60)


- Mohit Sabharwal


On March 31, 2015, 8:08 p.m., Alexander Pivovarov wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32489/
> ---
> 
> (Updated March 31, 2015, 8:08 p.m.)
> 
> 
> Review request for hive and Jason Dere.
> 
> 
> Bugs: HIVE-9518
> https://issues.apache.org/jira/browse/HIVE-9518
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
> 2476e832b8b7101971ea2226368aa82633b7e7d1 
>   ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java 
> ce981232382e993c7c9d640efe9b2d21f70a0ed4 
>   
> ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java
>  PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFMonthsBetween.java
>  PRE-CREATION 
>   ql/src/test/queries/clientpositive/udf_months_between.q PRE-CREATION 
>   ql/src/test/results/clientpositive/show_functions.q.out 
> 22091d06241218a5c0ee21d6ee6be00a71706971 
>   ql/src/test/results/clientpositive/udf_months_between.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/32489/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Alexander Pivovarov
> 
>



Re: Review Request 32489: HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one

2015-03-31 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32489/
---

(Updated March 31, 2015, 9:32 p.m.)


Review request for hive and Jason Dere.


Changes
---

patch #10 - added SEC_IN_31_DAYS constant for clarity


Bugs: HIVE-9518
https://issues.apache.org/jira/browse/HIVE-9518


Repository: hive-git


Description
---

HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one


Diffs (updated)
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 
2476e832b8b7101971ea2226368aa82633b7e7d1 
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.java 
ce981232382e993c7c9d640efe9b2d21f70a0ed4 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java 
PRE-CREATION 
  
ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFMonthsBetween.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_months_between.q PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 
22091d06241218a5c0ee21d6ee6be00a71706971 
  ql/src/test/results/clientpositive/udf_months_between.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/32489/diff/


Testing
---


Thanks,

Alexander Pivovarov



Re: Review Request 32489: HIVE-9518 Implement MONTHS_BETWEEN aligned with Oracle one

2015-03-31 Thread Alexander Pivovarov


> On March 31, 2015, 8:47 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java,
> >  line 131
> > 
> >
> > nit: cleaner to just say (31 * 24 * 60 * 60)

added SEC_IN_DAY and SEC_IN_31_DAYS constants for clarity
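A minimal sketch of the constants being discussed (field names follow the reply above; the exact declarations in the patch may differ):

```java
public class MonthsBetweenConstants {
    // Named constants in place of the magic number 2678400, per the review:
    // spelling the arithmetic out is clearer than a bare literal.
    static final int SEC_IN_DAY = 24 * 60 * 60;        // 86400 seconds
    static final int SEC_IN_31_DAYS = 31 * SEC_IN_DAY; // 2678400 seconds

    public static void main(String[] args) {
        System.out.println(SEC_IN_DAY);     // 86400
        System.out.println(SEC_IN_31_DAYS); // 2678400
    }
}
```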


- Alexander


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32489/#review78395
---




[jira] [Created] (HIVE-10170) LLAP: general cache deadlock avoidance

2015-03-31 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HIVE-10170:
---

 Summary: LLAP: general cache deadlock avoidance
 Key: HIVE-10170
 URL: https://issues.apache.org/jira/browse/HIVE-10170
 Project: Hive
  Issue Type: Sub-task
Reporter: Sergey Shelukhin


See HIVE-10092. Even with improved locking, it's good to have a cache bypass 
that prevents deadlocks where, under very high load, every reader has locked 
part of its split, cannot yet release anything, and cannot read any more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: ORC separate project

2015-03-31 Thread Owen O'Malley
All,

Moving this forward, I'll submit a resolution to the Apache board for the
next meeting.

One of the concerns that has been mentioned is how to deal with the
vectorization and SARG APIs. I'd like to propose that we pull the minimal
set of classes in a new Hive module named "storage-api". This module will
include VectorizedRowBatch, the various ColumnVector classes, and the SARG
classes. It will form the start of an API that high performance storage
formats can use to integrate with Hive. Both ORC and Parquet can use the
new API to support vectorization and SARGs without performance-destroying
shims. I'll create a JIRA to discuss the idea.

Thanks!
   Owen
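As a hedged, self-contained sketch of the kind of contract such a module could expose (stand-in types defined here for illustration; the real VectorizedRowBatch and ColumnVector classes in Hive differ in detail):

```java
public class StorageApiSketch {
    // Stand-in column vector holding a batch of values in a primitive array.
    static class LongColumnVector {
        final long[] vector;
        LongColumnVector(int maxRows) { vector = new long[maxRows]; }
    }

    // Stand-in row batch: a set of column vectors plus a fill count.
    static class VectorizedRowBatch {
        static final int DEFAULT_SIZE = 1024;
        final LongColumnVector[] cols;
        int size; // number of rows actually filled
        VectorizedRowBatch(int numCols) {
            cols = new LongColumnVector[numCols];
            for (int i = 0; i < numCols; i++) {
                cols[i] = new LongColumnVector(DEFAULT_SIZE);
            }
        }
    }

    // The contract a storage format would implement: fill a whole batch per
    // call, so the engine iterates primitive arrays instead of row objects.
    interface BatchReader {
        boolean nextBatch(VectorizedRowBatch batch);
    }

    static long sumFirstColumn(VectorizedRowBatch batch) {
        long sum = 0;
        for (int i = 0; i < batch.size; i++) sum += batch.cols[0].vector[i];
        return sum;
    }

    public static void main(String[] args) {
        VectorizedRowBatch batch = new VectorizedRowBatch(1);
        BatchReader toyReader = b -> { // toy reader emitting 0, 1, 2, 3
            for (int i = 0; i < 4; i++) b.cols[0].vector[i] = i;
            b.size = 4;
            return true;
        };
        toyReader.nextBatch(batch);
        System.out.println(sumFirstColumn(batch)); // 6
    }
}
```

The point of the batch-at-a-time contract is that neither ORC nor Parquet needs per-row adapter objects to feed the vectorized execution path.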


[jira] [Created] (HIVE-10171) Create a storage-api module

2015-03-31 Thread Owen O'Malley (JIRA)
Owen O'Malley created HIVE-10171:


 Summary: Create a storage-api module
 Key: HIVE-10171
 URL: https://issues.apache.org/jira/browse/HIVE-10171
 Project: Hive
  Issue Type: Bug
Reporter: Owen O'Malley
Assignee: Owen O'Malley


To support high performance file formats, I'd like to propose that we move the 
minimal set of classes that are required to integrate with Hive in to a new 
module named "storage-api". This module will include VectorizedRowBatch, the 
various ColumnVector classes, and the SARG classes. It will form the start of 
an API that high performance storage formats can use to integrate with Hive. 
Both ORC and Parquet can use the new API to support vectorization and SARGs 
without performance-destroying shims.





[jira] [Created] (HIVE-10172) Fix performance regression caused by HIVE-8122 for ORC

2015-03-31 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-10172:


 Summary: Fix performance regression caused by HIVE-8122 for ORC
 Key: HIVE-10172
 URL: https://issues.apache.org/jira/browse/HIVE-10172
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran


See HIVE-10164 for description. We should fix this in trunk and move it to 
branch.





Re: Review Request 32324: HIVE-10037 JDBC support for interval expressions

2015-03-31 Thread Thejas Nair

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32324/#review78462
---



jdbc/src/java/org/apache/hive/jdbc/JdbcColumn.java


can we re-use typeStringToHiveType here ?
I see void and null as the only things that need to be handled separately.


- Thejas Nair


On March 20, 2015, 11:48 p.m., Jason Dere wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32324/
> ---
> 
> (Updated March 20, 2015, 11:48 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Thejas Nair.
> 
> 
> Bugs: HIVE-10037
> https://issues.apache.org/jira/browse/HIVE-10037
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> There is no interval type in Jdbc, so year-month intervals and day-time 
> intervals both use java.sql.Types.OTHER as the Jdbc type.
> Also makes some changes in JdbcColumn/HiveResultSetMetaData to allow a more 
> accurate column precision/display size in the case of interval types, and to 
> allow users to get the HiveIntervalYearMonth/HiveIntervalDayTime values 
> when calling ResultSet.getObject().
> 
> 
> Diffs
> -
> 
>   itests/hive-unit/src/test/java/org/apache/hive/jdbc/TestJdbcDriver2.java 
> 2c85877 
>   jdbc/src/java/org/apache/hive/jdbc/HiveBaseResultSet.java cd1916f 
>   jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java 764a3f1 
>   jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java 3fcdd56 
>   jdbc/src/java/org/apache/hive/jdbc/JdbcColumn.java 4383f56 
>   metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 
> 2758eb0 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java 
> b9e15a1 
>   service/if/TCLIService.thrift 6f1a4ca 
>   service/src/java/org/apache/hive/service/cli/ColumnValue.java 9b48396 
>   service/src/java/org/apache/hive/service/cli/Type.java 92d237d 
> 
> Diff: https://reviews.apache.org/r/32324/diff/
> 
> 
> Testing
> ---
> 
> Added test to TestJdbcDriver2
> 
> 
> Thanks,
> 
> Jason Dere
> 
>



Request for feedback on work intent for non-equijoin support

2015-03-31 Thread Andres.Quiroz
Dear Hive development community members,

I am interested in learning more about the current support for non-equijoins in 
Hive and/or other Hadoop SQL engines, and in getting feedback about community 
interest in more extensive support for such a feature. I intend to work on this 
challenge, assuming people find it compelling, and I intend to contribute 
results to the community. Where possible, it would be great to receive feedback 
and engage in collaborations along the way (for a bit more context, see the 
postscript of this message).

My initial goal is to support query conditions such as the following:

A.x < B.y
A.x in_range [B.y, B.z]
distance(A.x, B.y) < D

where A and B are distinct tables/files. It is my understanding that current 
support for performing non-equijoins like those above is quite limited, and 
where some forms are supported (like in Cloudera's Impala), this support is 
based on doing a potentially expensive cross product join. Depending on the 
data types involved, I believe that joins with these conditions can be made to 
be tractable (at least on the average) with join algorithms that exploit 
properties of the data types, possibly with some pre-scanning of the data.

I am asking for feedback on the interest & need in the community for this work, 
as well as any pointers to similar work. In particular, I would appreciate any 
answers people could give on the following questions:

- Is my understanding of the state of the art in Hive and similar tools 
accurate? Are there groups currently working on similar or related issues, or 
tools that already accomplish some or all of what I have proposed?
- Is there significant value to the community in the support of such a feature? 
In other words, are the manual workarounds made necessary by the absence of 
non-equijoins such as these enough of a pain to justify the work I propose?
- Being aware that the potential pre-scanning adds to the cost of the join, and 
that data could still blow up in the worst case, am I missing any other 
important considerations and tradeoffs for this problem?
- What would be a good avenue to contribute this feature to the community (e.g. 
as a standalone tool on top of Hadoop, or as a Hive extension or plugin)?
- What is the best way to get started in working with the community?

Thanks for your attention and any info you can provide!

Andres Quiroz

P.S. If you are interested in some context, and why/how I am proposing to do 
this work, please read on.

I am part of a small project team at PARC working on the general problems of 
data integration and automated ETL. We have proposed a tool called HiperFuse 
that is designed to accept declarative, high-level queries in order to produce 
joined (fused) data sets from multiple heterogeneous raw data sources. In our 
preliminary work, which you can find here (pointer to the paper), we designed 
the architecture of the tool and obtained some results separately on the 
problems of automated data cleansing, data type inference, and query planning. 
One of the planned prototype implementations of HiperFuse relies on Hadoop MR, 
and because the declarative language we proposed was closely related to SQL, we 
thought that we could exploit the existing work in Hive and/or other 
open-source tools for handling the SQL part and layer our work on top of that. 
For example, the query given in the paper could easily be expressed in SQL-like 
form with a non-equijoin condition:

SELECT web_access_log.ip, census.income
FROM web_access_log, ip2zip, census
WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
AND ip2zip.zip = census.zip

As you can see, the first impasse that we hit in order to bring the elements 
together to solve this query end-to-end was the realization and performance of 
the non-equality join in the query. The intent now is to tackle this problem in 
a general sense and provide a solution for a wide range of queries.

The work I propose to do would be based on three main components within 
HiperFuse:

- Enhancements to the extensible data type framework in HiperFuse that would 
categorize data types based on the properties needed to support the join 
algorithms, in order to write join-ready domain-specific data type libraries.
- The join algorithms themselves, based on Hive or directly on Hadoop MR.
- A query planner, which would determine the right algorithm to apply and 
automatically schedule any necessary pre-scanning of the data.
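The in_range condition above can be made cheaper than a cross product when B's intervals are sorted. A hedged, standalone sketch of that core idea (hypothetical helper, not HiperFuse or Hive code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of evaluating A.x in_range [B.low, B.high] without a full cross
// product: sort B's intervals by low endpoint, then stop probing as soon as
// the low endpoint exceeds x.
public class RangeJoinSketch {
    static class Interval {
        final long low, high;
        final String payload;
        Interval(long low, long high, String payload) {
            this.low = low; this.high = high; this.payload = payload;
        }
    }

    // Assumes sortedByLow is sorted ascending by low.
    static List<String> matches(long x, List<Interval> sortedByLow) {
        List<String> out = new ArrayList<>();
        for (Interval iv : sortedByLow) {
            if (iv.low > x) break;            // no later interval can contain x
            if (x <= iv.high) out.add(iv.payload);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Interval> b = new ArrayList<>();
        b.add(new Interval(10, 19, "zip-B"));
        b.add(new Interval(0, 9, "zip-A"));
        b.add(new Interval(15, 30, "zip-C"));
        b.sort(Comparator.comparingLong(iv -> iv.low)); // the "pre-scan" step
        System.out.println(matches(17, b)); // [zip-B, zip-C]
    }
}
```

The one-time sort corresponds to the pre-scanning cost mentioned above; worst-case output can still blow up when many intervals overlap, which matches the tradeoff question raised in the message.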



[jira] [Created] (HIVE-10173) ThreadLocal synchronized initialValue() is irrelevant in JDK7

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10173:
--

 Summary: ThreadLocal synchronized initialValue() is irrelevant in 
JDK7
 Key: HIVE-10173
 URL: https://issues.apache.org/jira/browse/HIVE-10173
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.2.0
Reporter: Gopal V
Priority: Minor


The ThreadLocals need not synchronize the calls to initialValue(), since that 
is effectively going to be called once per thread in JDK7.

The anti-pattern lives on due to a very old JDK bug - 
https://bugs.openjdk.java.net/browse/JDK-6550283

{code}
$ git grep --name-only -c "protected.*synchronized.*initialValue"
common/src/java/org/apache/hadoop/hive/conf/LoopingByteArrayInputStream.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesInput.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesOutput.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesRecordInput.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesRecordOutput.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesWritableInput.java
contrib/src/java/org/apache/hadoop/hive/contrib/util/typedbytes/TypedBytesWritableOutput.java
metastore/src/java/org/apache/hadoop/hive/metastore/Deadline.java
metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java
ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
ql/src/java/org/apache/hadoop/hive/ql/session/OperationLog.java
serde/src/java/org/apache/hadoop/hive/serde2/io/TimestampWritable.java
serde/src/test/org/apache/hadoop/hive/serde2/io/TestTimestampWritable.java
service/src/java/org/apache/hive/service/auth/TSetIpAddressProcessor.java
service/src/java/org/apache/hive/service/cli/session/SessionManager.java
shims/common/src/main/java/org/apache/hadoop/hive/thrift/HadoopThriftAuthBridge.java
{code}
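For illustration, a minimal sketch of the pattern in question (field names here are hypothetical, not from the files listed above):

```java
// On JDK 7+, initialValue() runs at most once per thread for that thread's
// own copy, so the synchronized modifier (a workaround for the old
// JDK-6550283 bug) adds lock traffic for no benefit.
public class ThreadLocalInit {
    // Anti-pattern (correct but needlessly serialized):
    //   protected synchronized StringBuilder initialValue() { ... }

    // JDK 7 idiom: plain override, no synchronization.
    static final ThreadLocal<StringBuilder> BUF =
        new ThreadLocal<StringBuilder>() {
            @Override
            protected StringBuilder initialValue() {
                return new StringBuilder("per-thread");
            }
        };

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> System.out.println(BUF.get())); // per-thread
        t.start();
        t.join();
        System.out.println(BUF.get()); // per-thread (a distinct instance)
    }
}
```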





[jira] [Created] (HIVE-10174) LLAP: ORC MemoryManager is singleton synchronized

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10174:
--

 Summary: LLAP: ORC MemoryManager is singleton synchronized
 Key: HIVE-10174
 URL: https://issues.apache.org/jira/browse/HIVE-10174
 Project: Hive
  Issue Type: Sub-task
  Components: File Formats
Affects Versions: llap
Reporter: Gopal V
 Attachments: orc-memorymanager-1.png, orc-memorymanager-2.png

ORC MemoryManager::addedRow() checks are bad for LLAP multi-threaded 
performance.

!orc-memorymanager-1.png!
!orc-memorymanager-2.png!





[jira] [Created] (HIVE-10175) Tez DynamicPartitionPruning lacks a fast-path exit for large IN() queries

2015-03-31 Thread Gopal V (JIRA)
Gopal V created HIVE-10175:
--

 Summary: Tez DynamicPartitionPruning lacks a fast-path exit for 
large IN() queries
 Key: HIVE-10175
 URL: https://issues.apache.org/jira/browse/HIVE-10175
 Project: Hive
  Issue Type: Bug
  Components: Tez
Affects Versions: 1.2.0
Reporter: Gopal V
Priority: Minor


TezCompiler::runDynamicPartitionPruning() calls the graph walker even if all 
tables provided to the optimizer are unpartitioned temporary tables.

This makes it extremely slow as it will walk & inspect a large/complex 
FilterOperator later in the pipeline.
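The proposed fast path can be sketched as follows (names hypothetical, not the actual TezCompiler code):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: bail out of dynamic partition pruning before any graph walk when
// no input table is partitioned, instead of walking a possibly large/complex
// operator tree for nothing.
public class DppFastPath {
    static class Table {
        final boolean partitioned;
        Table(boolean partitioned) { this.partitioned = partitioned; }
    }

    static boolean needsPartitionPruning(List<Table> inputs) {
        for (Table t : inputs) {
            if (t.partitioned) return true; // at least one candidate to prune
        }
        return false; // all unpartitioned/temporary: skip the walker entirely
    }

    public static void main(String[] args) {
        List<Table> tempTables = Arrays.asList(new Table(false), new Table(false));
        System.out.println(needsPartitionPruning(tempTables)); // false
    }
}
```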





Review Request 32712: HIVE-10168 make groupby3_map.q more stable

2015-03-31 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32712/
---

Review request for hive.


Repository: hive-git


Description
---

HIVE-10168 make groupby3_map.q more stable


Diffs
-

  ql/src/test/queries/clientpositive/groupby3_map.q 
7ecc71dfab64abeaa0733a619faa9c8f65b166ab 
  ql/src/test/results/clientpositive/groupby3_map.q.out 
5cebc729ebc6f57b544353efae6cc38da21b56c4 

Diff: https://reviews.apache.org/r/32712/diff/


Testing
---


Thanks,

Alexander Pivovarov



Re: Review Request 32712: HIVE-10168 make groupby3_map.q more stable

2015-03-31 Thread Alexander Pivovarov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32712/
---

(Updated April 1, 2015, 4:19 a.m.)


Review request for hive and Xuefu Zhang.


Bugs: HIVE-10168
https://issues.apache.org/jira/browse/HIVE-10168


Repository: hive-git


Description
---

HIVE-10168 make groupby3_map.q more stable


Diffs
-

  ql/src/test/queries/clientpositive/groupby3_map.q 
7ecc71dfab64abeaa0733a619faa9c8f65b166ab 
  ql/src/test/results/clientpositive/groupby3_map.q.out 
5cebc729ebc6f57b544353efae6cc38da21b56c4 

Diff: https://reviews.apache.org/r/32712/diff/


Testing
---


Thanks,

Alexander Pivovarov



Re: Request for feedback on work intent for non-equijoin support

2015-03-31 Thread Lefty Leverenz
Hello Andres, the link to your paper is missing:

In our preliminary work, which you can find here (pointer to the paper) ...


You can find general information about contributing to Hive in the wiki: 
Resources for Contributors, How to Contribute.

-- Lefty

On Tue, Mar 31, 2015 at 10:42 PM,  wrote:

>  Dear Hive development community members,
>
>
>
> I am interested in learning more about the current support for
> non-equijoins in Hive and/or other Hadoop SQL engines, and in getting
> feedback about community interest in more extensive support for such a
> feature. I intend to work on this challenge, assuming people find it
> compelling, and I intend to contribute results to the community. Where
> possible, it would be great to receive feedback and engage in
> collaborations along the way (for a bit more context, see the postscript of
> this message).
>
>
>
> My initial goal is to support query conditions such as the following:
>
>
>
> A.x < B.y
>
> A.x in_range [B.y, B.z]
>
> distance(A.x, B.y) < D
>
>
>
> where A and B are distinct tables/files. It is my understanding that
> current support for performing non-equijoins like those above is quite
> limited, and where some forms are supported (like in Cloudera's Impala),
> this support is based on doing a potentially expensive cross product join.
> Depending on the data types involved, I believe that joins with these
> conditions can be made to be tractable (at least on the average) with join
> algorithms that exploit properties of the data types, possibly with some
> pre-scanning of the data.
>
>
>
> I am asking for feedback on the interest & need in the community for this
> work, as well as any pointers to similar work. In particular, I would
> appreciate any answers people could give on the following questions:
>
>
>
> - Is my understanding of the state of the art in Hive and similar tools
> accurate? Are there groups currently working on similar or related issues,
> or tools that already accomplish some or all of what I have proposed?
>
> - Is there significant value to the community in the support of such a
> feature? In other words, are the manual workarounds necessary because of
> the absence of non-equijoins such as these enough of a pain to justify the
> work I propose?
>
> - Being aware that the potential pre-scanning adds to the cost of the
> join, and that data could still blow-up in the worst case, am I missing any
> other important considerations and tradeoffs for this problem?
>
> - What would be a good avenue to contribute this feature to the community
> (e.g. as a standalone tool on top of Hadoop, or as a Hive extension or
> plugin)?
>
> - What is the best way to get started in working with the community?
>
>
>
> Thanks for your attention and any info you can provide!
>
>
>
> Andres Quiroz
>
>
>
> P.S. If you are interested in some context, and why/how I am proposing to
> do this work, please read on.
>
>
>
> I am part of a small project team at PARC working on the general problems
> of data integration and automated ETL. We have proposed a tool called
> HiperFuse that is designed to accept declarative, high-level queries in
> order to produce joined (fused) data sets from multiple heterogeneous raw
> data sources. In our preliminary work, which you can find here (pointer to
> the paper), we designed the architecture of the tool and obtained some
> results separately on the problems of automated data cleansing, data type
> inference, and query planning. One of the planned prototype implementations
> of HiperFuse relies on Hadoop MR, and because the declarative language we
> proposed was closely related to SQL, we thought that we could exploit the
> existing work in Hive and/or other open-source tools for handling the SQL
> part and layer our work on top of that. For example, the query given in the
> paper could easily be expressed in SQL-like form with a non-equijoin
> condition:
>
>
>
> SELECT web_access_log.ip, census.income
>
> FROM web_access_log, ip2zip, census
>
> WHERE web_access_log.ip in_range [ip2zip.ip_low, ip2zip.ip_high]
>
> AND ip2zip.zip = census.zip
>
>
>
> As you can see, the first impasse that we hit in order to bring the
> elements together to solve this query end-to-end was the realization and
> performance of the non-equality join in the query. The intent now is to
> tackle this problem in a general sense and provide a solution for a wide
> range of queries.
>
>
>
> The work I propose to do would be based on three main components within
> HiperFuse:
>
>
>
> - Enhancements to the extensible data type framework in HiperFuse that
> would categorize data types based on the properties needed to support the
> join algorithms, in order to write join-ready domain-specific data type
> libraries.
>
> - The join algorithms themselves, based on Hive or directly on Hadoop MR.
>
>

[jira] [Created] (HIVE-10176) skip.header.line.count causes values to be skipped when performing insert values

2015-03-31 Thread Wenbo Wang (JIRA)
Wenbo Wang created HIVE-10176:
-

 Summary: skip.header.line.count causes values to be skipped when 
performing insert values
 Key: HIVE-10176
 URL: https://issues.apache.org/jira/browse/HIVE-10176
 Project: Hive
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Wenbo Wang


When inserting values in to tables with TBLPROPERTIES 
("skip.header.line.count"="1") the first value listed is also skipped. 

create table test (row int, name string) TBLPROPERTIES 
("skip.header.line.count"="1"); 
load data local inpath '/root/data' into table test;
insert into table test values (1, 'a'), (2, 'b'), (3, 'c');

(1, 'a') isn't inserted into the table. 





[jira] [Created] (HIVE-10177) Enable constant folding for char & varchar

2015-03-31 Thread Ashutosh Chauhan (JIRA)
Ashutosh Chauhan created HIVE-10177:
---

 Summary: Enable constant folding for char & varchar
 Key: HIVE-10177
 URL: https://issues.apache.org/jira/browse/HIVE-10177
 Project: Hive
  Issue Type: Improvement
  Components: Logical Optimizer
Affects Versions: 1.0.0, 0.14.0, 1.1.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan








TIMESTAMP to DATE conversion for negative unix time

2015-03-31 Thread Alexander Pivovarov
Hi Everyone

I noticed interesting behaviour in timestamp-to-date Hive type conversion
for negative unix time.

For example:

select cast(cast('1966-01-01 00:00:01' as timestamp) as date);
1966-02-02

Should it work this way?

Another example
select last_day(cast('1966-01-31 00:00:01' as timestamp));
OK
1966-02-28


more details:
Date: 1966-01-01 00:00:01
unix time UTC: -126230399

daysSinceEpoch = -126230399000 / 86400000 = -1460.99998...
int daysSinceEpoch = -1460
DateWritable having daysSinceEpoch=-1460 is 1966-01-02


Re: TIMESTAMP to DATE conversion for negative unix time

2015-03-31 Thread Alexander Pivovarov
correction for the first example
select cast(cast('1966-01-01 00:00:01' as timestamp) as date);
1966-01-02
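The truncation described above can be reproduced directly (a small standalone sketch; DateWritable itself is not used here):

```java
public class NegativeEpochDays {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000; // 86400000

    public static void main(String[] args) {
        long millis = -126230399000L; // 1966-01-01 00:00:01 UTC
        // Java long division truncates toward zero, so -1460.99998... becomes
        // -1460, i.e. 1966-01-02: one day too late for negative unix times.
        System.out.println(millis / MS_PER_DAY);               // -1460
        // Flooring gives the expected day offset: -1461, i.e. 1966-01-01.
        System.out.println(Math.floorDiv(millis, MS_PER_DAY)); // -1461
    }
}
```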


On Tue, Mar 31, 2015 at 11:26 PM, Alexander Pivovarov wrote: