Hello,
Sorry for the late reply, but this problem is very interesting. How did you
end up solving it in the end?
I have an idea which is very ugly but might work:
Create a big view that is a union of all partitions:
SELECT
'2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN te
Hello!
If that might help, I did this repo a while ago:
https://github.com/FurcyPin/docker-hive-spark
It provides a pre-installed Hive Metastore and a HiveServer running on
Spark (Spark-SQL, not Hive on Spark).
I also did some config to access AWS S3 data with it.
Cheers,
Furcy
On Thu, 21 Feb 201
Hi Mich,
Contrary to what you said, I can confirm that BQ is able to read ORC
files compressed with Snappy.
However, BQ requires you to load the ORC file from
Google Storage and convert it into a BQ table.
The main advantage I see with BQ is the guaranteed very high sca
Hi,
I can't tell for sure where your problem is coming from, but from what you
said, I guess that the Hive Metastore is performing some list or scan
operation on the files
and that operation is taking a very long time.
Maybe setting *hive.stats.autogather* to false might help.
Also, beware that
Hi Chintan,
Yes, this sounds weird...
"REFRESH TABLES" is the kind of statement required by SQL engines such as
Impala, Presto or Spark-SQL, which cache metadata from the Metastore, but
vanilla Hive usually doesn't cache it and queries the metastore every time
(unless some new feature was added recentl
It might sound silly, but isn't that what Hive is supposed to do, being a
distributed computation framework and all?
Hive will write one file per reducer, called 0_0, 1_0, etc., where
the number corresponds to the ID of your reducer.
Sometimes the _0 will be a _1 or _2 or more depending
Hi Elliot,
From your description of the problem, I'm assuming that you are doing an
INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table
or something close, like a CREATE TABLE AS ... maybe.
If this is the case, I suspect that your shuffle phase comes from dynamic
partitioning, and in pa
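When the shuffle does come from dynamic partitioning, a common mitigation is to distribute rows by the partition columns so that each reducer only writes a few partition directories at once. A sketch (table and column names here are made up for illustration):

```sql
-- Hypothetical sketch: route all rows of a given partition to the same reducer.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE dest PARTITION (p1, p2)
SELECT col_a, col_b, p1, p2   -- partition columns must come last
FROM source
DISTRIBUTE BY p1, p2;         -- one reducer per (p1, p2) combination
```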
Hello,
To load your data as parquet, you can either:
A. use spark:
https://docs.databricks.com/spark/latest/data-sources/read-csv.html and
write it directly as a parquet file
(df.write.format("parquet").saveAsTable("parquet_table"))
B. Load it as a csv file in Hive, and perform a CREATE TABLE par
Hi,
They are strictly equivalent.
You can see this with the following queries:
CREATE TABLE t1 (a INT, b INT, c INT) ;
EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;
EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;
Both queries will return the exact same query plan:
Stage-0
Fetch Operator
op/hive/ql/udf/UDFType.html>
> )
>
> I think that I need something like 'eager evaluation' --> evaluate UDFs
> before build physical plan (if not you can't do partition pruning)
>
> On 15 May 2018 at 09:21, Furcy Pin wrote:
>
>> Hi Alberto,
>>
>
Hi Alberto,
If I'm not mistaken, to make sure that this works you need to give the
proper annotations in your UDF code (deterministic, and maybe some others).
You may also need to return a ConstantObjectInspector in the initialize
method so that Hive knows that it can perform partition pruning with it.
I suggest that you do a
DESCRIBE FORMATTED orc
and to check directly on s3 the location of your data.
Did you also try a SELECT * FROM orc LIMIT 10; ? Did it return something?
On Fri, 27 Apr 2018, 21:30 Sowjanya Kakarala, wrote:
>
>
>
>
>> Hi Guys,
>>
>> I am new to aws, I am not sure how t
Hi Tom,
High-level manipulation of arrays of structs is generally where SQL
reaches its limits in terms of expressivity, and where Spark starts to get my
preference. Unfortunately I don't know any public UDF that can do exactly
what you want, but coding one is not that hard.
If you use scala, s
Hi Oleksiy,
I must say that I don't know if partitioned HBase-backed tables are
supported in Hive, but I don't understand why you would need it. What are
you trying to do exactly? I suspect that you could do it by using composite
keys (Department, doc_id).
Also, I would advise against using mult
r for this. I also have used it to intercept and
> augment query AST.
>
>
>
> Also, I would echo others sentiment that its quite ugly. It would be great
> if we can refactor/standardize this. That will make integrating other
> system a lot easier.
>
>
>
> Thanks,
>
¯\_(ツ)_/¯
On 5 April 2018 at 16:02, Richard A. Bross wrote:
> I can't duplicate the issue now. Works like it always has . .
>
> - Original Message -
> From: "Furcy Pin"
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:57:39 AM
> Subject
61
> Bucket Columns: [crs_id]
> Sort Columns: []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.467 seconds, Fetched: 98 row(s)
>
>
> - Original Message -
> From: "Furcy Pin"
> To: user@hive.apache.org
&
Hi Richard,
could you please check if your table is EXTERNAL?
You can see it with a "DESCRIBE FORMATTED table_name ;"
That's what external tables are for, they don't delete underlying data when
you drop them.
On 5 April 2018 at 15:18, Richard A. Bross wrote:
> I think that someone put a file
Hi Mahender,
Did you look at this? https://www.snappydata.io/blog/the-spark-database
But I believe that most people handle this use case by either using:
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...)
if the data is not too big
- Their favorite New-SQL storage (Cassandr
be
>>> great if we can refactor/standardize this. That will make integrating
>>> other system a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin wrote:
>>>
>>>> Hi
> I got your point and thanks for the nice slides info.
>
>
> So the parquet filter is not an easy thing and I will try that according
> to the deck.
>
>
> Thanks !
> ------
> *From:* Furcy Pin
> *Sent:* Friday, February 23, 2018 3:37:52 AM
&
Hi,
Unless your table is partitioned or bucketed by myid, Hive generally
needs to read through all the records to find those that match
your predicate.
In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS tables or Vertica tables to
Hi Elliot,
Actually, I have done quite similar work regarding Hive custom Parsing, you
should have a look at my project: https://github.com/flaminem/flamy
The Hive parsing related stuff is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
A good st
Hi, I might be wrong, but I would expect views to work as a simple
replacement at the query level (like a subquery or a CTE).
This means that partition pruning should work as long as predicate pushdown
properly works.
For your specific use case, it should be easy enough to test, and you can
use an
You need to enable nonstrict mode to be able to use dynamic partitioning.
Dynamic partitioning is more automatic, but might take more time because
Hive has to write to multiple outputs at once.
With MapReduce, this uses MultipleOutputs, and can fail with OOM when
writing too many partitions at onc
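A dynamic-partition insert then typically looks like this (a sketch; the table and column names are assumptions for illustration):

```sql
-- Both settings are required for a fully dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive derives each row's target partition from the last SELECT column.
INSERT OVERWRITE TABLE sales PARTITION (ds)
SELECT product_id, amount, ds
FROM staging_sales;
```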
You can disable locking by setting
SET hive.support.concurrency=false ;
It can be used in a single session before a query.
However locking exists for a reason: what do you think will happen if you
overwrite files from a table while Tableau is reading them?
It really depends on your use case, but
Hi Scott,
I ran into similar issues before.
I think that you need to do two things to enable Hive access to s3:
1. add the following jars to your Hive classpath, for instance by using
symlinks
ln -s $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar
$HIVE_HOME/lib/.
ln -s $HADOOP_HOME
Hello Jinchul,
in SQL, any type (even complex types such as structs) can be NULL.
And this happens as often as (LEFT | RIGHT | FULL) JOINs are used:
when a record is not matched in the joined table, it will be NULL.
As far as I remember, directly creating NULL with complex types (for
testing pu
Hello,
As the table explains, it really depends on the number of concurrent
connections you will have on your HiveServer2,
but I confirm that you can experience OOM failures on either the
HiveServer2 or the Metastore if too many concurrent
queries are run simultaneously.
For larger clusters, as C
Hi,
I recommend you try using HiveServer2 and beeline as explained here:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHiveServer2andBeeline.1
In particular, please make sure you have run "bin/schematool -dbType derby
-initSchema" and that it returned succe
Hi Jacob,
Like Elliot and Alan said, cross-language exchanges seem to be handled
using Thrift, and there are some ODBC drivers too.
I didn't find a C library, even though there are Python, Ruby and NodeJS
libraries.
Also, when Googling, beware not to confuse HiveServer1 (sometimes referred
to as Thrift
Hi,
If you want to load pre-existing records, instead of inserting data in this
partition, you should use the ADD PARTITION statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
or simply the MSCK REPAIR TABLE statement
https://cwiki.apache.
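As a sketch (table and partition names here are made up), the two options look like this:

```sql
-- Register data already sitting in a partition directory, without rewriting it.
ALTER TABLE logs ADD PARTITION (ds='2017-08-01');

-- Or let Hive discover every partition directory present on HDFS/S3:
MSCK REPAIR TABLE logs;
```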
According to the documentation link you gave:
"A 'S' lock on table and relevant partition is acquired when a read is
being performed. For all other operations, an 'X' lock is taken on the
partition. However, if the change is only applicable to the newer
partitions, a 'S' lock is acquired on the ta
Hello,
Either one can receive the bigger row_num, in a nondeterministic fashion
(which is NOT equivalent to random).
Simply put, it will be whichever is processed last by Hive, which you have no
way of knowing.
If your two rows differ on other columns, you might want to add them to
your ORDER BY claus
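Adding tie-breaking columns makes the numbering deterministic. A sketch with assumed column names:

```sql
-- event_id acts as a deterministic tie-breaker when event_time is equal.
SELECT
  id,
  ROW_NUMBER() OVER (
    PARTITION BY user_id
    ORDER BY event_time, event_id
  ) AS row_num
FROM events;
```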
d do
> a bulk load instead of using single insertion for each record.
>
>
> Jorn and Furcy, thanks again for the explanations and suggestions!
>
>
> On Sep 12, 2017 4:53 AM, "Furcy Pin" wrote:
>
> Hi,
>
> this is a very common question, as many people
Hi,
this is a very common question, as many people who know SQL are used to
RDBMSs like MySQL, Oracle, or SQL Server.
The first thing you need to know about Hive is that it was not designed
to replace such databases, at least not when they are used for transaction
processing.
As far as I know, Spark can't read Hive's transactional tables yet:
https://issues.apache.org/jira/browse/SPARK-16996
On Thu, Aug 24, 2017 at 4:34 AM, Aviral Agarwal
wrote:
> So, there is no way possible right now for Spark to read Hive 2.x data ?
>
> On Thu, Aug 24, 2017 at 12:17 AM, Eugene
Hello,
You can use the split(string, pattern) UDF, which returns an array.
You can compute this array in a subquery, and then assign a[0], a[1], a[2],
... to each column.
The split UDF will only be called once per row.
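As a sketch (the table and column names below are assumptions), the pattern looks like this:

```sql
-- split() runs once per row in the subquery; the outer query just indexes
-- into the resulting array.
SELECT
  a[0] AS first_name,
  a[1] AS last_name,
  a[2] AS city
FROM (
  SELECT split(line, ',') AS a
  FROM raw_table
) t;
```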
On Thu, Aug 24, 2017 at 12:11 AM, Deepak Khandelwal <
dkhandelwal@gmail.co
with it.
On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:
> Hi,
>
> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
> It needs to be equated.
>
>
> regards,
> Rams
>
> On We
I would suggest using a subquery:
WITH unique_currency AS (
SELECT
CCY_CD,
MAX(CNTRY_DESC) as CNTRY_DESC
FROM CURRENCY
GROUP BY CCY_CD
)
and then perform your left join on it.
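Reusing the CTE above, the full query would look something like this (TXN and its join key are assumed names based on the thread):

```sql
WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) AS CNTRY_DESC  -- one arbitrary description per currency code
  FROM CURRENCY
  GROUP BY CCY_CD
)
SELECT t.*, c.CNTRY_DESC
FROM TXN t
LEFT JOIN unique_currency c
  ON t.CCY_CD = c.CCY_CD;
```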
Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a
Hi Giles,
Did you try running the query with the EXPLAIN AUTHORIZATION clause?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain#LanguageManualExplain-TheAUTHORIZATIONClause
On Wed, Aug 16, 2017 at 3:08 AM, Bear Giles wrote:
> Hi, an issue came up during recent AT and I'm
re to star the repository and to spread the word!
Thanks,
Furcy Pin
CTO @ Flaminem
Another option would be to try Facebook's Presto https://prestodb.io/
Like Impala, Presto is designed for fast interactive querying over Hive
tables, but it is also capable of querying data from many other SQL sources
(mySQL, postgreSQL, Kafka, Cassandra, ...
https://prestodb.io/docs/current/conne
2017 at 10:07 AM, Furcy Pin
> wrote:
>
>> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
>>
>>
>> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar > > wrote:
>>
>>> This is interesting and possibly a bug.
Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar
wrote:
> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our
Hi Dmitry,
I believe what you are referring to is similar to what the MSCK REPAIR
TABLE command does in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Simply put: if you define a partitioned Hive table via a CREATE st
Hi Dmitry,
If I understand what you said correctly:
At the beginning you have csv files on hdfs,
and at the end you want a partitioned Hive table as parquet.
And your question is: "can I do this using only one Hive table and a LOAD
statement?"
The answer to that question is "no".
The correct
Hello Jan,
I guess this is because you wrote ${hivevar=PNAME} with an = instead of a :
regards,
Furcy
On Tue, Mar 21, 2017 at 5:50 PM, Brotanek, Jan
wrote:
> Hi, I have following shell script:
>
>
>
> #!/bin/bash
>
> PNAME=20170320_203225
>
> TNAME=tablea1
>
> /usr/bin/hive -hiveconf hive.cli
Actually, there is a legacy feature in Hive that would do exactly what you
need:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
Another way would simply to use dynamic partitioning :
INSERT INTO TABLE employee_orc PARTITION *
Hi Joaquin,
Suresh was faster than me ...
Also, you should check this :
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert
On Tue, Dec 13, 2016 at 3:19 PM, Suresh Kumar Sethuramaswamy <
rock...@gmail.com> wrote:
> Hi Joaquin
>
> In hive , when u run '
Hi Mahender,
it's hard to say what happens without seeing the actual query.
Hive has several ways to perform joins. There is a complete description of
how it does it here:
https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization
Sadly, the illustrations are broken.
There is also this
Hi,
you should replace
WITH table AS (subquery)
SELECT ...
FROM table
with
SELECT ...
FROM(
subquery
) table
Regards.
On Thu, Dec 1, 2016 at 12:32 PM, Priyanka Raghuvanshi
wrote:
> Hi All
>
>
> Getting error 'Table not found in the definition of view ' if a view is
> created using 'WITH'
ClassNotFoundException generally means that jars are missing from your
class path.
You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
On Tue, Nov 29, 2016 at 2:03 AM
>
>
>
> On 25 October 2016 at 13:44, Fur
Hi Mich,
I mostly agree with you, but I would comment on the part about using HBase
as a maintenance free core product:
I would say that most medium-sized companies using Hadoop rely on Hortonworks
or Cloudera, which both provide a pre-packaged HBase installation. It would
probably make sense for them to
Hi Mich,
the umbrella JIRA for this gives a few reasons:
https://issues.apache.org/jira/browse/HIVE-9452
(with even more details in the attached pdf
https://issues.apache.org/jira/secure/attachment/12697601/HBaseMetastoreApproach.pdf
)
In my experience, Hive tables with a lot of partitions (> 10 0
Hi Rakesh,
this is an interesting question,
Did you look at the cube and rollup possibilities?
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
I believe that you could use this feature to merge your table 1 and table 2
in one single table with a
Hi Nitin,
I confirm that there is something odd here.
I did the following test :
create table test_orc (id int, name string, dept string) stored as ORC;
insert into table test_orc values (1, 'abc', 'xyz');
insert into table test_orc values (2, 'def', 'xyz');
insert into table test_orc values (3,
Hi Elliot,
I guess you can use IMetaStoreClient.listPartitionNames instead, and then
use IMetaStoreClient.getPartition for each partition.
This might be slow though, as you will have to make 10 000 calls to get
them.
Another option I'd consider is connecting directly to the Hive metastore.
This
I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to
work.
Since MR is deprecated in 2.0, I assume we should not expect any future
perf optimisation on this side.
It has been implemented for Hive on Spark, though.
https://issues.apache.org/jira/browse/HIVE-9152
On Tue, Aug
Hi Abhishek,
Yes, it can happen.
The only such scenarios I can think of are when you use a WHERE clause with
a non-constant predicate.
As far as I know, partition pruning only works on constant predicates, because
Hive has to evaluate them *before* starting the query in order to prune the
partitions.
For instanc
Hi,
I think I had similar issues to yours.
Did you look in the Hive documentation at what the CASCADE keyword does on
ADD or CHANGE COLUMNS statements?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterColumn
From what I understand, the behavior of Hive
Hi,
This is feasible.
I already did it in some tool I developed.
Unfortunately I can't open-source it yet, but I can tell you how I did it.
There is this class in Hive source code
https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/ql/src/java/org/apache/hadoop/hive/ql
Hi Hive users,
I would like to pursue the discussion that happened during the design of
the feature:
https://issues.apache.org/jira/browse/HIVE-6167
Some concerns were raised back then, and I think that maybe now that it has
been implemented, some user feedback could bring grist to the mill.
Ev
Maybe you are compiling and running against different versions of Spark?
On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh wrote:
> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.
Hi Vijaya,
If you need some nice ETL capabilities, you may want to try
https://github.com/databricks/spark-csv
Among other things, spark-csv lets you read the csv as-is, and create and
insert a copy of the
data into a Hive table with any format you like (Parquet, ORC, etc.)
If you have a header ro
SHARED_READ
>> NULL1445586247044 1445585940654 hduser rhes564
>>
>> 14030 asehadoop t NULLWAITING EXCLUSIVE NULL
>> 1445586471827 NULLhduser rhes564
>>
>>
>>
>> I am not sure this is behaving like classic RDB
Hi Mich,
I believe the duration of locks is defined by hive.txn.timeout, which is 5
min by default.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
Retry your SHOW LOCKS command and check that the Last HeartBeat is not
changing.
If it is, it means your query is still act
Hi folks,
I would like to start a discussion with the Hive user and developer
community about an element of syntax present in SQL Server that could be
nice to have in Hive.
Back in 2012, before I started using Hive and was still using SQL Server, I
came across this post:
http://sqlblog.com/blogs/aaron
I just created it:
HIVE-11933 <https://issues.apache.org/jira/browse/HIVE-11933>
On Mon, Sep 21, 2015 at 6:37 PM, Alan Gates wrote:
> I am not aware of this issue. Please file a JIRA, and if it does turn out
> to be a duplicate we can mark it as such.
>
> A
Hi,
We bumped into a bug when using vectorization on a transactional table.
Here is a minimal example :
DROP TABLE IF EXISTS vectorization_transactional_test ;
CREATE TABLE vectorization_transactional_test (
id INT
)
CLUSTERED BY (id) into 3 buckets
STORED AS ORC
TBLPROPERTIES('transactional
er if
> used as <> ,=, IN, NOT IN.
> IS (NOT) NULL is the right way to handle such cases. COALESCE is some
> alternative too.
>
> Thank you,
> Kind Regards
> ~Maciek
>
> On Tue, Jul 7, 2015 at 11:46 AM, Furcy Pin wrote:
>
>> Hi folks,
>>
>
Hi folks,
just to let my fellow Hive users know that we found a bug with subqueries in
WHERE clauses and created a JIRA for it:
https://issues.apache.org/jira/browse/HIVE-11192
The latest version seems to be affected.
Regards,
Furcy Pin
Hi Sreeman,
Unfortunately, I don't think that Hive's built-in formats can currently read
csv files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy csv files myself,
I would recommend writing a MapReduce (or Spark) job
for cleaning your csv before g
Hi Murali,
As you can see from the source code of Hive's hash UDF:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java
https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector
by*
ACCOUNTID;
I am still using 0.12 though, maybe later versions changed some things.
Hope this helps,
Furcy Pin
PS:
On the other hand, Hive has the opposite behavior in GROUP BY clauses, e.g.
this will not work:
SELECT concat('test_', ACCOUNTID) AS tempColumn, COUNT(1) FROM table
G
Hi Thimut,
I believe that the UDF unix_timestamp uses the java class SimpleDateFormat.
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
From the doc, you can see that "m" denotes a minute while "M" denotes a
month.
For your problem, "-MMM-dd" should do the trick.
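The exact pattern in the message above is truncated, so the input format in the sketch below is an assumption; pattern letters follow SimpleDateFormat:

```sql
-- 'MMM' parses an abbreviated month name; lowercase 'mm' would mean minutes.
SELECT unix_timestamp('2015-Mar-08', 'yyyy-MMM-dd');
```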
Reg
Hi,
what you are trying to do looks very much like what the LAG windowing
function does.
If your version of Hive is 0.11 or higher, I suggest trying it.
The hive doc for windowing function is here (but is quite poor):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnal
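A minimal sketch of LAG (table and column names are made up): each row is compared to the previous row within the same user's timeline.

```sql
SELECT
  user_id,
  event_time,
  value,
  -- value of the previous row for this user, NULL on the first row
  LAG(value) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_value
FROM measurements;
```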
yoException:
> Unable to find class: fun_name*
> Serialization trace:
> genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
> conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
> childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
> childOperators (org.apache.hadoop.
Hi,
I think I encountered this kind of serialization problem when writing UDFs.
Usually, marking every field of the UDF as *transient* does the trick.
I guess the error means that Kryo tries to serialize the UDF class and
everything that is inside, and by marking them as transient
you ensure tha
One clarification:
A.Z*.A" with Z = not A matched against "ABABAB" will match ABA twice, but
will not match ABABA
2014-09-05 16:48 GMT+02:00 Furcy Pin :
> Hi all,
>
> I've just spent some time trying to understand how the 'regex' syntax
> worked for matc
asi <
> asif.abb...@gmail.com> wrote:
>
>> Hi Furcy,
>>
>> Many thanks for your email :)
>>
>> My latest info was that the rename took place due to objections by
>> Teradata, but didn't know if they had actually requested to take it off the
>>
Hi Muhammad,
From what I've googled a few months ago on the subject, the MatchPath UDF has
been removed from Cloudera and Hortonworks releases because Teradata
claims it violates one of their patents (apparently renaming it did not
suffice).
I guess that if you really need it, it might be possible to
s://issues.apache.org/jira/browse/HIVE-7143
> 5. https://issues.apache.org/jira/browse/HIVE-7344
>
> Hope it helps,
> Ashutosh
>
>
> On Wed, Aug 20, 2014 at 5:34 AM, Furcy Pin wrote:
>
>> Hi all,
>>
>> I have an event table with (user_id, timestamp, event)
Hi all,
I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.
My query goes like this :
SELECT user_id, event
FROM
(
SELECT
user_id,
event,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
FROM eventTable
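The query above is truncated in the archive; the usual shape of this pattern (the outer filter is my assumption, based on the stated goal of keeping the first 10 events per user) would be:

```sql
SELECT user_id, event
FROM (
  SELECT
    user_id,
    event,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) AS rownum
  FROM eventTable
) t
WHERE rownum <= 10;  -- keep only the first 10 events of each user
```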
Hi all,
I just wanted to point out a little gotcha we got while using the
collect_set UDF :
You should not perform a GROUP BY directly over a collect_set(...), because
the set is cast as an array and is not necessarily sorted.
For example, we ran a query looking like this;
SELECT
set,
COUNT(1)
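A common workaround (a sketch; sort_array availability depends on your Hive version, and the table/column names are assumptions) is to sort the array before grouping on it:

```sql
-- Sorting normalizes the array so that equal sets compare equal.
SELECT
  sort_array(s) AS sorted_set,
  COUNT(1) AS cnt
FROM (
  SELECT collect_set(item) AS s
  FROM events
  GROUP BY user_id
) t
GROUP BY sort_array(s);
```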
think?
2014-06-30 9:15 GMT+02:00 Furcy Pin :
> Hi all,
>
>
> Does anyone know what is the expected behavior on a query like this :
>
> FROM source
> INSERT OVERWRITE TABLE dest
> SELECT *
> INSERT INTO TABLE dest
> SELECT *
>
> is it the same as
>
>
Hi all,
Does anyone know what is the expected behavior on a query like this :
FROM source
INSERT OVERWRITE TABLE dest
SELECT *
INSERT INTO TABLE dest
SELECT *
is it the same as
FROM source
INSERT INTO TABLE dest
SELECT *
INSERT OVERWRITE TABLE dest
SELECT *
?
I'm asking because I ran a query
y equality join because you're using the partitioning step
> in order to enforce the join predicate instead of actually evaluating it.
> This could be much more efficient in some cases, so I was wondering if Hive
> can do it and if so, how can I trigger it?
>
> Cheers,
> Mar
Hi Mark,
I'm not sure I correctly understand what you're trying to do: do you know
the reference id on which you want to do the join beforehand?
Or is one of your tables small?
Or are they all big with a small intersection?
If you haven't yet, I would suggest having a look at MapJoin:
https:
ent from my iPhone
>
> On Apr 23, 2014, at 2:51 AM, Furcy Pin wrote:
>
> Hi,
>
> note that if your table contains the max value several time, all the
> occurences will be returned. Note also that if it contains a null it will
> be returned too.
>
> How about this?
Hi,
note that if your table contains the max value several times, all the
occurrences will be returned. Note also that if it contains a NULL, it will
be returned too.
How about this?
SELECT score FROM student ORDER BY score DESC LIMIT 1 ;
Note that on this query Impala is inconsistent with Hive or