Hello,
Sorry for the late reply, but this problem is very interesting. How did you
end up solving it in the end?
I have an idea which is very ugly but might work:
Create a big view that is a union of all partitions:
SELECT
'2019-10-01' as ds, *
FROM test_1 a
JOIN test_2 b ON a.id = b.id
JOIN te
Hello!
If that might help, I did this repo a while ago:
https://github.com/FurcyPin/docker-hive-spark
It provides a pre-installed Hive Metastore and a HiveServer running on
Spark (Spark-SQL, not Hive on Spark).
I also did some config to access AWS S3 data with it.
Cheers,
Furcy
On Thu, 21 Feb 201
Hi Mich,
Contrary to what you said, I can confirm that BQ is able to read ORC
files compressed with Snappy.
However, BQ requires you to load the ORC file from
Google Storage and convert it into a BQ table.
The main advantage I see with BQ is the guaranteed very high sca
Hi,
I can't tell for sure where your problem is coming from, but from what you
said, I guess that the Hive Metastore is performing some list or scan
operation on the files
and that operation is taking a very long time.
Maybe setting *hive.stats.autogather* to false might help.
Also, beware that
Hi Chintan,
Yes, this sounds weird...
"REFRESH TABLES" is the kind of statement required by SQL engines such as
Impala, Presto or Spark-SQL, which cache metadata from the Metastore, but
vanilla Hive usually doesn't cache it and queries the metastore every time
(unless some new feature was added recentl
It might sound silly, but isn't that what Hive is supposed to do, being a
distributed computation framework and all?
Hive will write one file per reducer, called 0_0, 1_0, etc., where
the number corresponds to the ID of your reducer.
Sometimes the _0 will be a _1 or _2 or more depending
Hi Elliot,
From your description of the problem, I'm assuming that you are doing an
INSERT OVERWRITE table PARTITION(p1, p2) SELECT * FROM table
or something close, like a CREATE TABLE AS ... maybe.
If this is the case, I suspect that your shuffle phase comes from dynamic
partitioning, and in pa
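When the shuffle does come from dynamic partitioning, a common mitigation is to distribute rows by the partition columns so that each reducer only writes a few partition directories at once. A sketch (table and column names here are made up for illustration):

```sql
-- Hypothetical sketch: route all rows of a given partition to the same reducer.
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE dest PARTITION (p1, p2)
SELECT col_a, col_b, p1, p2   -- partition columns must come last
FROM source
DISTRIBUTE BY p1, p2;         -- one reducer per (p1, p2) combination
```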
Hello,
To load your data as parquet, you can either:
A. use spark:
https://docs.databricks.com/spark/latest/data-sources/read-csv.html and
write it directly as a parquet file
(df.write.format("parquet").saveAsTable("parquet_table"))
B. Load it as a csv file in Hive, and perform a CREATE TABLE par
Hi,
They are strictly equivalent.
You can see this with the following queries:
CREATE TABLE t1 (a INT, b INT, c INT) ;
EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;
EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;
Both queries will return the exact same query plan:
Stage-0
Fetch Operator
op/hive/ql/udf/UDFType.html>
> )
>
> I think that I need something like 'eager evaluation' --> evaluate UDFs
> before build physical plan (if not you can't do partition pruning)
>
> On 15 May 2018 at 09:21, Furcy Pin wrote:
>
>> Hi Alberto,
>>
>
Hi Alberto,
If I'm not mistaken, to make sure that this works you need to give the
proper annotations in your UDF code (deterministic, and maybe some others).
You may also need to return a ConstantObjectInspector in the initialize
method so that Hive knows that it can perform partition pruning with it.
I suggest that you do a
DESCRIBE FORMATTED orc
and to check directly on s3 the location of your data.
Did you also try a SELECT * FROM orc LIMIT 10; ? Did it return something?
On Fri, 27 Apr 2018, 21:30 Sowjanya Kakarala, wrote:
>
>
>
>
>> Hi Guys,
>>
>> I am new to aws, I am not sure how t
Hi Tom,
High-level manipulation of arrays of structs is generally where SQL
reaches its limits in terms of expressivity, and where Spark starts to get my
preference. Unfortunately I don't know any public UDF that can do exactly
what you want, but coding one is not that hard.
If you use scala, s
Hi Oleksiy,
I must say that I don't know if partitioned HBase-backed tables are
supported in Hive, but I don't understand why you would need it. What are
you trying to do exactly? I suspect that you could do it by using composite
keys (Department, doc_id).
Also, I would advise against using mult
r for this. I also have used it to intercept and
> augment query AST.
>
>
>
> Also, I would echo others sentiment that its quite ugly. It would be great
> if we can refactor/standardize this. That will make integrating other
> system a lot easier.
>
>
>
> Thanks,
>
¯\_(ツ)_/¯
On 5 April 2018 at 16:02, Richard A. Bross wrote:
> I can't duplicate the issue now. Works like it always has . .
>
> - Original Message -
> From: "Furcy Pin"
> To: user@hive.apache.org
> Sent: Thursday, April 5, 2018 9:57:39 AM
> Subject
61
> Bucket Columns: [crs_id]
> Sort Columns: []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.467 seconds, Fetched: 98 row(s)
>
>
> - Original Message -
> From: "Furcy Pin"
> To: user@hive.apache.org
&
Hi Richard,
could you please check if your table is EXTERNAL?
You can see it with a "DESCRIBE FORMATTED table_name ;"
That's what external tables are for, they don't delete underlying data when
you drop them.
On 5 April 2018 at 15:18, Richard A. Bross wrote:
> I think that someone put a file
Hi Mahender,
Did you look at this? https://www.snappydata.io/blog/the-spark-database
But I believe that most people handle this use case by either using:
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...)
if the data is not too big
- Their favorite New-SQL storage (Cassandr
be
>>> great if we can refactor/standardize this. That will make integrating
>>> other system a lot easier.
>>>
>>> Thanks,
>>> Dharmesh
>>>
>>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin wrote:
>>>
>>>> Hi
> I got your point and thanks for the nice slides info.
>
>
> So the parquet filter is not an easy thing and I will try that according
> to the deck.
>
>
> Thanks !
> ------
> *From:* Furcy Pin
> *Sent:* Friday, February 23, 2018 3:37:52 AM
&
Hi,
Unless your table is partitioned or bucketed by myid, Hive generally
needs to read through all the records to find those that match
your predicate.
In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS tables or Vertica tables to
Hi Elliot,
Actually, I have done quite similar work regarding Hive custom Parsing, you
should have a look at my project: https://github.com/flaminem/flamy
The Hive parsing related stuff is here:
https://github.com/flaminem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
A good st
Hi, I might be wrong, but I would expect views to work as a simple
replacement at the query level (like a subquery or a CTE).
This means that partition pruning should work as long as predicate pushdown
properly works.
For your specific use case, it should be easy enough to test, and you can
use an
You need to enable nonstrict mode to be able to use dynamic partitioning.
Dynamic partitioning is more automatic, but might take more time because
Hive has to write to multiple outputs at once.
With MapReduce, this uses MultipleOutputs, and can fail with OOM when
writing too many partitions at onc
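A dynamic-partition insert then typically looks like this (a sketch; the table and column names are assumptions for illustration):

```sql
-- Both settings are required for a fully dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive derives each row's target partition from the last SELECT column.
INSERT OVERWRITE TABLE sales PARTITION (ds)
SELECT product_id, amount, ds
FROM staging_sales;
```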
You can disable locking by setting
SET hive.support.concurrency=false ;
It can be used in a single session before a query.
However locking exists for a reason: what do you think will happen if you
overwrite files from a table while Tableau is reading them?
It really depends on your use case, but
Hi Scott,
I ran into similar issues before.
I think that you need to do two things to enable Hive access to s3:
1. add the following jars to your Hive classpath, for instance by using
symlinks
ln -s $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar
$HIVE_HOME/lib/.
ln -s $HADOOP_HOME
Hello Jinchul,
in SQL, any type (even complex types such as structs) can be NULL.
And this happens as often as (LEFT | RIGHT | FULL) JOINs are used:
when a record is not matched in the joined table, it will be NULL.
As far as I remember, directly creating NULL with complex types (for
testing pu
Hello,
As the table explains, it really depends on the number of concurrent
connections you will have on your HiveServer2,
but I confirm that you can experience OOM failures on either the
HiveServer2 or the Metastore if too many concurrent
queries are run simultaneously.
For larger clusters, as C
Hi,
I recommend you try using HiveServer2 and beeline as explained here:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-RunningHiveServer2andBeeline.1
In particular, please make sure you have run "bin/schematool -dbType derby
-initSchema" and that it returned succe
Hi Jacob,
Like Elliot and Alan said, cross-language exchanges seem to be handled
using Thrift, and there are some ODBC drivers too.
I didn't find a C library, even though there are Python, Ruby and NodeJS
libraries.
Also, when Googling, beware not to confuse HiveServer1 (sometimes referred
to as Thrift
Hi,
If you want to load pre-existing records, instead of inserting data in this
partition, you should use the ADD PARTITION statement
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
or simply the MSCK REPAIR TABLE statement
https://cwiki.apache.
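As a sketch (table and partition names here are made up), the two options look like this:

```sql
-- Register data already sitting in a partition directory, without rewriting it.
ALTER TABLE logs ADD PARTITION (ds='2017-08-01');

-- Or let Hive discover every partition directory present on HDFS/S3:
MSCK REPAIR TABLE logs;
```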
According to the documentation link you gave:
"A 'S' lock on table and relevant partition is acquired when a read is
being performed. For all other operations, an 'X' lock is taken on the
partition. However, if the change is only applicable to the newer
partitions, a 'S' lock is acquired on the ta
Hello,
Either one can receive the bigger row_num, in a nondeterministic fashion
(which is NOT equivalent to random).
Simply put, it will be whichever is processed last by Hive, which you have no
way of knowing.
If your two rows differ on other columns, you might want to add them to
your ORDER BY claus
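Adding tie-breaking columns makes the numbering deterministic. A sketch with assumed column names:

```sql
-- event_id acts as a deterministic tie-breaker when event_time is equal.
SELECT
  id,
  ROW_NUMBER() OVER (
    PARTITION BY user_id
    ORDER BY event_time, event_id
  ) AS row_num
FROM events;
```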
d do
> a bulk load instead of using single insertion for each record.
>
>
> Jorn and Furcy, thanks again for the explanations and suggestions!
>
>
> On Sep 12, 2017 4:53 AM, "Furcy Pin" wrote:
>
> Hi,
>
> this is a very common question, as many people
Hi,
this is a very common question, as many people who know SQL are used to
RDBMSs like MySQL, Oracle, or SQL Server.
The first thing you need to know about Hive is that it was not designed
to replace such databases, at least not when they are used for transaction
processing.
As far as I know, Spark can't read Hive's transactional tables yet:
https://issues.apache.org/jira/browse/SPARK-16996
On Thu, Aug 24, 2017 at 4:34 AM, Aviral Agarwal
wrote:
> So, there is no way possible right now for Spark to read Hive 2.x data ?
>
> On Thu, Aug 24, 2017 at 12:17 AM, Eugene
Hello,
You can use the split(string, pattern) UDF, which returns an array.
You can compute this array in a subquery, and then assign a[0], a[1], a[2],
... to each column.
The split UDF will only be called once per row.
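As a sketch (the table and column names below are assumptions), the pattern looks like this:

```sql
-- split() runs once per row in the subquery; the outer query just indexes
-- into the resulting array.
SELECT
  a[0] AS first_name,
  a[1] AS last_name,
  a[2] AS city
FROM (
  SELECT split(line, ',') AS a
  FROM raw_table
) t;
```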
On Thu, Aug 24, 2017 at 12:11 AM, Deepak Khandelwal <
dkhandelwal@gmail.co
with it.
On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
ramasubramanian.naraya...@gmail.com> wrote:
> Hi,
>
> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
> It needs to be equated.
>
>
> regards,
> Rams
>
> On We
I would suggest using a subquery:
WITH unique_currency AS (
SELECT
CCY_CD,
MAX(CNTRY_DESC) as CNTRY_DESC
FROM CURRENCY
GROUP BY CCY_CD
)
and then perform your left join on it.
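Reusing the CTE above, the full query would look something like this (TXN and its join key are assumed names based on the thread):

```sql
WITH unique_currency AS (
  SELECT
    CCY_CD,
    MAX(CNTRY_DESC) AS CNTRY_DESC  -- one arbitrary description per currency code
  FROM CURRENCY
  GROUP BY CCY_CD
)
SELECT t.*, c.CNTRY_DESC
FROM TXN t
LEFT JOIN unique_currency c
  ON t.CCY_CD = c.CCY_CD;
```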
Some SQL engines (e.g. Presto) have aggregation functions like arbitrary(col)
that take any value and are a
Hi Giles,
Did you try running the query with the EXPLAIN AUTHORIZATION clause?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain#LanguageManualExplain-TheAUTHORIZATIONClause
On Wed, Aug 16, 2017 at 3:08 AM, Bear Giles wrote:
> Hi, an issue came up during recent AT and I'm
re to star the repository and to spread the word!
Thanks,
Furcy Pin
CTO @ Flaminem
Another option would be to try Facebook's Presto https://prestodb.io/
Like Impala, Presto is designed for fast interactive querying over Hive
tables, but it is also capable of querying data from many other SQL sources
(mySQL, postgreSQL, Kafka, Cassandra, ...
https://prestodb.io/docs/current/conne
2017 at 10:07 AM, Furcy Pin
> wrote:
>
>> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
>>
>>
>> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar > > wrote:
>>
>>> This is interesting and possibly a bug.
Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar
wrote:
> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our
Hi Dmitry,
I believe what you are referring to is similar to what the MSCK REPAIR
TABLE command does in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
Simply put: if you define a partitioned Hive table via a CREATE st
Hi Dmitry,
If I understand what you said correctly:
At the beginning you have csv files on hdfs,
and at the end you want a partitioned Hive table as parquet.
And your question is: "can I do this using only one Hive table and a LOAD
statement?"
The answer to that question is "no".
The correct
Hello Jan,
I guess this is because you wrote ${hivevar=PNAME} with an = instead of a :
regards,
Furcy
On Tue, Mar 21, 2017 at 5:50 PM, Brotanek, Jan
wrote:
> Hi, I have following shell script:
>
>
>
> #!/bin/bash
>
> PNAME=20170320_203225
>
> TNAME=tablea1
>
> /usr/bin/hive -hiveconf hive.cli
Actually, there is a legacy feature in Hive that would do exactly what you
need:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
Another way would simply to use dynamic partitioning :
INSERT INTO TABLE employee_orc PARTITION *
Hi Joaquin,
Suresh was faster than me ...
Also, you should check this :
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert
On Tue, Dec 13, 2016 at 3:19 PM, Suresh Kumar Sethuramaswamy <
rock...@gmail.com> wrote:
> Hi Joaquin
>
> In hive , when u run '
Hi Mahender,
it's hard to say what happens without seeing the actual query.
Hive has several ways to perform joins. There is a complete description of
how it does it here:
https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization
Sadly, the illustrations are broken.
There is also this
Hi,
you should replace
WITH table AS (subquery)
SELECT ...
FROM table
with
SELECT ...
FROM(
subquery
) table
Regards.
On Thu, Dec 1, 2016 at 12:32 PM, Priyanka Raghuvanshi
wrote:
> Hi All
>
>
> Getting error 'Table not found in the definition of view ' if a view is
> created using 'WITH'
ClassNotFoundException generally means that jars are missing from your
class path.
You probably need to link the spark jar to $HIVE_HOME/lib
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive
On Tue, Nov 29, 2016 at 2:03 AM
>
>
>
> On 25 October 2016 at 13:44, Fur
Hi Mich,
I mostly agree with you, but I would comment on the part about using HBase
as a maintenance free core product:
I would say that most medium-sized companies using Hadoop rely on Hortonworks
or Cloudera, which both provide a pre-packaged HBase installation. It would
probably make sense for them to
Hi Mich,
the umbrella JIRA for this gives a few reasons:
https://issues.apache.org/jira/browse/HIVE-9452
(with even more details in the attached pdf
https://issues.apache.org/jira/secure/attachment/12697601/HBaseMetastoreApproach.pdf
)
In my experience, Hive tables with a lot of partitions (> 10 0
Hi Rakesh,
this is an interesting question,
Did you look at the cube and rollup possibilities?
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
I believe that you could use this feature to merge your table 1 and table 2
in one single table with a
Hi Nitin,
I confirm that there is something odd here.
I did the following test :
create table test_orc (id int, name string, dept string) stored as ORC;
insert into table test_orc values (1, 'abc', 'xyz');
insert into table test_orc values (2, 'def', 'xyz');
insert into table test_orc values (3,
Hi Elliot,
I guess you can use IMetaStoreClient.listPartitionNames instead, and then
use IMetaStoreClient.getPartition for each partition.
This might be slow though, as you will have to make 10 000 calls to get
them.
Another option I'd consider is connecting directly to the Hive metastore.
This
I'm using Hive 1.1 on MR and dynamic partition pruning does not seem to
work.
Since MR is deprecated in 2.0, I assume we should not expect any future
perf optimisation on this side.
It has been implemented for Hive on Spark, though.
https://issues.apache.org/jira/browse/HIVE-9152
On Tue, Aug
Hi Abhishek,
Yes, it can happen.
The only such scenarios I can think of are when you use a WHERE clause with
a non-constant predicate.
As far as I know, partition pruning only works on constant predicates, because
Hive has to evaluate them *before* starting the query in order to prune the
partitions.
For instanc
Hi,
I think I had similar issues to yours.
Did you look in the Hive documentation at what the CASCADE keyword does on
ADD or CHANGE COLUMNS statements?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterColumn
From what I understand, the behavior of Hive
Hi,
This is feasible.
I already did it in some tool I developed.
Unfortunately I can't open-source it yet, but I can tell you how I did it.
There is this class in Hive source code
https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/ql/src/java/org/apache/hadoop/hive/ql
Hi Hive users,
I would like to pursue the discussion that happened during the design of
the feature:
https://issues.apache.org/jira/browse/HIVE-6167
Some concerns were raised back then, and I think that maybe now that it has
been implemented, some user feedback could bring grist to the mill.
Ev
Maybe you are compiling and running against different versions of Spark?
On Thu, Dec 3, 2015 at 6:54 PM, Mich Talebzadeh wrote:
> Trying to run Hive on Spark 1.3 engine, I get
>
>
>
> conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.
Hi Vijaya,
If you need some nice ETL capabilities, you may want to try
https://github.com/databricks/spark-csv
Among other things, spark-csv lets you read the csv as-is, and create and
insert a copy of the
data into a Hive table with any format you like (Parquet, ORC, etc.)
If you have a header ro
SHARED_READ
>> NULL1445586247044 1445585940654 hduser rhes564
>>
>> 14030 asehadoop t NULLWAITING EXCLUSIVE NULL
>> 1445586471827 NULLhduser rhes564
>>
>>
>>
>> I am not sure this is behaving like classic RDB
Hi Mich,
I believe the duration of locks is defined by hive.txn.timeout, which is 5
min by default.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
Retry your SHOW LOCKS command and check that the Last HeartBeat is not
changing.
If it is, it means your query is still act
Hi folks,
I would like to start a discussion with the Hive user and developer
community about an element of syntax present in SQL Server that could be
nice to have in Hive.
Back in 2012, before I started using Hive and was still using SQL Server, I
came across this post:
http://sqlblog.com/blogs/aaron
I just created it:
HIVE-11933 <https://issues.apache.org/jira/browse/HIVE-11933>
On Mon, Sep 21, 2015 at 6:37 PM, Alan Gates wrote:
> I am not aware of this issue. Please file a JIRA, and if it does turn out
> to be a duplicate we can mark it as such.
>
> A
Hi,
We bumped into a bug when using vectorization on a transactional table.
Here is a minimal example :
DROP TABLE IF EXISTS vectorization_transactional_test ;
CREATE TABLE vectorization_transactional_test (
id INT
)
CLUSTERED BY (id) into 3 buckets
STORED AS ORC
TBLPROPERTIES('transactional
er if
> used as <> ,=, IN, NOT IN.
> IS (NOT) NULL is the right way to handle such cases. COALESCE is some
> alternative too.
>
> Thank you,
> Kind Regards
> ~Maciek
>
> On Tue, Jul 7, 2015 at 11:46 AM, Furcy Pin wrote:
>
>> Hi folks,
>>
>
Hi folks,
just to let my fellow Hive users know that we found a bug with subqueries in
WHERE clauses and created a JIRA for it:
https://issues.apache.org/jira/browse/HIVE-11192
The latest version seems to be affected.
Regards,
Furcy Pin
Hi Sreeman,
Unfortunately, I don't think that Hive's built-in formats can currently read
csv files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy csv files myself,
I would recommend writing a MapReduce (or Spark) job
for cleaning your csv before g
Hi Murali,
As you can see from the source code of Hive's hash UDF:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java
https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector
by*
ACCOUNTID;
I am still using 0.12 though, maybe later versions changed some things.
Hope this helps,
Furcy Pin
PS:
On the other hand, Hive has the opposite behavior in GROUP BY clauses, e.g.
this will not work:
SELECT concat('test_', ACCOUNTID) AS tempColumn, COUNT(1) FROM table
G
Hi Thimut,
I believe that the UDF unix_timestamp uses the java class SimpleDateFormat.
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
From the doc, you can see that "m" denotes a minute while "M" denotes a
month.
For your problem, "-MMM-dd" should do the trick.
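The exact pattern in the message above is truncated, so the input format in the sketch below is an assumption; pattern letters follow SimpleDateFormat:

```sql
-- 'MMM' parses an abbreviated month name; lowercase 'mm' would mean minutes.
SELECT unix_timestamp('2015-Mar-08', 'yyyy-MMM-dd');
```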
Reg
Hi,
what you are trying to do looks very much like what the LAG windowing
function does.
If your version of Hive is 0.11 or higher, I suggest trying it.
The hive doc for windowing function is here (but is quite poor):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnal
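A minimal sketch of LAG (table and column names are made up): each row is compared to the previous row within the same user's timeline.

```sql
SELECT
  user_id,
  event_time,
  value,
  -- value of the previous row for this user, NULL on the first row
  LAG(value) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_value
FROM measurements;
```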
yoException:
> Unable to find class: fun_name*
> Serialization trace:
> genericUDTF (org.apache.hadoop.hive.ql.plan.UDTFDesc)
> conf (org.apache.hadoop.hive.ql.exec.UDTFOperator)
> childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator)
> childOperators (org.apache.hadoop.
Hi,
I think I encountered this kind of serialization problem when writing UDFs.
Usually, marking every field of the UDF as *transient* does the trick.
I guess the error means that Kryo tries to serialize the UDF class and
everything that is inside, and by marking them as transient
you ensure tha
One clarification:
A.Z*.A" with Z = not A matched against "ABABAB" will match ABA twice, but
will not match ABABA
2014-09-05 16:48 GMT+02:00 Furcy Pin :
> Hi all,
>
> I've just spent some time trying to understand how the 'regex' syntax
> worked for matc
asi <
> asif.abb...@gmail.com> wrote:
>
>> Hi Furcy,
>>
>> Many thanks for your email :)
>>
>> My latest info was that the rename took place due to objections by
>> Teradata, but didn't know if they had actually requested to take it off the
>>
Hi Muhammad,
From what I've googled a few months ago on the subject, the MatchPath UDF has
been removed from Cloudera and Hortonworks releases because Teradata
claims it violates one of their patents (apparently renaming it did not
suffice).
I guess that if you really need it, it might be possible to
s://issues.apache.org/jira/browse/HIVE-7143
> 5. https://issues.apache.org/jira/browse/HIVE-7344
>
> Hope it helps,
> Ashutosh
>
>
> On Wed, Aug 20, 2014 at 5:34 AM, Furcy Pin wrote:
>
>> Hi all,
>>
>> I have an event table with (user_id, timestamp, event)
Hi all,
I have an event table with (user_id, timestamp, event)
and I'm trying to write a query to get the first 10 events for each user.
My query goes like this :
SELECT user_id, event
FROM
(
SELECT
user_id,
event,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) as rownum
FROM eventTable
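The query above is truncated in the archive; the usual shape of this pattern (the outer filter is my assumption, based on the stated goal of keeping the first 10 events per user) would be:

```sql
SELECT user_id, event
FROM (
  SELECT
    user_id,
    event,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time) AS rownum
  FROM eventTable
) t
WHERE rownum <= 10;  -- keep only the first 10 events of each user
```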
Hi all,
I just wanted to point out a little gotcha we got while using the
collect_set UDF :
You should not perform a GROUP BY directly over a collect_set(...), because
the set is cast as an array and is not necessarily sorted.
For example, we ran a query looking like this;
SELECT
set,
COUNT(1)
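A common workaround (a sketch; sort_array availability depends on your Hive version, and the table/column names are assumptions) is to sort the array before grouping on it:

```sql
-- Sorting normalizes the array so that equal sets compare equal.
SELECT
  sort_array(s) AS sorted_set,
  COUNT(1) AS cnt
FROM (
  SELECT collect_set(item) AS s
  FROM events
  GROUP BY user_id
) t
GROUP BY sort_array(s);
```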
think?
2014-06-30 9:15 GMT+02:00 Furcy Pin :
> Hi all,
>
>
> Does anyone know what is the expected behavior on a query like this :
>
> FROM source
> INSERT OVERWRITE TABLE dest
> SELECT *
> INSERT INTO TABLE dest
> SELECT *
>
> is it the same as
>
>
Hi all,
Does anyone know what is the expected behavior on a query like this :
FROM source
INSERT OVERWRITE TABLE dest
SELECT *
INSERT INTO TABLE dest
SELECT *
is it the same as
FROM source
INSERT INTO TABLE dest
SELECT *
INSERT OVERWRITE TABLE dest
SELECT *
?
I'm asking because I ran a query
y equality join because you're using the partitioning step
> in order to enforce the join predicate instead of actually evaluating it.
> This could be much more efficient in some cases, so I was wondering if Hive
> can do it and if so, how can I trigger it?
>
> Cheers,
> Mar
Hi Mark,
I'm not sure I correctly understand what you're trying to do: do you know
the reference id on which you want to do the join beforehand?
Or is one of your tables small?
Or are they all big with a small intersection?
If you haven't yet, I would suggest having a look at MapJoin:
https:
ent from my iPhone
>
> On Apr 23, 2014, at 2:51 AM, Furcy Pin wrote:
>
> Hi,
>
> note that if your table contains the max value several time, all the
> occurences will be returned. Note also that if it contains a null it will
> be returned too.
>
> How about this?
Hi,
note that if your table contains the max value several times, all the
occurrences will be returned. Note also that if it contains a NULL, it will
be returned too.
How about this?
SELECT score FROM student ORDER BY score DESC LIMIT 1 ;
Note that on this query Impala is inconsistent with Hive or