Re: Dynamic partitioned parquet tables
When hive.optimize.sort.dynamic.partition is off, Hive opens a file writer for each new partition key as it is encountered and writes records to the appropriate files. Since the Parquet writer buffers writes in memory before flushing to disk, this can lead to OOMs when you have many partitions and therefore many open files. With hive.optimize.sort.dynamic.partition on, records are sorted by partition key before writing starts, so all records for a partition are written in one contiguous chunk before the next partition's file is opened.

The other issue you're encountering is that partition creation on the metastore is slow. I think that's unavoidable at the moment. I provided a patch (see HIVE-10385) but it's not for everyone. Since your size per partition is so small, I'd recommend not partitioning by day and simply making the date a column. For queries that span months or years you'll probably spend more time listing files and fetching partitions during query planning than actually scanning your data.

-Slava

On Fri, Oct 9, 2015 at 4:12 PM, Yogesh Keshetty wrote:
> Any one tried this? Please help me if you have any knowledge on this kind of use case.
>
> --
> From: yogesh.keshe...@outlook.com
> To: user@hive.apache.org
> Subject: Dynamic partitioned parquet tables
> Date: Fri, 9 Oct 2015 11:20:57 +0530
>
> Hello,
>
> I have a question regarding Parquet tables. We have POS data that we want to store in per-day partitions. We Sqoop the data into an external table in text file format and then insert into an external table that is partitioned by date and, due to some requirements, stored as Parquet. The average file size per day is around 2 MB. I know that Parquet is not meant for lots of small files, but we wanted to keep it that way.
>
> The problem is that during the initial historical data load we are trying to create dynamic partitions, but no matter how much memory I set, the job keeps failing with memory issues. After some research I found that turning on "set hive.optimize.sort.dynamic.partition = true" lets us create the dynamic partitions. But this is taking longer than we expected; is there any way to boost the performance? Also, despite turning the property on, when we try to create dynamic partitions for multiple years of data at a time we again run into a heap error. How can we handle this problem? Please help us.
>
> Thanks in advance!
>
> Thank you,
> Yogesh

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
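As a sketch, the two alternatives discussed above look like this (table and column names are illustrative, not from the original thread):

```sql
-- Option A: sort before the dynamic-partition insert so only one
-- Parquet writer is open at a time
SET hive.optimize.sort.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE pos_parquet PARTITION (sale_date)
SELECT item, price, sale_date
FROM pos_staging;

-- Option B: with ~2 MB per day, skip partitioning and keep the date
-- as an ordinary column; a predicate on sale_date is usually cheaper
-- than planning over thousands of tiny partitions
CREATE EXTERNAL TABLE pos_parquet_flat (
  item      string,
  price     double,
  sale_date string
)
STORED AS PARQUET;
```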
nested join issue
I'm running into a peculiar issue with nested joins and an outer select. I see this error on 1.1.0 and 1.2.0 but not 0.13, which seems like a regression. The following query produces no results:

select s
from (
  select last.*, action.st2, action.n
  from (
    select purchase.s, purchase.timestamp, max(mevt.timestamp) as last_stage_timestamp
    from (select * from purchase_history) purchase
    join (select * from cart_history) mevt
      on purchase.s = mevt.s
    where purchase.timestamp > mevt.timestamp
    group by purchase.s, purchase.timestamp
  ) last
  join (select * from events) action
    on last.s = action.s and last.last_stage_timestamp = action.timestamp
) list;

While this one does produce results:

select *
from (
  select last.*, action.st2, action.n
  from (
    select purchase.s, purchase.timestamp, max(mevt.timestamp) as last_stage_timestamp
    from (select * from purchase_history) purchase
    join (select * from cart_history) mevt
      on purchase.s = mevt.s
    where purchase.timestamp > mevt.timestamp
    group by purchase.s, purchase.timestamp
  ) last
  join (select * from events) action
    on last.s = action.s and last.last_stage_timestamp = action.timestamp
) list;

1 21 20 Bob 1234
1 31 30 Bob 1234
3 51 50 Jeff 1234

The setup to test this is:

create table purchase_history (s string, product string, price double, timestamp int);
insert into purchase_history values ('1', 'Belt', 20.00, 21);
insert into purchase_history values ('1', 'Socks', 3.50, 31);
insert into purchase_history values ('3', 'Belt', 20.00, 51);
insert into purchase_history values ('4', 'Shirt', 15.50, 59);

create table cart_history (s string, cart_id int, timestamp int);
insert into cart_history values ('1', 1, 10);
insert into cart_history values ('1', 2, 20);
insert into cart_history values ('1', 3, 30);
insert into cart_history values ('1', 4, 40);
insert into cart_history values ('3', 5, 50);
insert into cart_history values ('4', 6, 60);

create table events (s string, st2 string, n int, timestamp int);
insert into events values ('1', 'Bob', 1234, 20);
insert into events values ('1', 'Bob', 1234, 30);
insert into events values ('1', 'Bob', 1234, 25);
insert into events values ('2', 'Sam', 1234, 30);
insert into events values ('3', 'Jeff', 1234, 50);
insert into events values ('4', 'Ted', 1234, 60);

I realize select * and select s are not all that interesting in this context, but what led me to this issue was that select count(distinct s) was not returning results. The above queries are the simplified queries that reproduce the issue. I will note that if I convert the inner join to a table and select from that, the issue does not appear.

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
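A minimal sketch of the workaround mentioned at the end of the message (materializing the inner join before the outer select); the intermediate table name is illustrative:

```sql
-- Write the inner join result to a real table first, sidestepping
-- the nested-subquery regression
CREATE TABLE last_stage AS
SELECT purchase.s, purchase.timestamp,
       max(mevt.timestamp) AS last_stage_timestamp
FROM purchase_history purchase
JOIN cart_history mevt ON purchase.s = mevt.s
WHERE purchase.timestamp > mevt.timestamp
GROUP BY purchase.s, purchase.timestamp;

-- The outer select / aggregate now runs against a plain table
SELECT count(DISTINCT last.s)
FROM last_stage last
JOIN events action
  ON last.s = action.s
 AND last.last_stage_timestamp = action.timestamp;
```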
Re: Very slow dynamic partition load
This is something that a few of us have run into. I think the bottleneck is in partition-creation calls to the metastore. My workaround was HIVE-10385, which optionally skips partition creation in the metastore, but this isn't a solution for everyone. If you don't require actual partitions in the table, just partitioned data in HDFS, give it a shot. It may be worthwhile looking into optimizations for this use case.

-Slava

On Thu, Jun 11, 2015 at 11:56 AM, Pradeep Gollakota wrote:
> Hi All,
>
> I have a table which is partitioned on two columns (customer, date). I'm loading some data into the table using a Hive query. The MapReduce job completed within a few minutes and needs to "commit" the data to the appropriate partitions. There were about 32000 partitions generated. The commit phase has been running for almost 16 hours and has not finished yet. I've been monitoring jmap, and don't believe it's a memory or GC issue. I've also been looking at jstack and am not sure why it's so slow. I'm not sure what the problem is, but it seems to be a Hive performance issue when it comes to "highly partitioned" tables.
>
> Any thoughts on this issue would be greatly appreciated.
>
> Thanks in advance,
> Pradeep

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
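If you take the HIVE-10385 route (data laid out in HDFS but no partitions registered), the metastore can be brought back in sync later with a repair. A sketch, with an illustrative table name:

```sql
-- Register partitions that exist on HDFS but not in the metastore
MSCK REPAIR TABLE my_partitioned_table;

-- Verify what the metastore now knows about
SHOW PARTITIONS my_partitioned_table;
```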
Re: Hive 1.2.0 Unable to start metastore
Sounds like you ran into this: https://issues.apache.org/jira/browse/HIVE-9198

On Mon, Jun 8, 2015 at 1:06 PM, James Pirz wrote:
> Thanks!
> It was a similar problem: conflicting jars, but between Hive and Spark. My eventual goal is running Spark against Hive's tables, and with Spark's libraries on my path as well, there were conflicting jar files. I removed the Spark libraries from my PATH and Hive's services (remote metastore) started up fine.
> For now I am good, but I am just wondering what is the correct way to fix this? Once I want to start Spark, I need to include its libraries in the PATH, and the conflicts seem inevitable.
>
> On Mon, Jun 8, 2015 at 12:09 PM, Slava Markeyev <slava.marke...@upsight.com> wrote:
>> It sounds like you are running into a jar conflict between the Hive-packaged Derby and the Hadoop-distro-packaged Derby. Look for Derby jars on your system to confirm.
>>
>> In the meantime, try adding this to your hive-env.sh or hadoop-env.sh file:
>>
>> export HADOOP_USER_CLASSPATH_FIRST=true
>>
>> On Mon, Jun 8, 2015 at 11:52 AM, James Pirz wrote:
>>> I am trying to run Hive 1.2.0 on Hadoop 2.6.0 (on a cluster, running CentOS). I am able to start the Hive CLI and run queries. But once I try to start Hive's metastore (I am trying to use the built-in Derby) using:
>>>
>>> hive --service metastore
>>>
>>> I keep getting Class Not Found exceptions for "org.apache.derby.jdbc.EmbeddedDriver" (see below).
>>>
>>> I have exported $HIVE_HOME and added $HIVE_HOME/bin and $HIVE_HOME/lib to the $PATH, and I see that there is a "derby-10.11.1.1.jar" file under $HIVE_HOME/lib.
>>>
>>> In my hive-site.xml (under $HIVE_HOME/conf) I have:
>>>
>>> <property>
>>>   <name>javax.jdo.option.ConnectionDriverName</name>
>>>   <value>org.apache.derby.jdbc.EmbeddedDriver</value>
>>>   <description>Driver class name for a JDBC metastore</description>
>>> </property>
>>>
>>> <property>
>>>   <name>javax.jdo.option.ConnectionURL</name>
>>>   <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
>>>   <description>JDBC connect string for a JDBC metastore</description>
>>> </property>
>>>
>>> So I am not sure why it cannot find it.
>>> Any suggestion or hint would be highly appreciated.
>>>
>>> Here is the error:
>>>
>>> javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
>>> ...
>>> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at java.lang.Class.newInstance(Class.java:379)
>>> at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>>> at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>>> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>>> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>>> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)
>>
>> --
>> Slava Markeyev | Engineering | Upsight
>> Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
Re: Hive 1.2.0 Unable to start metastore
It sounds like you are running into a jar conflict between the Hive-packaged Derby and the Hadoop-distro-packaged Derby. Look for Derby jars on your system to confirm.

In the meantime, try adding this to your hive-env.sh or hadoop-env.sh file:

export HADOOP_USER_CLASSPATH_FIRST=true

On Mon, Jun 8, 2015 at 11:52 AM, James Pirz wrote:
> I am trying to run Hive 1.2.0 on Hadoop 2.6.0 (on a cluster, running CentOS). I am able to start the Hive CLI and run queries. But once I try to start Hive's metastore (I am trying to use the built-in Derby) using:
>
> hive --service metastore
>
> I keep getting Class Not Found exceptions for "org.apache.derby.jdbc.EmbeddedDriver" (see below).
>
> I have exported $HIVE_HOME and added $HIVE_HOME/bin and $HIVE_HOME/lib to the $PATH, and I see that there is a "derby-10.11.1.1.jar" file under $HIVE_HOME/lib.
>
> In my hive-site.xml (under $HIVE_HOME/conf) I have:
>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.apache.derby.jdbc.EmbeddedDriver</value>
>   <description>Driver class name for a JDBC metastore</description>
> </property>
>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
>   <description>JDBC connect string for a JDBC metastore</description>
> </property>
>
> So I am not sure why it cannot find it.
> Any suggestion or hint would be highly appreciated.
>
> Here is the error:
>
> javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
> ...
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at java.lang.Class.newInstance(Class.java:379)
> at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
> at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
> at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
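As a configuration sketch, the fix suggested above plus a quick way to confirm the conflict (the find command and its paths are illustrative):

```shell
# hive-env.sh / hadoop-env.sh: make user-supplied (Hive) jars win over
# the Hadoop distribution's bundled copies on the classpath
export HADOOP_USER_CLASSPATH_FIRST=true

# Illustrative check: list every derby jar visible to Hive and Hadoop;
# more than one distinct version indicates the conflict described above.
# find "$HIVE_HOME/lib" "$HADOOP_HOME" -name 'derby*.jar' 2>/dev/null
```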
Re: Hive 1.2.0 fails on Hadoop 2.6.0
What's the rest of that stack trace? Do you see a NoClassDefFoundError?

-Slava

On Fri, Jun 5, 2015 at 7:28 PM, James Pirz wrote:
> I am trying to run Apache Hive 1.2.0 on Hadoop 2.6.0 on a cluster. My Hadoop cluster comes up fine (I start HDFS and YARN), then I create the required tmp and warehouse directories in HDFS and try to start the Hive CLI (I do not do anything with HCatalog or HiveServer2), but I keep getting errors related to the metastore (see below). Replacing Hive 1.2.0 with Hive 0.13, it just works fine.
>
> Is there anything changed regarding starting Hive 1.x on Hadoop 2.x compared to Hive 0.x? (This is the first time I am trying Hive on Hadoop 2.)
>
> Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:519)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
> ....

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
I've created HIVE-10385 and attached a patch. Unit tests to come.

-Slava

On Fri, Apr 17, 2015 at 1:34 PM, Chris Roblee wrote:
> Hi Slava,
>
> We would be interested in reviewing your patch. Can you please provide more details?
>
> Is there any other way to disable the partition creation step?
>
> Thanks,
> Chris
>
> On 4/13/15 10:59 PM, Slava Markeyev wrote:
>> This is something I've encountered when doing ETL with Hive and having it create tens of thousands of partitions. The issue is that each partition needs to be added to the metastore, and this is an expensive operation to perform. My workaround was adding a flag to Hive that optionally disables the metastore partition-creation step. This may not be a solution for everyone, as the table then has no partitions and you would have to run msck repair, but depending on your use case you may just want the data in HDFS.
>>
>> If there is interest in having this be an option I'll make a ticket and submit the patch.
>>
>> -Slava
>>
>> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A <cheng.a...@intel.com> wrote:
>>> Hi Tianqi,
>>>
>>> Can you attach hive.log as more detailed information?
>>>
>>> +Sergio
>>>
>>> Yours,
>>> Ferdinand Xu
>>>
>>> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
>>> *Sent:* Friday, April 10, 2015 1:34 AM
>>> *To:* user@hive.apache.org
>>> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>>>
>>> Hello Hive,
>>>
>>> I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading the data into a table. I have 2 tables now:
>>>
>>> -- table_1:
>>> CREATE EXTERNAL TABLE `table_1`(
>>>   `keyword` string,
>>>   `domain` string,
>>>   `url` string
>>> )
>>> PARTITIONED BY (yearmonth INT, partition1 STRING)
>>> STORED AS RCfile
>>>
>>> -- table_2:
>>> CREATE EXTERNAL TABLE `table_2`(
>>>   `keyword` string,
>>>   `domain` string,
>>>   `url` string
>>> )
>>> PARTITIONED BY (yearmonth INT, partition2 STRING)
>>> STORED AS Parquet
>>>
>>> I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic partitioning, and the number of partitions grows dramatically from 1500 to 40k (because I want to use something else as the partitioning column). The MapReduce job was fine. Somehow the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.
>>>
>>> Is this expected when we have 40k partitions?
>>>
>>> ------
>>> Refs - Here are the parameters that I used:
>>> export HADOOP_HEAPSIZE=16384
>>> set PARQUET_FILE_SIZE=268435456;
>>> set parquet.block.size=268435456;
>>> set dfs.blocksize=268435456;
>>> set parquet.compression=SNAPPY;
>>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>> SET hive.exec.max.dynamic.partitions=50;
>>> SET hive.exec.max.dynamic.partitions.pernode=5;
>>> SET hive.exec.max.created.files=100;
>>>
>>> Thank you very much!
>>> Tianqi Tong

--
Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>
Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions
This is something I've encountered when doing ETL with Hive and having it create tens of thousands of partitions. The issue is that each partition needs to be added to the metastore, and this is an expensive operation to perform. My workaround was adding a flag to Hive that optionally disables the metastore partition-creation step. This may not be a solution for everyone, as the table then has no partitions and you would have to run msck repair, but depending on your use case you may just want the data in HDFS.

If there is interest in having this be an option I'll make a ticket and submit the patch.

-Slava

On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A wrote:
> Hi Tianqi,
>
> Can you attach hive.log as more detailed information?
>
> +Sergio
>
> Yours,
> Ferdinand Xu
>
> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
> *Sent:* Friday, April 10, 2015 1:34 AM
> *To:* user@hive.apache.org
> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k Partitions
>
> Hello Hive,
>
> I'm a developer using Hive to process TB-level data, and I'm having some difficulty loading the data into a table. I have 2 tables now:
>
> -- table_1:
> CREATE EXTERNAL TABLE `table_1`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition1 STRING)
> STORED AS RCfile
>
> -- table_2:
> CREATE EXTERNAL TABLE `table_2`(
>   `keyword` string,
>   `domain` string,
>   `url` string
> )
> PARTITIONED BY (yearmonth INT, partition2 STRING)
> STORED AS Parquet
>
> I'm doing an INSERT OVERWRITE into table_2 from a SELECT on table_1 with dynamic partitioning, and the number of partitions grows dramatically from 1500 to 40k (because I want to use something else as the partitioning column). The MapReduce job was fine. Somehow the process got stuck at "Loading data to table default.table_2 (yearmonth=null, domain_prefix=null)", and I've been waiting for hours.
>
> Is this expected when we have 40k partitions?
>
> ------
> Refs - Here are the parameters that I used:
> export HADOOP_HEAPSIZE=16384
> set PARQUET_FILE_SIZE=268435456;
> set parquet.block.size=268435456;
> set dfs.blocksize=268435456;
> set parquet.compression=SNAPPY;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.max.dynamic.partitions=50;
> SET hive.exec.max.dynamic.partitions.pernode=5;
> SET hive.exec.max.created.files=100;
>
> Thank you very much!
> Tianqi Tong

--
Slava Markeyev | Engineering | Upsight
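For readers hitting the same wall, the dynamic-partition limits quoted above are the usual knobs to raise before a large dynamic-partition load. The values below are purely illustrative (they are not the original poster's settings) and must exceed the expected partition count:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- Raise the per-query and per-node ceilings above the ~40k partitions
-- being created (illustrative values)
SET hive.exec.max.dynamic.partitions = 50000;
SET hive.exec.max.dynamic.partitions.pernode = 5000;
SET hive.exec.max.created.files = 100000;
```

Note that raising these limits only lets the job run; it does not address the slow metastore partition-registration phase discussed in this thread.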
Re: rename a database
Just to note: the current patch for HIVE-4847 <https://issues.apache.org/jira/browse/HIVE-4847> doesn't handle errors very well, and you can potentially end up in an inconsistent state if there is a failure along the way. Also, IIRC, external tables aren't handled properly either.

-Slava

On Fri, Mar 27, 2015 at 9:14 AM, @Sanjiv Singh wrote:
> There is already a JIRA raised for this functionality, with a patch available on the ticket.
>
> The patch moves the database directory on HDFS and changes its related metadata entities. Unit tests are also included.
>
> https://issues.apache.org/jira/browse/HIVE-4847
>
> I have not tried it yet.
>
> Regards,
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Fri, Mar 27, 2015 at 9:31 PM, Dr Mich Talebzadeh wrote:
>> Yep, it can happen in any server hosting a database or schema.
>>
>> I believe you will need to rename the database's .db directory under the Hive warehouse directory.
>>
>> Then you can hack the Hive metastore database. Mine is on SAP ASE. Certain tables like DBS in the metastore DB need to be changed; I cannot remember which off the top of my head. Best to back up the database before hacking it, and do it out of business hours.
>>
>> That is one approach. I am sure there are better ways.
>>
>> HTH,
>> Mich
>>
>> On 27/3/2015, "Fabio C." wrote:
>>> Maybe they just typed time_shit instead of time_shift and found it out after 3 hours of table compression... I don't think it's too important, but what is the workaround? I'm also interested in this. Maybe it's just a matter of the metastore, and one could try to explore the metastore DB to change how the database is referenced, possibly having to change also the DB folder name on HDFS... My two cents ;)
>>>
>>> Regards
>>>
>>> On Fri, Mar 27, 2015 at 3:10 PM, @Sanjiv Singh wrote:
>>>> Can I know why you want to do so?
>>>>
>>>> Currently there is no command or direct way to do that; I can suggest a workaround for this.
>>>>
>>>> Thanks,
>>>> Sanjiv Singh
>>>>
>>>> On Wed, Mar 25, 2015 at 10:01 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Is there any way in Hive 0.10 to rename a database?
>>>>>
>>>>> Thanks

--
Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>
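A rough sketch of the manual metastore hack described in this thread, assuming a SQL-backed metastore with the stock schema (the DBS table with NAME and DB_LOCATION_URI columns); the database names are illustrative, and, as warned above, back up the metastore first:

```sql
-- 1. Move the warehouse directory on HDFS first, e.g.:
--    hadoop fs -mv /user/hive/warehouse/olddb.db /user/hive/warehouse/newdb.db

-- 2. Then repoint the metastore entry (run against the metastore RDBMS,
--    not through Hive)
UPDATE DBS
SET NAME = 'newdb',
    DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, 'olddb.db', 'newdb.db')
WHERE NAME = 'olddb';
```

Note this does not touch the per-table storage-descriptor locations, which also reference the old path; that gap is part of why the thread warns about ending up in an inconsistent state.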
Re: CSV file reading in hive
You can use the LazySimpleSerDe with ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'. Check the DDL docs for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

On Thu, Feb 12, 2015 at 8:19 PM, Sreeman wrote:
> Hi All,
>
> How are all of you creating a Hive/Impala table when the CSV file has some values with a comma in between? It is like:
>
> sree,12345,"payment made,but it is not successful"
>
> I know the OpenCSV SerDe is there, but it is not available in versions of Hive below 0.14.0.

--
Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>
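A sketch of both options (table name and columns are illustrative): the delimited-with-escape DDL suggested in the reply, and, on Hive 0.14+, the OpenCSV SerDe the original poster mentions, which also understands quoted fields like the sample row:

```sql
-- Option 1: LazySimpleSerDe with an escape character. Works on older
-- Hive, but expects commas inside values to be backslash-escaped,
-- not wrapped in quotes.
CREATE TABLE payments_plain (
  name string, id int, note string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
STORED AS TEXTFILE;

-- Option 2 (Hive 0.14+): OpenCSVSerde handles quoted fields such as
-- "payment made,but it is not successful"
CREATE TABLE payments_csv (
  name string, id int, note string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;
```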
Re: Hive Insert overwrite creating a single file with large block size
You can control the HDFS block size by setting dfs.block.size. However, I think you might be asking how to control the size and number of files generated on insert. Is that correct?

On Fri, Jan 9, 2015 at 4:41 PM, Buntu Dev wrote:
> I've got a bunch of small Avro files (<5MB) and have a table against those files. I created a new table and did an 'INSERT OVERWRITE' selecting from the existing table, but did not find any option to provide the file block size. It currently creates a single file per partition.
>
> How do I specify the output block size during the 'INSERT OVERWRITE'?
>
> Thanks!

--
Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>
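For reference, the knobs that usually govern output file size and count on insert are Hive's small-file merge settings plus the HDFS block size mentioned above; a sketch with illustrative values, not recommendations from this thread:

```sql
-- Merge small output files at the end of the job
SET hive.merge.mapfiles = true;        -- merge outputs of map-only jobs
SET hive.merge.mapredfiles = true;     -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task = 268435456;      -- target ~256 MB per merged file
SET hive.merge.smallfiles.avgsize = 134217728; -- merge when avg file < ~128 MB

-- HDFS block size for files written by this session
SET dfs.block.size = 268435456;
```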