Indeed, if your table is big then you DO want to parallelise, and setting the number of reducers to 1 is clearly not a good idea.
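(If the goal is fewer output files without sacrificing parallelism, Hive's small-file merge settings may be worth a look: they add a cheap merge step after the job instead of constraining the job itself to one reducer. A sketch — the size thresholds below are illustrative, not recommendations:)

```sql
-- Merge small output files in a follow-up step, keeping the main job parallel.
SET hive.merge.mapfiles=true;        -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;     -- merge outputs of map-reduce jobs
SET hive.merge.tezfiles=true;        -- same idea for Hive on Tez
SET hive.merge.size.per.task=256000000;     -- target size of each merged file (bytes)
SET hive.merge.smallfiles.avgsize=16000000; -- only merge when the average output file
                                            -- is smaller than this threshold (bytes)
```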
From what you explained, I still don't understand what your exact problem is. Can't you just leave your query as is and be okay with many part_0000X files?

On Wed, 8 Aug 2018 at 16:20, Sujeet Pardeshi <sujeet.parde...@sas.com> wrote:

> Thanks Furcy. Will the below setting not hit performance? Just one
> reducer? I am processing huge data and cannot compromise on performance.
>
> SET mapred.reduce.tasks = 1
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar Pune, Maharashtra, 411 013
> off: +91-20-30418810
>
> "When the solution is simple, God is answering…"
>
> From: Furcy Pin <pin.fu...@gmail.com>
> Sent: 08 August 2018 PM 06:37
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
> It might sound silly, but isn't that what Hive is supposed to do, being a
> distributed computation framework and all?
>
> Hive will write one file per reducer, called 000000_0, 000001_0, etc., where
> the number corresponds to the number of your reducer.
> Sometimes the _0 will be a _1 or _2 or more, depending on retries or
> speculative execution.
>
> This is how MapReduce works, and this is how Tez and Spark work too: you
> can't have several executors writing simultaneously to the same file on
> HDFS (maybe you can on S3, but not through the HDFS API). So each executor
> writes its own file.
> If you want to read back the data with MapReduce, Tez or Spark, you simply
> specify the folder path, and all files in it will be read. That's what Hive
> does too.
> Now, if you really want only one file (for instance to export it as a CSV
> file), I guess you can try to add this before your query
> (if you use MapReduce; there probably are similar configurations for Hive
> on Tez or Hive on Spark):
>
> SET mapred.reduce.tasks = 1
>
> And if you really suspect that some files are not correctly overwritten
> (like it happened to me once with Spark on Azure blob storage), I suggest
> that you check the file timestamps to determine which execution of your
> query wrote them.
>
> Hope this helps,
>
> Furcy
>
> On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi <sujeet.parde...@sas.com>
> wrote:
>
> Hi Deepak,
> Thanks for your response. The table is not bucketed or clustered. It can
> be seen below.
>
> DROP TABLE IF EXISTS ${SCHEMA_NM}.daily_summary;
> CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
> (
>   bouncer VARCHAR(12),
>   device_type VARCHAR(52),
>   visitor_type VARCHAR(10),
>   visit_origination_type VARCHAR(65),
>   visit_origination_name VARCHAR(260),
>   pg_domain_name VARCHAR(215),
>   class1_id VARCHAR(650),
>   class2_id VARCHAR(650),
>   bouncers INT,
>   rv_revenue DECIMAL(17,2),
>   visits INT,
>   active_page_view_time INT,
>   total_page_view_time BIGINT,
>   average_visit_duration INT,
>   co_conversions INT,
>   page_views INT,
>   landing_page_url VARCHAR(1332),
>   dt DATE
> )
> PARTITIONED BY (datelocal DATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> LOCATION '${OUTPUT_PATH}/daily_summary/'
> TBLPROPERTIES ('serialization.null.format'='');
>
> MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;
>
> Regards,
> Sujeet Singh Pardeshi
>
> -----Original Message-----
> From: Deepak Jaiswal <djais...@hortonworks.com>
> Sent: 07 August 2018 PM 11:19
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
> Hi Sujeet,
>
> I am assuming that the table is bucketed? If so, then the name represents
> which bucket the file belongs to, as Hive creates one file per bucket for
> each operation.
>
> In this case, the file 000003_0 belongs to bucket 3.
> To always have files named 000000_0, the table must be unbucketed.
> I hope it helps.
>
> Regards,
> Deepak
>
> On 8/7/18, 1:33 AM, "Sujeet Pardeshi" <sujeet.parde...@sas.com> wrote:
>
> Hi All,
> I am doing an INSERT OVERWRITE operation through a Hive external table
> onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times
> I notice that it creates files with other names like 000003_0 etc. I
> always need to overwrite the existing file, but with inconsistent file
> names I am unable to do so. How do I force Hive to always create a
> consistent filename like 000000_0? Below is an example of how my code
> looks, where tab_content is a Hive external table.
>
> INSERT OVERWRITE TABLE tab_content
> PARTITION(datekey)
> select * from source
>
> Regards,
> Sujeet Singh Pardeshi
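(For an INSERT OVERWRITE into a partitioned table like the one above, a commonly used alternative — sketched here, untested — is to route all rows of each partition to a single reducer with DISTRIBUTE BY. Each partition directory then ends up with exactly one file, while different partitions are still written in parallel:)

```sql
INSERT OVERWRITE TABLE tab_content
PARTITION (datekey)
SELECT *
FROM source
DISTRIBUTE BY datekey;  -- one reducer receives all rows for a given datekey,
                        -- so each partition directory gets a single output file
```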