RE: Hive output file 000000_0
Also, I believe that the output format matters. If your output format is TEXTFILE, I think that all of the reducers can append to the same file concurrently. For block-based output formats, however, that isn't possible.

From: Furcy Pin [mailto:pin.fu...@gmail.com]
Sent: Wednesday, August 08, 2018 9:58 AM
To: user@hive.apache.org
Subject: Re: Hive output file 000000_0

Indeed, if your table is big, then you DO want to parallelise, and setting the number of reducers to 1 is clearly not a good idea.

From what you explained, I still don't understand what your exact problem is. Can't you just leave your query as is and be okay with many part_X files?

On Wed, 8 Aug 2018 at 16:20, Sujeet Pardeshi <sujeet.parde...@sas.com> wrote:

Thanks Furcy. Will the setting below not hurt performance? Just one reducer? I am processing huge data and cannot compromise on performance.

SET mapred.reduce.tasks = 1

Regards,
Sujeet Singh Pardeshi
Software Specialist
SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar Pune, Maharashtra, 411 013
off: +91-20-30418810
"When the solution is simple, God is answering…"

From: Furcy Pin <pin.fu...@gmail.com>
Sent: 08 August 2018 PM 06:37
To: user@hive.apache.org
Subject: Re: Hive output file 000000_0

EXTERNAL

It might sound silly, but isn't that what Hive is supposed to do, being a distributed computation framework and all?

Hive will write one file per reducer, called 000000_0, 000001_0, etc., where the number corresponds to the number of your reducer. Sometimes the _0 will be a _1 or _2 or more, depending on retries or speculative execution.

This is how MapReduce works, and this is how Tez and Spark work too: you can't have several executors writing simultaneously to the same file on HDFS (maybe you can on S3, but not through the HDFS API), so each executor writes its own file.
If you want to read back the data with MapReduce, Tez or Spark, you simply specify the folder path and all files in it will be read. That's what Hive does too.

Now, if you really want only one file (for instance to export it as a CSV file), I guess you can try to add this before your query (if you use MapReduce; there are probably similar configurations for Hive on Tez or Hive on Spark):

SET mapred.reduce.tasks = 1

And if you really suspect that some files are not correctly overwritten (as happened to me once with Spark on Azure blob storage), I suggest that you check the file timestamps to determine which execution of your query wrote them.

Hope this helps,

Furcy

On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi <sujeet.parde...@sas.com> wrote:

Hi Deepak,
Thanks for your response. The table is not bucketed or clustered, as can be seen below.

DROP TABLE IF EXISTS ${SCHEMA_NM}.daily_summary;
CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
(
  bouncer VARCHAR(12),
  device_type VARCHAR(52),
  visitor_type VARCHAR(10),
  visit_origination_type VARCHAR(65),
  visit_origination_name VARCHAR(260),
  pg_domain_name VARCHAR(215),
  class1_id VARCHAR(650),
  class2_id VARCHAR(650),
  bouncers INT,
  rv_revenue DECIMAL(17,2),
  visits INT,
  active_page_view_time INT,
  total_page_view_time BIGINT,
  average_visit_duration INT,
  co_conversions INT,
  page_views INT,
  landing_page_url VARCHAR(1332),
  dt DATE
)
PARTITIONED BY (datelocal DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LOCATION '${OUTPUT_PATH}/daily_summary/'
TBLPROPERTIES ('serialization.null.format'='');

MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;

Regards,
Sujeet Singh Pardeshi
Software Specialist
SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar Pune, Maharashtra, 411 013
off: +91-20-30418810
"When the solution is simple, God is answering…"

-----Original Message-----
From: Deepak Jaiswal <djais...@hortonworks.com>
Sent: 07 August 2018 PM 11:19
To: user@hive.apache.org
Subject: Re: Hive output file 000000_0

EXTERNAL

Hi Sujeet,

I am assuming that the table is bucketed? If so, then the name represents which bucket the file belongs to, as Hive creates one file per bucket for each operation. In this case, the file 000003_0 belongs to bucket 3.

To always have files named 000000_0, the table must be unbucketed. I hope this helps.

Regards,
Deepak

On 8/7/18, 1:33 AM, "Sujeet Pardeshi" <sujeet.parde...@sas.com> wrote:

Hi All,
I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names, like 000003_0. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent file name like 000000_0?

Below is an example of what my code looks like, where tab_content is a Hive external table.

INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
SELECT * FROM source

Regards,
Sujeet Singh Pardeshi
Software Specialist
SAS Research and Development (India) Pvt. Ltd.
Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar Pune, Maharashtra, 411 013
off: +91-20-30418810
"When the solution is simple, God is answering…"
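The part-file names discussed in this thread follow the MapReduce convention of a zero-padded task (reducer or bucket) id, an underscore, and an attempt suffix that can grow with retries or speculative execution. As a rough illustration (the exact pattern is an assumption inferred from the names seen above, not from Hive's source), a small Python sketch that decomposes such names:

```python
import re

# Assumed part-file pattern: six zero-padded digits for the task id,
# then "_" and the attempt number, e.g. "000003_1".
PART_FILE = re.compile(r"^(\d{6})_(\d+)$")

def parse_part_file(name):
    """Return (task_id, attempt) for names like '000003_1', else None."""
    m = PART_FILE.match(name)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))
```

Under this reading, `000000_0` is the first attempt of task 0, and `000003_0` is simply the output of task (or bucket) 3, which is why its presence is normal rather than an error.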
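Since an INSERT OVERWRITE can leave behind a previously written part file under a different task id (the stale-file situation Furcy alludes to when he suggests checking timestamps), one workaround is to sweep the target location for files older than the current query. This is a hedged sketch only, using a local directory as a stand-in for the S3 partition prefix; a real S3 cleanup would list and delete objects via boto3 instead (not shown):

```python
import os

def remove_stale_part_files(partition_dir, query_start_time):
    """Delete files whose modification time predates the current query.

    Sketch under local-filesystem assumptions: every file in the
    directory is treated as a part file, and mtime is used as the
    write time of the producing query.
    """
    removed = []
    for name in sorted(os.listdir(partition_dir)):
        path = os.path.join(partition_dir, name)
        if os.path.getmtime(path) < query_start_time:
            os.remove(path)
            removed.append(name)
    return removed
```

Recording the query start time before running the INSERT OVERWRITE, then sweeping afterwards, leaves only the files that query actually produced.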
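If a single output file is truly required, an alternative to forcing one reducer is to merge the part files after the job finishes, e.g. with `hadoop fs -getmerge` on HDFS. A minimal local sketch of that merge, assuming TEXTFILE output (row-delimited part files can simply be concatenated in name order; this does not hold for block-based formats like ORC or Parquet):

```python
import glob
import os

def merge_part_files(input_dir, output_path):
    """Concatenate every part file in input_dir into one output file.

    Rough local equivalent of `hadoop fs -getmerge`, valid only for
    row-delimited text output. Part files are assumed to match the
    digits_digits naming convention (e.g. "000000_0").
    """
    part_files = sorted(glob.glob(os.path.join(input_dir, "[0-9]*_[0-9]*")))
    with open(output_path, "wb") as out:
        for part in part_files:
            with open(part, "rb") as src:
                out.write(src.read())
    return part_files
```

This keeps the query itself fully parallel and pays the single-file cost only once, in a cheap sequential post-processing step.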