It might sound silly, but isn't that what Hive is supposed to do, being a
distributed computation framework and all?

Hive writes one file per reducer, named 000000_0, 000001_0, etc., where the
number corresponds to the reducer that wrote it.
Sometimes the _0 suffix will be a _1 or _2 or more, depending on retries or
speculative execution.

This is how MapReduce works, and this is how Tez and Spark work too: you
can't have several executors writing simultaneously to the same file on HDFS
(maybe you can on S3, but not through the HDFS API), so each executor writes
to its own file.
If you want to read back the data with MapReduce, Tez or Spark, you simply
specify the folder path, and all files in it will be read. That's what Hive
does too.
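
For example, with the table from your DDL below, a partition folder written
by a job with three reducers could look like this (the date value is just an
illustration):

${OUTPUT_PATH}/daily_summary/datelocal=2018-08-01/000000_0
${OUTPUT_PATH}/daily_summary/datelocal=2018-08-01/000001_0
${OUTPUT_PATH}/daily_summary/datelocal=2018-08-01/000002_0

A SELECT on the table (or a MapReduce/Tez/Spark job pointed at that folder)
reads all three files as a single dataset.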

Now, if you really want only one output file (for instance to export it as a
CSV file), you can try adding this before your query
(this is the MapReduce setting; Hive on Tez or Hive on Spark probably have
similar configurations):
SET mapred.reduce.tasks = 1
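
On more recent Hadoop versions the same setting is named
mapreduce.job.reduces, and Hive also has "merge" settings that concatenate
the small output files after the job finishes, which avoids forcing
everything through a single reducer. A rough sketch of what I would try
(double-check the property names and defaults on your version):

SET mapreduce.job.reduces=1;            -- newer name for mapred.reduce.tasks

-- or keep the parallelism and let Hive merge the outputs afterwards:
SET hive.merge.mapfiles=true;           -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;        -- merge outputs of MapReduce jobs
SET hive.merge.tezfiles=true;           -- same for Hive on Tez
SET hive.merge.sparkfiles=true;         -- same for Hive on Spark
SET hive.merge.smallfiles.avgsize=16000000;  -- avg file size below which a merge job is triggered
SET hive.merge.size.per.task=256000000;      -- target size of the merged files

Note that even with a single reducer, the suffix can still change on retries
(_1, _2, ...), so overwriting the whole partition folder is safer than
relying on the exact file name.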

And if you really suspect that some files are not being correctly overwritten
(as happened to me once with Spark on Azure blob storage), I suggest checking
the file timestamps to determine which execution of your query wrote each
file.
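
For example, from the Hive CLI (the same listing works with hadoop fs -ls or
aws s3 ls; the path is the same illustration as above):

dfs -ls ${OUTPUT_PATH}/daily_summary/datelocal=2018-08-01/;

The modification time column tells you which run of the query wrote each file.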


Hope this helps,

Furcy

On Wed, 8 Aug 2018 at 14:25, Sujeet Pardeshi <sujeet.parde...@sas.com>
wrote:

> Hi Deepak,
> Thanks for your response. The table is not bucketed or clustered. It can
> be seen below.
>
> DROP TABLE IF EXISTS ${SCHEMA_NM}.daily_summary;
> CREATE EXTERNAL TABLE ${SCHEMA_NM}.daily_summary
> (
>   bouncer VARCHAR(12),
>   device_type VARCHAR(52),
>   visitor_type VARCHAR(10),
>   visit_origination_type VARCHAR(65),
>   visit_origination_name VARCHAR(260),
>   pg_domain_name VARCHAR(215),
>   class1_id VARCHAR(650),
>   class2_id VARCHAR(650),
>   bouncers INT,
>   rv_revenue DECIMAL(17,2),
>   visits INT,
>   active_page_view_time INT,
>   total_page_view_time BIGINT,
>   average_visit_duration INT,
>   co_conversions INT,
>   page_views INT,
>   landing_page_url VARCHAR(1332),
>   dt DATE
>
> )
> PARTITIONED BY (datelocal DATE)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> LOCATION '${OUTPUT_PATH}/daily_summary/'
> TBLPROPERTIES ('serialization.null.format'='');
>
> MSCK REPAIR TABLE ${SCHEMA_NM}.daily_summary;
>
> Regards,
> Sujeet Singh Pardeshi
> Software Specialist
> SAS Research and Development (India) Pvt. Ltd.
> Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune, Maharashtra,
> 411 013
> off: +91-20-30418810
>
>  "When the solution is simple, God is answering…"
>
> -----Original Message-----
> From: Deepak Jaiswal <djais...@hortonworks.com>
> Sent: 07 August 2018 PM 11:19
> To: user@hive.apache.org
> Subject: Re: Hive output file 000000_0
>
> Hi Sujeet,
>
> I am assuming that the table is bucketed? If so, then the name represents
> which bucket the file belongs to as Hive creates 1 file per bucket for each
> operation.
>
> In this case, the file 000003_0 belongs to bucket 3.
> To always have files named 000000_0, the table must be unbucketed.
> I hope it helps.
>
> Regards,
> Deepak
>
> On 8/7/18, 1:33 AM, "Sujeet Pardeshi" <sujeet.parde...@sas.com> wrote:
>
>     Hi All,
>     I am doing an INSERT OVERWRITE operation through a Hive external table
> onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times
> I notice that it creates files with other names like 000003_0, etc. I
> always need to overwrite the existing file, but with inconsistent file names
> I am unable to do so. How do I force Hive to always create a consistent
> filename like 000000_0? Below is an example of what my code looks like,
> where tab_content is a Hive external table.
>
>     INSERT OVERWRITE TABLE tab_content
>     PARTITION(datekey)
>     select * from source
>
>     Regards,
>     Sujeet Singh Pardeshi
>     Software Specialist
>     SAS Research and Development (India) Pvt. Ltd.
>     Level 2A and Level 3, Cybercity, Magarpatta, Hadapsar  Pune,
> Maharashtra, 411 013
>     off: +91-20-30418810
>
>      "When the solution is simple, God is answering…"
>
>
>
