I'm converting CSV files to Parquet with CTAS and getting errors on some of the
larger files. With a source file of 16.34 GB (as reported in the HDFS explorer):
~~~
CREATE TABLE `/parquet/customer_20151017` PARTITION BY (date_tm) AS
SELECT * FROM `/csv/customer/customer_20151017.csv`;
Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected:
>= 0)
Fragment 1:1
[Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010]
(state=,code=0)
~~~
But the same operation on a 70 MB file of the same format succeeds.
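For reference, one variant I have been considering casts each column explicitly
instead of using SELECT *, which I believe the Drill docs suggest when writing
Parquet via CTAS. This is a hedged sketch only: the column names and types
below are placeholders, not my real schema, and it assumes the CSV storage
format has extractHeader enabled:
~~~
-- Sketch with placeholder columns; assumes extractHeader is on for the CSV
-- format, otherwise the fields would be addressed as columns[0], columns[1], ...
CREATE TABLE `/parquet/customer_20151017` PARTITION BY (date_tm) AS
SELECT
  CAST(date_tm AS DATE)       AS date_tm,
  CAST(customer_id AS BIGINT) AS customer_id,
  CAST(amount AS DOUBLE)      AS amount
FROM `/csv/customer/customer_20151017.csv`;
~~~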
Given that the common HDFS advice is to avoid large numbers of small files [1],
is there a general guideline for the maximum source file size to ingest into
Parquet with CTAS?
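In case it helps frame the question: one workaround I could try is converting
in slices by filtering on the partition column, so each CTAS run touches a
bounded amount of input. A rough sketch, assuming date_tm parses cleanly as a
date (the target table name and the date value are placeholders):
~~~
-- Sketch: convert one day per CTAS run to bound the input size per run.
CREATE TABLE `/parquet/customer_20151017_slice1` PARTITION BY (date_tm) AS
SELECT * FROM `/csv/customer/customer_20151017.csv`
WHERE CAST(date_tm AS DATE) = DATE '2015-10-17';
~~~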
---
[1] HDFS put performance is very poor with a large number of small files, so
I'm trying to find the right amount of source roll-up to perform. Pointers to
HDFS configuration guides for beginners would also be appreciated; I have only
used HDFS with Drill and have no other Hadoop experience.