I'm converting CSV files to Parquet with CTAS and getting errors on some of the larger files.

With a source file of 16.34 GB (as reported in the HDFS explorer):

~~~
create table `/parquet/customer_20151017` partition by (date_tm) AS select * from `/csv/customer/customer_20151017.csv`;

Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected: >= 0)

Fragment 1:1

[Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010] (state=,code=0)
~~~

But the same operation on a 70 MB file of the same format succeeds.
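In case it matters, the statement really is a plain `select *`; a variant with an explicit, typed select list would look like the sketch below (the column names other than date_tm, and the types, are hypothetical placeholders for the real layout):

~~~
-- sketch only: cust_id and amount are placeholder column names; date_tm is the real partition column.
-- The CSV text reader produces VARCHAR fields, so each is cast before writing Parquet.
create table `/parquet/customer_20151017_typed`
partition by (date_tm)
as select
  cast(cust_id as bigint) as cust_id,
  date_tm,
  cast(amount as double)  as amount
from `/csv/customer/customer_20151017.csv`;
~~~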

Given that common HDFS advice is to avoid large numbers of small files [1], is there a general guideline for the maximum source file size to ingest into Parquet with CTAS?

---

[1] HDFS put performance is very poor with a large number of small files, so I'm trying to find the right amount of source roll-up to perform. Pointers to HDFS configuration guides for beginners would be appreciated too; I have only used HDFS with Drill and have no other Hadoop experience.
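A rough sketch of the kind of roll-up I have in mind on the Drill side (paths and the target table name are illustrative): since Drill treats a directory of same-layout files as a single table, one CTAS over a directory of small CSVs could stand in for rolling them up before the HDFS put:

~~~
-- sketch only: assumes `/csv/customer/` holds many small CSVs with the same layout
create table `/parquet/customer_rollup`
partition by (date_tm)
as select * from `/csv/customer/`;
~~~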
