Hi,
I think you’re asking the right question; however, you’re assuming he’s on
the cloud, and he never mentioned the size of the files.
It could be that he’s got a lot of small-ish data sets. 1 GB is fairly small
in relative terms.
Again, YMMV.
Personally, if you’re going
You are right.
In general this job should deal with very small files and create an output
file of less than 100 MB.
In other cases I would need to create multiple files of around 100 MB.
The issue with partitioning is that decreasing the number of partitions will
reduce the ETL's performance, while this job
Why only one file?
I would go more for files of a specific size, e.g. the data split into 1 GB
files. The reason is also that if you need to transfer it (e.g. to other
clouds), having a single file of several terabytes is bad.
It depends on your use case, but you might also look at partitioning etc.
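For illustration, a minimal sketch of sizing the output that way, assuming a
SparkSession named spark and hypothetical paths; it estimates the input size
from the filesystem and targets roughly 1 GB per output file:

import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "/data/input"    // hypothetical
val df = spark.read.parquet(inputPath)

// Rough estimate: total input bytes divided by a ~1 GB target per file.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength
val numFiles = math.max(1, (totalBytes / (1024L * 1024 * 1024)).toInt)

df.repartition(numFiles).write.parquet("/data/output")   // hypothetical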
Hi,
I would like to develop a process that merges parquet files.
My first intention was to develop it with PySpark using coalesce(1) - to
create only 1 file.
This process is going to run on a huge number of files.
I wanted your advice on what is the best way to implement it (PySpark isn't a
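For reference, here is what that coalesce(1) approach looks like as a minimal
sketch in the Scala API used elsewhere in this thread (hypothetical paths;
spark is a SparkSession):

val df = spark.read.parquet("/input/parquet")   // hypothetical

// coalesce(1) funnels all data through a single task, so it only suits small inputs.
df.coalesce(1).write.mode("overwrite").parquet("/output/merged")   // hypothetical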
Thanks, Hyukjin.
I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t
work, I’ll try Kite.
Cheers,
Ben
> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon wrote:
>
> Hi Benjamin,
>
> As you might already know, I believe the Hadoop command
Hi Benjamin,
As you might already know, I believe the Hadoop command does not
automatically merge column-based formats such as ORC or Parquet; it simply
concatenates them.
I haven't tried this myself, but I remember I saw a JIRA in Parquet -
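For what it's worth, newer parquet-tools builds ship a merge subcommand that
stitches the row groups of several files together without decoding them; a
hypothetical invocation would be something like
`hadoop jar parquet-tools-1.9.0.jar merge <input files...> <output file>`
(check the usage text for your version). Note that it only concatenates row
groups, so many small row groups stay small inside the merged file.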
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them
into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark
cluster will not work; it just does not have the resources to do it. I'm
trying to do it from the command line, not with Spark. I will
You'll need to be running a very recent version of Spark SQL as this
feature was just added.
On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
Thanks for your reply. I'm trying to do what you suggested, but I get:
scala> sqlContext.sql("CREATE TEMPORARY TABLE data
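For context, the full statement was presumably something like the following
sketch against the then-new data sources API (table name from the snippet
above; the path is hypothetical):

sqlContext.sql(
  """CREATE TEMPORARY TABLE data
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path '/some/parquet/dir')""".stripMargin)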
Parquet does a lot of serial metadata operations on the driver, which makes
it really slow when you have a very large number of files (especially if
you are reading from something like S3). This is something we are aware of
and that I'd really like to improve in 1.3.
You might try the (brand new
Hi,
I'm ingesting a lot of small JSON files and converting them into unified
parquet files, but even the unified files are fairly small (~10 MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000
Hello,
I'm writing a process that ingests JSON files and saves them as parquet
files.
The process is as such:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
You can also insert into existing tables via .insertInto(tableName, overwrite).
You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
I'm writing a process that ingests JSON files and saves them as parquet files.
The process is as such:
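A minimal sketch of that suggestion against the Spark 1.1-era API from the
thread (the table name and path here are hypothetical):

import sqlContext._   // brings the implicits mentioned above into scope

val newRequests = sqlContext.jsonFile("/requests")            // hypothetical path
newRequests.insertInto("requests_parquet", overwrite = false) // append into an existing table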
Very cool, thank you!
On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote:
You can also insert into existing tables via .insertInto(tableName,
overwrite). You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com
wrote:
Another problem I have is that I get a lot of small JSON files, and as a
result a lot of small parquet files. I'd like to merge the JSON files into
a few parquet files. How do I do that?
You can use `coalesce` on any
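For example, a minimal sketch with the Spark 1.1-era API used earlier in the
thread (paths hypothetical, and assuming SchemaRDD's coalesce override):

val requests = sqlContext.jsonFile("/requests")   // many small JSON files

// Coalesce down to a handful of partitions so the output is a few larger parquet files.
requests.coalesce(4).saveAsParquetFile("/requests_parquet")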