Hi
What is the reason that Spark still comes with Parquet 1.6.0rc3? It seems like
newer Parquet versions are available (e.g. 1.6.0). This would fix problems with
‘spark.sql.parquet.filterPushdown’, which is currently disabled by default
because of a bug in Parquet 1.6.0rc3.
Thanks!
Eric
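(For context on that flag: it can be switched per SQLContext. A rough sketch below, assuming a Spark 1.3/1.4-style API and a hypothetical path and column name; only worth enabling once a Parquet build with the fix is on the classpath.)

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("parquet-filter-pushdown"))
  val sqlContext = new SQLContext(sc)

  // Off by default in these releases because of the Parquet 1.6.0rc3 bug;
  // flip it on once a fixed Parquet build is on the classpath.
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

  // With pushdown enabled, simple predicates can be evaluated against
  // Parquet row-group statistics instead of scanning every row.
  val df = sqlContext.parquetFile("/path/to/parquet")   // hypothetical path
  df.filter(df("status") === 200).count()               // hypothetical column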
Spark was upgraded to Parquet 1.7.0 (which is exactly the same as 1.6.0, with the
package name renamed from com.twitter to org.apache.parquet) on the master branch
recently.
Cheng
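(For anyone depending on Parquet directly, the rename shows up in the artifact coordinates as well; a build.sbt sketch, same code base under new coordinates:)

  // Up to 1.6.0, Parquet artifacts were published under the com.twitter group id:
  libraryDependencies += "com.twitter" % "parquet-hadoop" % "1.6.0"

  // From 1.7.0 on, the group id (and the Java packages) are org.apache.parquet:
  libraryDependencies += "org.apache.parquet" % "parquet-hadoop" % "1.7.0"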
On 6/12/15 6:16 PM, Eric Eijkelenboom wrote:
Hi
What is the reason that Spark still comes with Parquet 1.6.0rc3? It seems like
newer Parquet versions are available (e.g. 1.6.0).
Q1: It is related to the number of files in the path.
Q2:
To reduce the number of partitions you can use rdd.repartition(x), where x is the
target number of partitions. Depending on your case, repartition can be a heavy
operation, since it shuffles the data.
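A small sketch of that trade-off (Spark 1.3-style API; the path and target count are placeholders): repartition always shuffles, while coalesce only merges existing partitions and is usually cheaper when strictly shrinking the count.

  val df = sqlContext.parquetFile("/path/to/parquet")   // placeholder path
  println(df.rdd.partitions.length)                     // partition count Spark chose on read

  // repartition(x) does a full shuffle, spreading the data evenly over x partitions
  val shuffled = df.rdd.repartition(200)

  // coalesce(x) merges existing partitions without a shuffle (by default),
  // which is cheaper when only reducing the count
  val merged = df.rdd.coalesce(200)
  println(merged.partitions.length)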
Regards.
Miguel.
On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom <eric.eijkelenb...@gmail.com> wrote:
Hello guys
Q1: How does Spark determine the number of partitions when reading a Parquet
file?
val df = sqlContext.parquetFile(path)
Is it some way related to the number of Parquet row groups in my input?
Q2: How can I reduce this number of partitions? Doing this:
df.rdd.coalesce(200).count
sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, the same as
with Spark 1.3.0.
Any help would be greatly appreciated!
Thanks a lot.
Eric
On 10 Apr 2015, at 16:46, Eric Eijkelenboom eric.eijkelenb...@gmail.com
wrote:
Hi Ted
Ah, I guess the term ‘source’ confused me :)
Loading takes around 30 minutes when working with Parquet files. Previously,
we used LZO files and did not experience this problem.
Bonus info:
This also happens when I use auto partition discovery (i.e.
sqlContext.parquetFile("/path/to/logsroot/")).
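(For reference, "auto partition discovery" means Spark treats key=value directories under the root as partition columns. A made-up layout, since the real log structure isn't shown here:)

  // Hypothetical directory layout under the root passed to parquetFile:
  //   /path/to/logsroot/year=2015/month=04/part-00000.parquet
  //   /path/to/logsroot/year=2015/month=05/part-00000.parquet
  // Reading the root discovers year/month as partition columns.
  val logs = sqlContext.parquetFile("/path/to/logsroot/")
  logs.printSchema()   // schema includes the discovered partition columns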
What can I do to avoid this?
Thanks in advance!
Eric Eijkelenboom