Re: [Spark SQL] Does Spark group small files

2018-11-13 Thread Silvio Fiorito
Yes, it does bin-packing for small files, which is a good thing so you avoid having many small partitions, especially if you're writing this data back out (i.e. it's compacting as you read). The default partition size is 128MB, with a 4MB "cost" for opening files. You can configure this using the
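The reply is cut off in the archive, but the 128MB and 4MB defaults it mentions correspond to the Spark SQL settings `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` (both present in Spark 2.x). A minimal sketch of overriding them at session creation (app name and values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Raise the target partition size and the per-file open cost used when
// Spark packs small files into read partitions. Values are in bytes.
val spark = SparkSession.builder()
  .appName("small-file-packing")                             // illustrative name
  .config("spark.sql.files.maxPartitionBytes", 134217728L)   // 128 MB (the default)
  .config("spark.sql.files.openCostInBytes", 4194304L)       // 4 MB (the default)
  .getOrCreate()
```

The same keys can also be set per-session with `spark.conf.set(...)` before reading.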

[ANNOUNCE] Apache Toree 0.3.0-incubating Released

2018-11-13 Thread Luciano Resende
Apache Toree is a kernel for the Jupyter Notebook platform providing interactive and remote access to Apache Spark. The Apache Toree community is pleased to announce the release of Apache Toree 0.3.0-incubating which provides various bug fixes and the following enhancements. * Fix JupyterLab

[ANNOUNCE] Apache Bahir 2.2.2 Released

2018-11-13 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.2.2 which provides the following extensions for Apache

[ANNOUNCE] Apache Bahir 2.1.3 Released

2018-11-13 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.1.3 which provides the following extensions for Apache

inferred schemas for spark streaming from a Kafka source

2018-11-13 Thread Colin Williams
Does anybody know how to use inferred schemas with structured streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets I have some code like: object StreamingApp { def launch(config: Config,
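For a Kafka source specifically, the `value` column arrives as binary, so the streaming schema inference the linked doc describes (a file-source feature) does not apply. A common workaround, sketched below under the assumption that the payloads are JSON, is to infer the schema once from a static sample and apply it with `from_json`; the sample path, broker address, and topic name here are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().appName("kafka-inferred-schema").getOrCreate()
import spark.implicits._

// Infer the schema from a static batch read of the same JSON payloads.
// (hypothetical sample location)
val sampleSchema = spark.read.json("s3://my-bucket/sample-events/").schema

// Apply the inferred schema to the streaming Kafka value column.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "events")                     // hypothetical topic
  .load()
  .select(from_json($"value".cast("string"), sampleSchema).as("event"))
  .select("event.*")
```

The trade-off is that the schema is fixed at job start; if the payloads evolve, the job must be restarted with a fresh sample.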

[Spark SQL] Does Spark group small files

2018-11-13 Thread Yann Moisan
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small Parquet files from S3. When I do a mapPartitions followed by a collect, only *278* tasks are used (I would have expected 5000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any
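The task count can be estimated without a cluster. Below is a sketch of the greedy bin-packing Spark 2.3 applies to file sources: each partition is filled up to a split size derived from `spark.sql.files.maxPartitionBytes`, `spark.sql.files.openCostInBytes`, and the default parallelism. `estimatePartitions` is a hypothetical helper for illustration, not a Spark API, and the 5,000 one-megabyte files and parallelism of 200 are made-up inputs:

```scala
// Estimate how many read partitions Spark's file bin-packing produces.
def estimatePartitions(fileSizes: Seq[Long],
                       maxPartitionBytes: Long,
                       openCostInBytes: Long,
                       defaultParallelism: Int): Int = {
  // Each file is charged its length plus a fixed "open cost".
  val totalBytes = fileSizes.map(_ + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  // The effective split size is capped by maxPartitionBytes but never
  // smaller than the open cost.
  val maxSplitBytes =
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))

  var partitions = 0
  var currentSize = 0L
  var filesInPartition = 0
  for (size <- fileSizes) {
    // Close the current partition when the next file would overflow it.
    if (currentSize + size > maxSplitBytes && filesInPartition > 0) {
      partitions += 1; currentSize = 0L; filesInPartition = 0
    }
    currentSize += size + openCostInBytes
    filesInPartition += 1
  }
  if (filesInPartition > 0) partitions += 1
  partitions
}

val mb = 1024L * 1024L
// 5,000 one-megabyte files, default 128MB/4MB settings, parallelism 200:
val n = estimatePartitions(Seq.fill(5000)(1 * mb), 128 * mb, 4 * mb, 200)
println(n) // 200: each partition packs 25 files (25 × (1MB + 4MB) = 125MB)
```

With larger real file sizes the count drops further, which is consistent with 5,000 files collapsing to a few hundred tasks.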

Failed to convert java.sql.Date to String

2018-11-13 Thread luby
Hi, All, I'm new to Spark SQL and have just started to use it in our project. We are using Spark 2. When importing data from a Hive table, I got the following error: if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class