Reading the Spark doc
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), it isn't
mentioned how to read parquet files in parallel with SparkSession. Would
--num-executors just work? Do any additional parameters need to be added to
the SparkSession as well?
Also if I want to parallel
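For what it's worth, a minimal sketch of that kind of read (the path and numbers are made up). As far as I understand, Spark splits a parquet scan into partitions on its own, so no extra SparkSession option is strictly required; --num-executors on spark-submit mainly controls how much capacity is available to run those tasks in parallel.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-read")
  .getOrCreate()

// hypothetical input path; parquet is split into partitions automatically
val df = spark.read.parquet("hdfs:///data/events")
println(df.rdd.getNumPartitions)   // number of tasks the scan will use

// submitted, for example, with:
//   spark-submit --num-executors 8 --executor-cores 4 ... myjob.jar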
ddIt.flatMap(_.toList)
> res4: org.apache.spark.rdd.RDD[MyClass] = MapPartitionsRDD[3] at flatMap at
> <console>:26
>
> res4 is what you're looking for.
>
> On Sat, 1 Dec 2018 at 21:09, Chris Teoh wrote:
>
>> Do you have the full code example?
>>
>> I think this would be
which should lead to spark
> being able to use the serializable versions.
>
> That’s very much a last resort though!
>
> Chris
>
> On 30 Nov 2018, at 05:08, Koert Kuipers wrote:
>
>> if you only use it in the executors, sometimes using lazy works
>>
>>
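A small sketch of the lazy-val trick as I understand it, assuming `spark` is an existing SparkSession; the class below is just a stand-in for whatever non-serializable helper is involved, not anything from the actual job.

// stand-in for a class that is expensive to build and not Serializable
class ExpensiveClient {
  def call(s: String): String = s.toUpperCase
}

object Clients {
  // lazy: constructed on first use, i.e. once per executor JVM,
  // so the instance never has to be serialized from the driver
  lazy val client = new ExpensiveClient
}

val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
val out = rdd.map(x => Clients.client.call(x))   // client is created on the executor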
When processing data, I create an instance of RDD[Iterable[MyCaseClass]] and I
want to convert it to RDD[MyCaseClass] so that it can be further converted to a
Dataset or DataFrame with the toDS() function. But I encounter the problem that
a SparkContext cannot be instantiated within a map function.
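If it helps, a minimal end-to-end sketch of the flatten-then-toDS() idea from the reply above (all names invented): flatten the RDD itself with flatMap, and only then call toDS(), so nothing touches the SparkSession or SparkContext inside a map function.

import org.apache.spark.sql.SparkSession

case class MyCaseClass(field1: String, field2: String)

val spark = SparkSession.builder().appName("flatten-example").getOrCreate()
import spark.implicits._   // brings toDS() into scope

// an RDD[Iterable[MyCaseClass]] built here only for illustration
val nested = spark.sparkContext.parallelize(Seq[Iterable[MyCaseClass]](
  Seq(MyCaseClass("a", "1"), MyCaseClass("b", "2")),
  Seq(MyCaseClass("c", "3"))
))

val flat = nested.flatMap(_.toList)   // RDD[MyCaseClass]
val ds   = flat.toDS()                // Dataset[MyCaseClass]
ds.show()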
This is not a problem directly caused by Spark, but it's related, so I'm asking
here. I use Spark to read data from parquet and make some HTTP calls with
sttp (https://github.com/softwaremill/sttp). However, Spark throws
Caused by: java.io.NotSerializableException:
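Not the real sttp code, but a sketch of the usual workaround for this exception: construct the non-serializable client inside mapPartitions (or behind a lazy val, as mentioned above) so it is created on the executor instead of being captured from the driver. The client class below is a stand-in, and `spark` is assumed to be an existing SparkSession.

// stand-in for a non-serializable HTTP backend
class HttpBackendStandIn {
  def get(url: String): String = scala.io.Source.fromURL(url).mkString
}

val urls = spark.sparkContext.parallelize(Seq(
  "http://example.com/a",
  "http://example.com/b"))

val responses = urls.mapPartitions { iter =>
  val backend = new HttpBackendStandIn   // built per partition, on the executor
  iter.map(u => backend.get(u))
}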
fully because I believe you are a bit confused.
>
> regards,
>
> Apostolos
>
> On 07/09/2018 05:39 PM, James Starks wrote:
>
> > Is df.write.mode(...).parquet("hdfs://..") also an action function? Checking
> > the doc shows that my spark doesn't use those act
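For reference, a tiny sketch of that kind of write (path, mode and data are placeholders, and `spark` is assumed to be an existing SparkSession). The read and the transformations are lazy; the terminal .parquet(...) call on the writer is what actually triggers the job, much like an action does on an RDD.

import org.apache.spark.sql.SaveMode
import spark.implicits._

// transformations only build up a plan, nothing runs yet
val df  = Seq(("1", "http://a"), ("2", "http://b")).toDF("id", "url")
val out = df.filter($"url".startsWith("http"))

// the write at the end is what kicks off execution
out.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/example/output")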
"hdfs directly from the executor". You
> can specify an hdfs file as your input and also you can use hdfs to
> store your output.
>
> regards,
>
> Apostolos
>
> On 07/09/2018 05:04 PM, James Starks wrote:
>
>
> > I have a Spark job that
I have a Spark job that reads data from a database. By increasing the submit
parameter '--driver-memory 25g' the job works without a problem locally, but not
in the prod env because the prod master does not have enough capacity.
So I have a few questions:
- What functions, such as collect(), would cause the
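A small sketch of the distinction that I think matters here (the input path is made up): collect() materializes every row on the driver, which is what forces the big --driver-memory, while a distributed write, take(n) or count() keeps the bulk of the data on the executors.

val df = spark.read.parquet("hdfs:///data/events")   // placeholder input

// pulls the entire result into driver memory -- this is what needs the 25g
// val allRows = df.collect()

// alternatives that leave the heavy lifting on the executors:
df.write.parquet("hdfs:///data/events_out")   // distributed write, nothing collected
val sample = df.take(20)                      // only a handful of rows on the driver
println(df.count())                           // a single number comes back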
, Sep 7, 2018 at 5:48 AM James Starks
> wrote:
>
>> I have a Spark job that reads from a postgresql (v9.5) table, and writes the
>> result to parquet. The code flow is not complicated, basically
>>
>> case class MyCaseClass(field1: String, field2: St
I have a Spark job that reads from a postgresql (v9.5) table, and writes the
result to parquet. The code flow is not complicated, basically
case class MyCaseClass(field1: String, field2: String)
val df = spark.read.format("jdbc")...load()
df.createOrReplaceTempView(...)
val newdf =
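The elided part of the flow will differ, but here is a hedged sketch of the overall shape, with invented connection options and query, assuming `spark` is the SparkSession:

import spark.implicits._

case class MyCaseClass(field1: String, field2: String)

// hypothetical connection settings, not the original ones
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "public.some_table")
  .option("user", "dbuser")
  .option("password", "secret")
  .load()

df.createOrReplaceTempView("some_table")

// illustrative query only
val newdf = spark.sql("SELECT field1, field2 FROM some_table WHERE field2 IS NOT NULL")

val typed = newdf.as[MyCaseClass]   // typed view matching the case class
typed.write.mode("overwrite").parquet("hdfs:///output/some_table")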
Accidentally I got it working, though I don't thoroughly understand why (so far
as I know, the configuration allows the executors to refer to the conf file
after it is copied to the executors' working dir). Basically it's a combination
of the parameters --conf, --files, and --driver-class-path, instead of any
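In case it helps, roughly the shape of a submit command that combines those three flags; the class name, file names and paths are placeholders, and this may not be the exact combination used in the thread:

spark-submit \
  --class my.app.Main \
  --files /local/path/application.conf \
  --driver-class-path /local/path \
  --conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
  target/myapp.jar

The idea is that --driver-class-path puts the directory holding application.conf on the driver's classpath so ConfigFactory.load() can find it, while --files copies the file into each executor's working directory, where the -Dconfig.file system property points at it.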
I have a config file that uses the Typesafe Config library, located on the
local file system, and I want to submit that file through spark-submit so that
the Spark program can read customized parameters. For instance,
my.app {
db {
host = domain.cc
port = 1234
db = dbname
user =
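And the reading side on the driver would look roughly like this; the keys follow the snippet above, everything else is a placeholder:

import com.typesafe.config.ConfigFactory

// loads application.conf from the classpath, or the file named by -Dconfig.file
val conf = ConfigFactory.load()

val host = conf.getString("my.app.db.host")
val port = conf.getInt("my.app.db.port")
val db   = conf.getString("my.app.db.db")
val user = conf.getString("my.app.db.user")

val jdbcUrl = s"jdbc:postgresql://$host:$port/$db"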
Now my Spark job can perform SQL operations against the database table. Next I
want to combine that with a streaming context, so I switched to the readStream()
function. But after job submission, Spark throws
Exception in thread "main" java.lang.UnsupportedOperationException: Data
source jdbc does
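As far as I know, the jdbc source only supports batch read()/write(); readStream() needs a streaming source such as a file-based one or Kafka. A minimal contrast, with made-up paths and schema, assuming `spark` is the SparkSession:

import org.apache.spark.sql.types._

// batch: jdbc works fine here
val batchDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "public.some_table")
  .load()

// streaming: use a source that supports it, e.g. a directory of parquet files
val schema   = new StructType().add("id", LongType).add("url", StringType)
val streamDf = spark.readStream.schema(schema).parquet("hdfs:///incoming/some_table")

streamDf.writeStream
  .format("parquet")
  .option("path", "hdfs:///output/stream")
  .option("checkpointLocation", "hdfs:///chk/stream")
  .start()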
dataframes and udf if possible. I think that would be
> a much scalable way to address the issue.
>
> Also, where possible, it is always advisable to use the filter option before
> fetching the data into Spark.
>
> Thanks and Regards,
> Gourav
>
> On Tue, Aug 7, 2018
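A hedged example of the "filter before fetching" advice: push the predicate into the JDBC source by handing it a subquery as dbtable, so only the pre-filtered rows ever leave the database (table, column and connection details are placeholders).

val filtered = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "(SELECT id, url, col_b FROM table_a WHERE col_b <> '') AS t")
  .option("user", "dbuser")
  .option("password", "secret")
  .load()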
I am very new to Spark. I just successfully set up Spark SQL connecting to a
postgresql database, and am able to display a table with the code
sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show()
Now I want to perform a filter and a map on the col_b value. In plain scala it
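A rough sketch of doing that filter-and-map with DataFrame operations and a UDF, as suggested upthread; the transformation itself and the connection options are only examples.

import org.apache.spark.sql.functions.udf

val df = sparkSession.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "table_a")
  .option("user", "dbuser")
  .option("password", "secret")
  .load()

import sparkSession.implicits._

// example transformation; replace with the real col_b logic
val normalize = udf((s: String) => s.trim.toLowerCase)

df.filter($"col_b" =!= "")
  .select($"id", $"url", normalize($"col_b").as("col_b_normalized"))
  .show()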