Parallel read parquet file, write to postgresql

2018-12-03 Thread James Starks
Reading the Spark doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). It doesn't mention how to read a parquet file in parallel with SparkSession. Would --num-executors just work? Do any additional parameters need to be added to SparkSession as well? Also if I want to parallel
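
For context, a minimal sketch of one way this is usually wired up (an existing SparkSession `spark` is assumed; the paths, table name, and credentials are invented). Parquet reads are already split into tasks across the executors, so --num-executors/--executor-cores mainly decide how many of those tasks run at once, rather than requiring extra SparkSession parameters:

    // Parquet read: Spark creates one task per file split, run across executors
    val df = spark.read.parquet("hdfs:///data/input.parquet")

    // repartition() controls how many partitions -- and hence how many concurrent
    // JDBC connections -- the write uses against PostgreSQL
    df.repartition(8)
      .write
      .mode("append")
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "target_table")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .save()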

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread James Starks
ddIt.flatMap(_.toList) > res4: org.apache.spark.rdd.RDD[MyClass] = MapPartitionsRDD[3] at flatMap at > :26 > > res4 is what you're looking for. > > On Sat, 1 Dec 2018 at 21:09, Chris Teoh wrote: > >> Do you have the full code example? >> >> I think this would be
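
A self-contained version of the flatMap approach quoted above (MyCaseClass, the sample data, and the SparkSession `spark` are assumed for illustration):

    import org.apache.spark.rdd.RDD
    import spark.implicits._

    case class MyCaseClass(field1: String, field2: String)

    val rddIt: RDD[Iterable[MyCaseClass]] = spark.sparkContext.parallelize(Seq(
      Seq(MyCaseClass("a", "1"), MyCaseClass("b", "2")),
      Seq(MyCaseClass("c", "3"))
    ))

    // Flatten RDD[Iterable[MyCaseClass]] down to RDD[MyCaseClass]
    val flat: RDD[MyCaseClass] = rddIt.flatMap(_.toList)

    // toDS()/toDF() now work on the flattened RDD
    val ds = flat.toDS()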

Re: Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-30 Thread James Starks
ich should lead to spark > being able to use the serializable versions. > > That’s very much a last resort though! > > Chris > > On 30 Nov 2018, at 05:08, Koert Kuipers wrote: > >> if you only use it in the executors sometimes using lazy works >> >>
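
A hedged sketch of the "lazy" workaround quoted above: keep the non-serializable backend inside an object so each executor builds its own instance instead of having it captured in a closure and shipped from the driver. The Http object, fetch() helper, and URL are made up, and the sttp 1.x calls are from memory, so treat them as illustrative:

    import com.softwaremill.sttp._

    object Http {
      // Created lazily, once per executor JVM, never serialized from the driver
      lazy val backend = HttpURLConnectionBackend()

      def fetch(id: String): String = {
        implicit val b = backend
        sttp.get(uri"https://example.org/item/$id").send().body.fold(_ => "", identity)
      }
    }

    // usage inside a transformation that runs on the executors, e.g.:
    // df.as[String].map(id => Http.fetch(id))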

Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-11-30 Thread James Starks
When processing data, I create an instance of RDD[Iterable[MyCaseClass]] and I want to convert it to RDD[MyCaseClass] so that it can be further converted to a dataset or dataframe with the toDS() function. But I encounter a problem: SparkContext cannot be instantiated within SparkSession.map

Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-29 Thread James Starks
This is not a problem directly caused by Spark, but it's related, thus asking here. I use Spark to read data from parquet and to make some HTTP calls with sttp (https://github.com/softwaremill/sttp). However, Spark throws Caused by: java.io.NotSerializableException:

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
fully because I believe you are a bit confused. > > regards, > > Apostolos > > On 07/09/2018 05:39 μμ, James Starks wrote: > > > Is df.write.mode(...).parquet("hdfs://..") also actions function? Checking > > doc shows that my spark doesn't use those act

Re: Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
quot;hdfs directly from the executor". You > can specify an hdfs file as your input and also you can use hdfs to > store your output. > > regards, > > Apostolos > > On 07/09/2018 05:04 μμ, James Starks wrote: > > > > I have a Spark job that

Spark job's driver program consumes too much memory

2018-09-07 Thread James Starks
I have a Spark job that reads data from a database. By increasing the submit parameter '--driver-memory 25g' the job works without a problem locally, but not in the prod env because the prod master does not have enough capacity. So I have a few questions: - What functions such as collect() would cause the
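
A rough sketch of the distinction behind the question (connection details and paths are placeholders; a SparkSession `spark` is assumed): collect()-style actions pull the whole result into the driver JVM, which is what drives up --driver-memory, while writing out from the executors does not:

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "big_table")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()

    // Materialises every row on the driver -- avoid for large results:
    // val rows = df.collect()

    // Runs entirely on the executors; the driver only coordinates:
    df.write.mode("overwrite").parquet("hdfs:///output/big_table.parquet")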

Re: [External Sender] How to debug Spark job

2018-09-07 Thread James Starks
, Sep 7, 2018 at 5:48 AM James Starks > wrote: > >> I have a Spark job that reads from a postgresql (v9.5) table, and write >> result to parquet. The code flow is not complicated, basically >> >> case class MyCaseClass(field1: String, field2: St

How to debug Spark job

2018-09-07 Thread James Starks
I have a Spark job that reads from a postgresql (v9.5) table and writes the result to parquet. The code flow is not complicated, basically case class MyCaseClass(field1: String, field2: String) val df = spark.read.format("jdbc")...load() df.createOrReplaceTempView(...) val newdf =
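
A fleshed-out sketch of the flow the snippet describes (connection details, query, and field names are placeholders; a SparkSession `spark` is assumed):

    case class MyCaseClass(field1: String, field2: String)

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "source_table")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()

    df.createOrReplaceTempView("source_view")

    import spark.implicits._
    val newdf = spark.sql("SELECT field1, field2 FROM source_view WHERE field2 <> ''")
      .as[MyCaseClass]

    newdf.write.mode("overwrite").parquet("hdfs:///output/myjob")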

Re: Pass config file through spark-submit

2018-08-17 Thread James Starks
Accidentally got it working, though I don't thoroughly understand why (so far as I know, the configuration allows the executors to refer to the conf file after it is copied to the executors' working dir). Basically it's a combination of the parameters --conf, --files, and --driver-class-path, instead of any
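
A hedged reconstruction of that combination (flag values, file name, and config keys are illustrative, not taken verbatim from the thread):

    // Assumed submit command, with the working directory added to both classpaths:
    //
    //   spark-submit \
    //     --files application.conf \
    //     --driver-class-path . \
    //     --conf spark.executor.extraClassPath=. \
    //     --class my.app.Main my-app.jar
    //
    // With the file shipped via --files and the working directory on the classpath,
    // Typesafe Config can pick up application.conf by name on driver and executors.
    import com.typesafe.config.{Config, ConfigFactory}

    val conf: Config = ConfigFactory.load()          // finds application.conf on the classpath
    val dbHost = conf.getString("my.app.db.host")
    val dbPort = conf.getInt("my.app.db.port")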

Pass config file through spark-submit

2018-08-16 Thread James Starks
I have a config file that uses the Typesafe Config library, located on the local file system, and I want to submit that file through spark-submit so that the Spark program can read customized parameters. For instance, my.app { db { host = domain.cc port = 1234 db = dbname user =

Data source jdbc does not support streamed reading

2018-08-08 Thread James Starks
Now my Spark job can perform SQL operations against a database table. Next I want to combine that with a streaming context, so I'm switching to the readStream() function. But after job submission, Spark throws Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does
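
A minimal sketch of the difference being hit here (options abbreviated; a SparkSession `spark` is assumed): the batch reader accepts the jdbc source, but the structured-streaming reader does not:

    val batchDf = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "source_table")
      .load()                      // works: plain batch read

    val streamDf = spark.readStream.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "source_table")
      .load()                      // throws UnsupportedOperationException:
                                   // Data source jdbc does not support streamed reading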

Re: Newbie question on how to extract column value

2018-08-07 Thread James Starks
ames and udf if possible. I think, that would be > a much scalable way to address the issue. > > Also in case possible, it is always advisable to use the filter option before > fetching the data to Spark. > > Thanks and Regards, > Gourav > > On Tue, Aug 7, 2018

Newbie question on how to extract column value

2018-08-07 Thread James Starks
I am very new to Spark. I just successfully set up Spark SQL connecting to a postgresql database, and am able to display a table with the code sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show() Now I want to perform filter and map functions on the col_b value. In plain Scala it
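
A small sketch of doing the filter and the per-value transformation with DataFrame operations (the table_a/col_b/id/url names come from the question; connection details and the normalisation logic are made up):

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/dbname")
      .option("dbtable", "table_a")
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()

    // Apply the filter before doing any per-row work
    val nonEmpty = df.filter($"col_b" =!= "")

    // Example transformation on col_b expressed as a UDF
    val normalize = udf((s: String) => s.trim.toLowerCase)
    nonEmpty.select($"id", $"url", normalize($"col_b").as("col_b_normalized")).show()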