Re: Avoiding collect but use foreach

2019-02-04 Thread
Hi, I think you can wrap your Python code in a UDF and call it inside foreachPartition. Aakash Basu wrote on Fri, Feb 1, 2019 at 3:37 PM:
> Hi,
> This:
> *to_list = [list(row) for row in df.collect()]*
> Gives:
> [[5, 1, 1, 1, 2, 1, 3, 1, 1, 0], [5, 4, 4, 5, 7, 10, 3, 2, 1, 0], [3, 1, 1, 1, 2,
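
A minimal sketch of the suggestion above, processing rows per partition instead of collecting everything to the driver; the DataFrame and the per-row handling are placeholders, not the poster's actual code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)  # stand-in for the real DataFrame

    def handle_partition(rows):
        # rows is an iterator of Row objects for one partition; run the
        # per-row Python logic here instead of materializing the whole
        # DataFrame on the driver with collect()
        for row in rows:
            values = list(row)
            # ... process values ...

    df.rdd.foreachPartition(handle_partition)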

Re: Connect to postgresql with pyspark

2018-04-30 Thread
Hi, what's the problem you are facing? 2018-04-30 6:15 GMT+08:00 dimitris plakas:
> I am new to PySpark and I am learning it in order to complete my thesis project at university.
> I am trying to create a dataframe by reading from a PostgreSQL database table,
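
A minimal sketch of reading a PostgreSQL table into a dataframe over JDBC, assuming the PostgreSQL JDBC driver jar is on the Spark classpath; the host, database, table and credentials are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "public.my_table")
          .option("user", "spark_user")
          .option("password", "secret")
          .option("driver", "org.postgresql.Driver")
          .load())

    df.printSchema()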

Re: spark dataframe jdbc Amazon RDS problem

2017-08-26 Thread
df.printSchema()
wtf = df.collect()
for i in wtf:
    print i
2017-08-27 1:00 GMT+08:00 刘虓 <ipf...@gmail.com>:
> Hi all,
> I came across this problem yesterday: I was using a data frame to read from an Amazon RDS MySQL table, and this exception came up:

spark dataframe jdbc Amazon RDS problem

2017-08-26 Thread
Hi all, I came across this problem yesterday: I was using a data frame to read from an Amazon RDS MySQL table, and this exception came up:
java.sql.SQLException: Invalid value for getLong() - 'id'
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
    at
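
For reference, a minimal sketch of the kind of JDBC read described here; the RDS endpoint, credentials and table name are placeholders, and the MySQL Connector/J jar is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://my-instance.xxxx.rds.amazonaws.com:3306/mydb")
          .option("dbtable", "my_table")
          .option("user", "spark_user")
          .option("password", "secret")
          .load())

    df.printSchema()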

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread
Hi, have you tried using explode? Chetan Khatri wrote on Tue, Jul 18, 2017 at 2:06 PM:
> Hello Spark devs,
> Can you please guide me on how to flatten JSON to multiple columns in Spark?
> *Example:*
> Sr No | Title | ISBN | Info
> 1 | Calculus Theory | 1234567890 | [{"cert":[{
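
A minimal sketch of the explode approach, assuming Info is an array-of-structs column; the input path and column names are guesses based on the truncated example above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = SparkSession.builder.getOrCreate()

    books = spark.read.json("books.json")  # hypothetical input

    flat = (books
            .withColumn("info", explode(col("Info")))  # one row per array element
            .select("Title", "ISBN", "info.*"))        # promote struct fields to columns
    flat.show(truncate=False)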

Re: Scala Vs Python

2016-09-06 Thread
Hi, I have been using Spark SQL with Python for more than a year, from version 1.5.0 to version 2.0.0. It has worked great so far and the performance has always been good, though I have not done a benchmark yet. I have also skimmed through the source code of the Python API; most of it only calls the Scala API, so nothing heavy is

Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread
Hi, I think you can refer to the Spark History Server to figure out how the time was spent. 2016-09-05 10:36 GMT+08:00 xiefeng:
> The Spark context will be reused, so the Spark context initialization won't affect the throughput test.
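
A minimal sketch, assuming Spark 2.x, of turning on event logging so finished applications show up in the Spark History Server; the log directory is a placeholder and must match spark.history.fs.logDirectory on the history server side:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.eventLog.enabled", "true")
             .config("spark.eventLog.dir", "hdfs:///spark-event-logs")
             .getOrCreate())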

spark-sql jdbc dataframe mysql data type issue

2016-06-25 Thread
Hi, I came across this strange behavior in Apache Spark 1.6.1: when I was reading a MySQL table into a Spark dataframe, a column of data type float got mapped to double. Dataframe schema:
root
 |-- id: long (nullable = true)
 |-- ctime: double (nullable = true)
 |-- atime: double (nullable =
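
A minimal sketch of working around the mapping by casting the affected columns back to float after the JDBC read; the column names are taken from the schema above and the read itself is elided:

    from pyspark.sql.functions import col

    # df is the dataframe obtained from the JDBC read described above
    fixed = (df
             .withColumn("ctime", col("ctime").cast("float"))
             .withColumn("atime", col("atime").cast("float")))
    fixed.printSchema()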

Re: DataFrame --> JSON objects, instead of un-named array of fields

2016-03-29 Thread
Hi, besides your solution, you can use df.write.format('json').save('a.json'). 2016-03-29 4:11 GMT+08:00 Russell Jurney:
> To answer my own question, DataFrame.toJSON() does this, so there is no need to map and json.dump():
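
A minimal sketch contrasting the two approaches, with a toy dataframe and a placeholder output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

    # DataFrame.toJSON() gives an RDD of JSON object strings
    json_strings = df.toJSON().collect()  # ['{"id":1,"name":"a"}', ...]

    # or write JSON objects straight to files
    df.write.format("json").save("a.json")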

Re: Restrictions on SQL operations on Spark temporary tables

2016-02-27 Thread
Hi, for now Spark SQL does not support subqueries; I guess that's the reason your query fails. 2016-02-27 20:01 GMT+08:00 Mich Talebzadeh:
> It appears that certain SQL on Spark temporary tables does not support Hive SQL, even when using HiveContext.
> Example:
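
A hypothetical illustration of rewriting an IN subquery as a join, which the Spark SQL of that era did accept; the table and column names are made up, since the original query is not shown:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # assumes 'orders' and 'customers' are registered as temporary tables;
    # instead of:
    #   SELECT * FROM orders WHERE customer_id IN
    #     (SELECT id FROM customers WHERE country = 'US')
    rewritten = spark.sql("""
        SELECT o.*
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        WHERE c.country = 'US'
    """)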

Re: Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-20 Thread
ease the partitions? Or are there any other alternatives I can choose to tune this?
> Best,
> Sun.
> --
> fightf...@163.com
> *From:* fightf...@163.com
> *Date:* 2016-01-20 15:06
> *To:* 刘虓 <ipf...@gmail.com>
> *CC:*

Re: spark dataframe jdbc read/write using dbcp connection pool

2016-01-19 Thread
Hi, I suggest you partition the JDBC read on an indexed column of the MySQL table. 2016-01-20 10:11 GMT+08:00 fightf...@163.com:
> Hi,
> I want to load really large-volume datasets from MySQL using the Spark dataframe API, and then save them as a parquet file or ORC file to
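
A minimal sketch of a partitioned JDBC read on an indexed numeric column, then saving to parquet; the URL, table, column, bounds, credentials and output path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.jdbc(
        url="jdbc:mysql://dbhost:3306/mydb",
        table="big_table",
        column="id",           # indexed numeric column to split on
        lowerBound=1,
        upperBound=10000000,   # approximate min/max of the column
        numPartitions=32,      # number of parallel JDBC connections
        properties={"user": "spark_user", "password": "secret"})

    df.write.parquet("hdfs:///tmp/big_table_parquet")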

Re: spark yarn client mode

2016-01-19 Thread
Hi, no, you don't need to. However, when submitting jobs, certain resources will be uploaded to HDFS, which could be a performance issue. Read the log and you will understand:
15/12/29 11:10:06 INFO Client: Uploading resource file:/data/spark/spark152/lib/spark-assembly-1.5.2-hadoop2.6.0.jar -> hdfs
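
A minimal sketch, assuming Spark 1.x in yarn-client mode, of pointing spark.yarn.jar at an assembly jar pre-staged on HDFS so the client does not re-upload it on every submit; the HDFS path is a placeholder:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.yarn.jar",
                 "hdfs:///spark/lib/spark-assembly-1.5.2-hadoop2.6.0.jar"))
    sc = SparkContext(conf=conf)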