Re: pyspark.sql.DataFrame write error to Postgres DB

2017-04-20 Thread Cinyoung Hur
Oops, sorry. I meant MariaDB. 2017-04-21 12:51 GMT+09:00 Takeshi Yamamuro : > Why do you use a MySQL JDBC driver? > > // maropu > > On Fri, Apr 21, 2017 at 11:54 AM, Cinyoung Hur > wrote: > >> Hi, >> >> I tried to insert a DataFrame into Postgres

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to DocumentDB (https://github.com/Azure/azure-documentdb-spark), if sticking with Scala? On Thu, Apr 20, 2017 at 21:46 Nan Zhu wrote: > DocDB does have a Java client, doesn't it? Is anything preventing you from using that? > >

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Nan Zhu
DocDB does have a Java client, doesn't it? Is anything preventing you from using that? Get Outlook for iOS From: ayan guha Sent: Thursday, April 20, 2017 9:24:03 PM To: Ashish Singh Cc: user Subject: Re: Azure Event Hub with Pyspark Hi yes,

Re: Azure Event Hub with Pyspark

2017-04-20 Thread ayan guha
Hi, yes, it's Scala only. I am looking for a PySpark version, as I want to write to DocumentDB, which has good Python integration. Thanks in advance, best, Ayan On Fri, Apr 21, 2017 at 2:02 PM, Ashish Singh wrote: > Hi, > > You can try

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Georg Heiler
Unfortunately, I think this currently might require the old API. Hemanth Gudela wrote on Fri., 21 Apr. 2017 at 05:58: > Idea #2 probably suits my needs better, because > > - Streaming query does not have a source database connector yet > > - My

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Ashish Singh
Hi, you can try https://github.com/hdinsight/spark-eventhubs, which is an Event Hub receiver for Spark Streaming. We are using it, but there is a Scala version only, I guess. Thanks, Ashish Singh On Fri, Apr 21, 2017 at 9:19 AM, ayan guha wrote:

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
Idea #2 probably suits my needs better, because: - Streaming query does not have a source database connector yet. - My source database table is big, so an in-memory table could be too huge for the driver to handle. Thanks for the cool ideas, TD! Regards, Hemanth From: Tathagata Das

Re: pyspark.sql.DataFrame write error to Postgres DB

2017-04-20 Thread Takeshi Yamamuro
Why do you use a MySQL JDBC driver? // maropu On Fri, Apr 21, 2017 at 11:54 AM, Cinyoung Hur wrote: > Hi, > > I tried to insert a DataFrame into a Postgres DB. > > But I don't know what causes this error. > > > properties = { > "user": "user", > "password": "pass", >

Azure Event Hub with Pyspark

2017-04-20 Thread ayan guha
Hi, I am not able to find any connector to be used to connect Spark Streaming with Azure Event Hub using PySpark. Does anyone know if such a library/package exists? -- Best Regards, Ayan Guha

pyspark.sql.DataFrame write error to Postgres DB

2017-04-20 Thread Cinyoung Hur
Hi, I tried to insert a DataFrame into a Postgres DB, but I don't know what causes this error. properties = { "user": "user", "password": "pass", "driver": "com.mysql.jdbc.Driver", } url = "jdbc:mysql://ip address/MYDB?useServerPrepStmts=false=true" df.write.jdbc(url=url,
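
A minimal PySpark sketch of what the write might look like against an actual Postgres target (host, database, table and credentials are placeholders, and the postgresql JDBC jar must be on the driver/executor classpath, e.g. via --jars):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# For a Postgres target, the driver class and URL should both be Postgres,
# not MySQL; everything below is a placeholder.
properties = {
    "user": "user",
    "password": "pass",
    "driver": "org.postgresql.Driver",
}
url = "jdbc:postgresql://db-host:5432/MYDB"

df.write.jdbc(url=url, table="my_table", mode="append", properties=properties)
```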

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
Hi Mo, I don't think it needs a shuffle, because the bloom filter only depends on data within each row group, not the whole dataset. But the HAR solution seems nice. I've thought of combining small files together and storing the offsets; I wasn't aware that HDFS provides such functionality. And after some

Please participate in a research survey on graphs

2017-04-20 Thread Siddhartha Sahu
Hi, My name is Siddhartha Sahu and I am a Master's student at University of Waterloo working on graph processing with Prof. Semih Salihoglu. As part of my research, I am running a survey on how graphs are used in the industry and academia. If you work on any kind of graph technology, such as

Re: Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, It appears that the bottleneck in my job was the EBS volumes. Very high I/O wait times across the cluster. I was only using 1 volume; increasing to 4 made it faster. Thanks, Pradeep On Thu, Apr 20, 2017 at 3:12 PM, Pradeep Gollakota wrote: > Hi All, > > I have a

Long Shuffle Read Blocked Time

2017-04-20 Thread Pradeep Gollakota
Hi All, I have a simple ETL job that reads some data, shuffles it and writes it back out. This is running on AWS EMR 5.4.0 using Spark 2.1.0. After Stage 0 completes and the job starts Stage 1, I see a huge slowdown in the job. The CPU usage is low on the cluster, as is the network I/O. From

Re: Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite Mode.APPEND

2017-04-20 Thread Subhash Sriram
Would it be an option to just write the results of each job into separate tables and then run a UNION on all of them at the end into a final target table? Just thinking of an alternative! Thanks, Subhash Sent from my iPhone > On Apr 20, 2017, at 3:48 AM, Rick Moritz wrote:
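
A rough PySpark sketch of that alternative, assuming each concurrent job has already written its output to its own staging table (all table names are placeholders):

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-staging-tables").getOrCreate()

# One staging table per concurrent job, combined into the final target
# only after all jobs have finished.
staging_tables = ["staging_job_1", "staging_job_2", "staging_job_3"]
frames = [spark.table(name) for name in staging_tables]
combined = reduce(lambda a, b: a.union(b), frames)

combined.write.mode("append").saveAsTable("final_target")
```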

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Tathagata Das
Here are a couple of ideas. 1. You can set up a Structured Streaming query to update an in-memory table. Look at the memory sink in the programming guide - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks So you can query the latest table using a specified
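
A minimal sketch of what idea #1 might look like, assuming the changing reference data can itself be consumed as a stream (the Kafka source, topic, query name and column names are placeholders, and the spark-sql-kafka package is assumed to be available):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("memory-sink-sketch").getOrCreate()

# Hypothetical stream of updates to the slowly changing reference data.
updates = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "reference_updates")
           .load()
           .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS value"))

# Maintain the incoming rows in an in-memory table via the memory sink,
# then query it by name from other (batch) queries.
query = (updates.writeStream
         .format("memory")
         .queryName("latest_reference")
         .outputMode("append")
         .start())

snapshot = spark.sql("SELECT * FROM latest_reference")
```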

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
Thanks, Georg, for your reply, but I'm not sure I fully understood your answer. If you meant joining two streams (one reading from Kafka, and another reading from a database table), then I think it's not possible, because 1. According to

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Georg Heiler
What about treating the static data as a (slow) stream as well? Hemanth Gudela wrote on Thu., 20 Apr. 2017 at 22:09: > Hello, > > > > I am working on a use case where there is a need to join streaming data > frame with a static data frame. > > The streaming

Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
Hello, I am working on a use case where there is a need to join a streaming data frame with a static data frame. The streaming data frame continuously gets data from Kafka topics, whereas the static data frame fetches data from a database table. However, as the underlying database table is getting
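
For context, a hedged sketch of the setup being described, i.e. a Kafka-backed streaming frame joined with a static frame loaded over JDBC (broker, topic, connection details and column names are placeholders, and the Kafka and JDBC driver packages are assumed to be on the classpath); keeping the static side up to date as the underlying table changes is the refresh problem in question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join-sketch").getOrCreate()

# Streaming side: records arriving on a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload"))

# Static side: a reference table read over JDBC.
reference = (spark.read
             .format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/refdb")
             .option("dbtable", "reference_table")
             .option("user", "user")
             .option("password", "pass")
             .load())

joined = events.join(reference, on="id", how="left")
```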

Re: Any Idea about this error: IllegalArgumentException: File segment length cannot be negative?

2017-04-20 Thread Victor Tso-Guillen
Along with Priya's email slightly earlier than this one, we are also seeing this happen on Spark 1.5.2. On Wed, Jul 13, 2016 at 1:26 AM Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > In a Spark Streaming job, I see a batch failed with the following error. Haven't > seen anything like

Maximum Partitioner size

2017-04-20 Thread Patrick GRANDJEAN
Hi, I have implemented a custom Partitioner (org.apache.spark.Partitioner) that contains a medium-sized object (some megabytes). Unfortunately Spark (2.1.0) fails with a StackOverflowError, and I suspect it is because of the size of the partitioner that needs to be serialized. My question is,
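
Not an answer to the size limit itself, but a rough PySpark sketch of one way to keep large partitioning state out of the serialized partitioner: put it in a broadcast variable and reference it from the partition function (the lookup table, key type and partition count are made-up placeholders; the original question concerns a Scala org.apache.spark.Partitioner):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical large lookup structure that drives partitioning decisions.
# Broadcasting it ships it to each executor once instead of serializing it
# with the partitioning closure.
big_lookup = {key: key % 64 for key in range(1000000)}
bc_lookup = sc.broadcast(big_lookup)

def partition_func(key):
    # Fall back to plain hashing for keys missing from the lookup.
    return bc_lookup.value.get(key, hash(key) % 64)

pairs = sc.parallelize([(i, i * i) for i in range(1000)])
repartitioned = pairs.partitionBy(64, partition_func)
print(repartitioned.getNumPartitions())
```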

Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Pushkar.Gujar
It can be as simple as: from pyspark.sql.functions import split; flight.withColumn('hour', split(flight.duration, 'h').getItem(0)) Thank you, Pushkar Gujar On Thu, Apr 20, 2017 at 4:35 AM, Zeming Yu wrote: > Any examples? > > On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)"
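
Expanded into a small runnable PySpark sketch (the sample durations are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("split-column-example").getOrCreate()

# Hypothetical flight data with durations like "7h 30m".
flight = spark.createDataFrame([("7h 30m",), ("12h 5m",)], ["duration"])

# Split on 'h' and keep the leading piece as the hour component.
flight = flight.withColumn("hour", split(flight.duration, "h").getItem(0))
flight.show()
```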

[sparkR] [MLlib]: Is word2vec implemented in SparkR MLlib?

2017-04-20 Thread Radhwane Chebaane
Hi, I've been experimenting with the Spark Word2vec implementation in the MLlib package with Scala and it was very nice. I need to use the same algorithm in R, leveraging the power of Spark's distributed execution with SparkR. I have been looking on the mailing list and Stack Overflow for any Word2vec
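
Whether SparkR exposes it is the open question here; for reference, a minimal PySpark sketch of the spark.ml Word2Vec API that the Scala experiment would correspond to (the toy corpus and parameters are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-sketch").getOrCreate()

# Toy corpus: each row is a tokenized document.
docs = spark.createDataFrame([
    ("Hi I heard about Spark".lower().split(" "),),
    ("I wish Java could use case classes".lower().split(" "),),
], ["text"])

word2vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2vec.fit(docs)
model.transform(docs).show(truncate=False)
```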

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Hi Ryan, The attachment is the event timeline on the executors. They are always busy computing. More executors would be helpful, but that's not up to me as a developer. 1. The bad performance could be caused by my poor implementation, as "checkID", being a user-defined function, would not be pushed down. 2. To
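
On point 1, a hedged PySpark sketch of the difference between a UDF filter and a built-in predicate that the optimizer can at least try to push down to the file format (column name, path and ID set are placeholders; exactly what gets pushed down depends on the Spark version and the storage format):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

wanted_ids = {"id-001", "id-002"}          # placeholder ID set
df = spark.read.parquet("/data/records")   # placeholder path

# A user-defined function is opaque to the optimizer, so every record is read
# and evaluated in Python:
check_id = udf(lambda i: i in wanted_ids, BooleanType())
slow = df.filter(check_id(col("id")))

# A built-in predicate such as isin() is visible to the optimizer and may be
# pushed down to the reader, letting it skip data via column statistics:
fast = df.filter(col("id").isin(*wanted_ids))
```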

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
It's a Hadoop archive. https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman Sent: April 20, 2017 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: How to store 10M records in HDFS to speed up

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
Forgive my ignorance, but what does HAR mean? An acronym for highly available record? Thanks Alonso Isidoro Roman about.me/alonso.isidoro.roman 2017-04-20 10:58

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
Hi Jörn, HAR is a great idea! For a POC, I've archived 1M records and stored the id -> path mapping in text (for better readability). Filtering 1K records takes only 2 minutes now (30 seconds to get the path list and 0.5 seconds per thread to read a record). Such performance is exactly what I
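
A hedged PySpark sketch of that POC flow — look up the paths for the wanted IDs from the text mapping, then read only those entries from the archive (file locations, the archive name and the ID format are placeholders, and it assumes har:// URIs are resolvable through the cluster's Hadoop configuration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("har-poc-sketch").getOrCreate()
sc = spark.sparkContext

# id -> path mapping kept as plain text, one "<id>\t<path inside the archive>" per line.
mapping = (sc.textFile("hdfs:///data/id_to_path.txt")
             .map(lambda line: tuple(line.split("\t", 1))))

wanted_ids = {"id-001", "id-002"}
paths = (mapping.filter(lambda kv: kv[0] in wanted_ids)
                .map(lambda kv: kv[1])
                .collect())

# Read only the selected records out of the Hadoop archive.
full_paths = ["har:///archives/records.har/" + p for p in paths]
records = sc.union([sc.binaryFiles(p) for p in full_paths])
```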

Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Zeming Yu
Any examples? On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)" wrote: > How about using `withColumn` and UDF? > > example: > + https://gist.github.com/zoltanctoth/2deccd69e3d1cde1dd78 > > +

Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite Mode.APPEND

2017-04-20 Thread Rick Moritz
Hi List, I'm wondering if the following behaviour should be considered a bug, or whether it "works as designed": I'm starting multiple concurrent (FIFO-scheduled) jobs in a single SparkContext, some of which write into the same tables. When these tables already exist, it appears as though both

checkpoint on spark standalone

2017-04-20 Thread Vivek Mishra
Hi, I am processing multiple CSV files (2 GB each) with my Spark application, which also does union and aggregation across all the input files. I am currently stuck with the error given below: java.lang.StackOverflowError at
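
A common way to hit StackOverflowError with a long chain of unions is an ever-growing lineage/plan; below is a hedged PySpark sketch of truncating it with checkpoints (paths, file count and the aggregation are placeholders; it requires a Spark version that exposes DataFrame.checkpoint(), and on older versions writing out and re-reading an intermediate result achieves the same effect):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder directory

paths = ["/data/file_{}.csv".format(i) for i in range(50)]        # placeholder inputs
frames = [spark.read.csv(p, header=True) for p in paths]

# Union the inputs incrementally, checkpointing every few files so that the
# lineage does not grow without bound.
combined = frames[0]
for i, df in enumerate(frames[1:], start=1):
    combined = combined.union(df)
    if i % 10 == 0:
        combined = combined.checkpoint()

result = combined.groupBy(combined.columns[0]).count()            # placeholder aggregation
```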

Re: Spark-shell's performance

2017-04-20 Thread Yan Facai
Hi, Hanson. Perhaps I'm digressing here; if I'm wrong or mistaken, please correct me. SPARK_WORKER_* is the configuration for the whole cluster, and it's fine to put those global variables in spark-env.sh. However, SPARK_DRIVER_* and SPARK_EXECUTOR_* are configuration for the application (your code),