Cannot connect to Python process in Spark Streaming

2017-08-01 Thread canan chen
I ran the pyspark streaming example queue_streaming.py but ran into the following error. Does anyone know what might be wrong? Thanks. ERROR [2017-08-02 08:29:20,023] ({Stop-StreamingContext} Logging.scala[logError]:91) - Cannot connect to Python process. It's probably dead. Stopping

Re: How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Thanks Mark. I'll examine the status more carefully to observe this. From: Mark Hamstra Sent: Tuesday, August 1, 2017 11:25:46 AM To: user@spark.apache.org Subject: Re: How can i remove the need for calling cache Very likely, much of the

Re: How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Thanks Vadim. I'll try that From: Vadim Semenov Sent: Tuesday, August 1, 2017 12:05:17 PM To: jeff saremi Cc: user@spark.apache.org Subject: Re: How can i remove the need for calling cache You can use `.checkpoint()`: ``` val sc:

Re: How can i remove the need for calling cache

2017-08-01 Thread Vadim Semenov
You can use `.checkpoint()`: ``` val sc: SparkContext sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory") myrdd.checkpoint() val result1 = myrdd.map(op1(_)) result1.count() // Will save `myrdd` to HDFS and do map(op1… val result2 = myrdd.map(op2(_)) result2.count() // Will load `myrdd` from
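
A minimal self-contained sketch of the pattern Vadim describes, with the input path and the `op1`/`op2` functions as placeholders for whatever the real job does:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-example"))
    sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")

    val myrdd = sc.textFile("hdfs:///data/input")   // placeholder input
    myrdd.checkpoint()                              // mark the RDD for checkpointing

    // First action: saves `myrdd` to the checkpoint dir and runs op1.
    val result1 = myrdd.map(op1)
    result1.count()

    // Second action: `myrdd` is read back from the checkpoint dir instead of
    // being recomputed from its original lineage.
    val result2 = myrdd.map(op2)
    result2.count()
  }

  // op1/op2 stand in for the actual transformations.
  def op1(line: String): Int = line.length
  def op2(line: String): Int = line.split(" ").length
}
```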

Re: How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Here are the threads that describe the problems we're experiencing. These problems are exacerbated when we use cache/persist: https://www.mail-archive.com/user@spark.apache.org/msg64987.html https://www.mail-archive.com/user@spark.apache.org/msg64986.html So I am looking for a way to reproduce the same

[Spark SQL Lack of ForEach Sink in Python]: Is there anyway to use a ForEach sink in a Python application?

2017-08-01 Thread Denis Li
I am trying to use PySpark to read a Kafka stream and then write it to Redis. However, PySpark does not have support for a ForEach sink. So, I am thinking of reading the Kafka stream into a DataFrame in Python and then sending that DataFrame into a Scala application to be written to Redis. Is
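
One route is to keep the whole pipeline in Scala and use a `ForeachWriter`. This is only a rough sketch of that idea; the broker, topic, Redis endpoint, and the assumption that rows carry string `key`/`value` columns (and that the Jedis client is used) are all placeholders, not details from the original question:

```scala
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import redis.clients.jedis.Jedis

// Hypothetical writer: assumes each row has "key" and "value" string columns.
class RedisWriter(host: String, port: Int) extends ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)   // one connection per partition/epoch
    true
  }

  override def process(row: Row): Unit =
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))

  override def close(errorOrNull: Throwable): Unit =
    if (jedis != null) jedis.close()
}

object KafkaToRedis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-to-redis").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    stream.writeStream
      .foreach(new RedisWriter("redis-host", 6379))       // placeholder Redis endpoint
      .start()
      .awaitTermination()
  }
}
```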

Re: How can i remove the need for calling cache

2017-08-01 Thread Mark Hamstra
Very likely, much of the potential duplication is already being avoided even without calling cache/persist. When running the above code without `myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at least one of them you will likely see that many Stages are marked as "skipped",
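
A small illustration of the behaviour Mark describes, with a made-up input path: even without `cache()`, the second job reuses the shuffle files produced by the first, and the Spark UI marks the already-computed stages as "skipped".

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SkippedStagesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skipped-stages"))

    val shuffled = sc.textFile("hdfs:///data/input")   // placeholder input
      .map(line => (line.take(1), 1))
      .reduceByKey(_ + _)                              // introduces a shuffle

    shuffled.count()   // Job 1: runs both the map stage and the reduce stage.
    shuffled.count()   // Job 2: the map stage shows as "skipped" in the UI,
                       //        because its shuffle output already exists.
  }
}
```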

Re: How can i remove the need for calling cache

2017-08-01 Thread lucas.g...@gmail.com
Hi Jeff, that looks sane to me. Do you have additional details? On 1 August 2017 at 11:05, jeff saremi wrote: > Calling cache/persist fails all our jobs (i have posted 2 threads on > this). > > And we're giving up hope in finding a solution. > So I'd like to find a

How can i remove the need for calling cache

2017-08-01 Thread jeff saremi
Calling cache/persist fails all our jobs (I have posted 2 threads on this), and we're giving up hope of finding a solution. So I'd like to find a workaround: if I save an RDD to HDFS and read it back, can I use it in more than one operation? Example: (using cache) // do a whole bunch
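
A minimal sketch of the save-and-read-back workaround being asked about; the paths and the `.map(_.toUpperCase)` step are placeholders for the real pipeline:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveAndReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-and-reuse"))

    // Build the expensive RDD once and write it out instead of calling cache/persist.
    val expensive = sc.textFile("hdfs:///data/input")        // placeholder input
      .map(_.toUpperCase)                                     // stands in for the real work
    expensive.saveAsObjectFile("hdfs:///tmp/expensive-rdd")   // placeholder output path

    // Read it back; the reloaded RDD can feed any number of later operations,
    // each of which rereads HDFS rather than recomputing the original lineage.
    val reloaded = sc.objectFile[String]("hdfs:///tmp/expensive-rdd")
    println(reloaded.filter(_.contains("FOO")).count())
    println(reloaded.map(_.length).sum())
  }
}
```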

Re: Spark parquet file read problem !

2017-08-01 Thread ??????????
I have no idea about this ---Original--- From: "serkan taş" Date: 2017/7/31 16:42:59 To: "pandees waran";"??"<1427357...@qq.com>; Cc: "user@spark.apache.org"; Subject: Re: Spark parquet file read problem !

Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES Connector, but am not able to. I am passing es.nodes and es.port, along with es.nodes.wan.only set to true, but it fails with the error below: 34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x failed to respond); no
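
For reference, a rough sketch of how those elasticsearch-hadoop options are usually wired up in a Spark job; the endpoint, index name, input path, and SSL flag are placeholders, not the poster's actual values:

```scala
import org.apache.spark.sql.SparkSession

object EsWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("es-write").getOrCreate()

    val df = spark.read.json("hdfs:///data/events.json")   // placeholder input

    df.write
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "vpc-my-domain.us-east-1.es.amazonaws.com") // placeholder AWS ES endpoint
      .option("es.port", "443")
      .option("es.nodes.wan.only", "true")   // talk only to the public endpoint, not data nodes
      .option("es.net.ssl", "true")          // AWS managed ES is served over HTTPS
      .mode("append")
      .save("myindex/mytype")                // placeholder index/type
  }
}
```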

java.io.OptionalDataException during task deserialization

2017-08-01 Thread Viacheslav Krot
Hi, I have a weird issue: my Spark Streaming application fails once or twice a day with java.io.OptionalDataException. It happens when deserializing a task. The problem appeared after migrating from Spark 1.6 to Spark 2.2; the cluster runs the latest CDH distro in YARN mode. I have no clue what to blame and
