Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
Some things just didn't work as I had first expected. For example, writing from a Spark collection to an Alluxio destination didn't persist the files to S3 automatically. I remember having to use the Alluxio library directly to force the files to persist to S3 after Spark finished writing to

Re: what is the difference between json format vs kafka format?

2017-05-13 Thread kant kodali
Hi, Here is a little bit of background. I've been using stateless streaming APIs for a while, such as JavaDStream, and they worked well. It has come to a point where we need to do real-time stateful streaming based on event time and other things, but for now I am just trying to get

Re: what is the difference between json format vs kafka format?

2017-05-13 Thread Tathagata Das
You can't do ".count()" directly on streaming DataFrames. This is because "count" is an Action (remember RDD actions) that executes and returns a result immediately, which can be done only when the data is bounded (e.g. batch/interactive queries). For streaming queries, you have to let it run in the

Re: what is the difference between json format vs kafka format?

2017-05-13 Thread kant kodali
Hi! Thanks for the response. It looks like from_json requires a schema ahead of time. Is there any function I can use to infer the schema from the JSON messages I am receiving through Kafka? I tried the code below, but I get the following exception. org.apache.spark.sql.AnalysisException:

Re: what is the difference between json format vs kafka format?

2017-05-13 Thread Tathagata Das
I understand the confusion. The "json" format is for JSON-encoded files being written to a directory. For Kafka, use the "kafka" format. Then, to decode the binary data as JSON, you can use the function "from_json" (Spark 2.1 and above). Here is our blog post on this.
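A minimal PySpark sketch of the approach described above, not taken from the thread: read from the "kafka" source, cast the binary "value" column to a string, and parse it with from_json. The helper names, the broker address, the topic name, and the schema are all hypothetical placeholders; running the streaming part also assumes the spark-sql-kafka connector package is on the classpath.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the option map for Spark's "kafka" source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
    }


def read_json_from_kafka(spark, options, schema):
    """Return a streaming DataFrame of parsed JSON records (hypothetical helper).

    `spark` is an existing SparkSession and `schema` is a StructType that
    must be supplied ahead of time, since from_json does not infer it.
    """
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.sql.functions import col, from_json

    raw = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers the payload as binary, so cast it to a string first,
    # then decode that string as JSON using the given schema.
    parsed = raw.select(
        from_json(col("value").cast("string"), schema).alias("data")
    )
    # Flatten the parsed struct into top-level columns.
    return parsed.select("data.*")
```

A call might look like read_json_from_kafka(spark, kafka_source_options("localhost:9092", "events"), schema), where the address, topic, and schema are assumptions chosen for illustration. By contrast, format("json") would point the reader at a directory of JSON files rather than at a Kafka topic.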

Re: Is GraphX really deprecated?

2017-05-13 Thread Jacek Laskowski
Hi, I'd like to hear the official statement too. My take on GraphX and Spark Streaming is that they are long-dead projects, with GraphFrames and Structured Streaming taking their place, respectively. Jacek On 13 May 2017 3:00 p.m., "Sergey Zhemzhitsky" wrote: > Hello Spark

Is GraphX really deprecated?

2017-05-13 Thread Sergey Zhemzhitsky
Hello Spark users, I would just like to know whether the GraphX component should be considered deprecated and no longer actively maintained, and whether new graph-processing projects on top of Spark should use other graph-processing frameworks instead. I'm asking

Re: what does this error mean?

2017-05-13 Thread Zeming Yu
Another error. Does anyone have any idea? This one happens when I try to convert a Spark DataFrame to pandas: ---Py4JError Traceback (most recent call

what does this error mean?

2017-05-13 Thread Zeming Yu
My code runs error-free on my local PC. I just tried running the same code on an Ubuntu machine on EC2, and got the error below. Any idea where to start in terms of debugging? ---Py4JError

what is the difference between json format vs kafka format?

2017-05-13 Thread kant kodali
Hi all, What is the difference between sparkSession.readStream.format("kafka") and sparkSession.readStream.format("json")? I am sending JSON-encoded messages to Kafka and I am not sure which of the two I should use. Thanks!

Re: Restful API Spark Application

2017-05-13 Thread vincent gromakowski
It's in Scala but it should be portable to Java https://github.com/vgkowski/akka-spark-experiments On 12 May 2017 10:54 PM, "Василец Дмитрий" wrote: and livy https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/ On Fri, May 12, 2017 at 10:51 PM,