Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi, I can't reproduce your issue:
scala> spark.sql("select distinct * from dfv").show()
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  e|  f|  g|  h|  i|  j|  k|  l|  m|  n|  o|  p|

Re: Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread Sivaprakash
I would like to know how to create stream and sink operations outside the "main" method, for example in another class that I can invoke from main, so that I can have a different implementation for each subscribed topic in its own class file. Is it a good practice or always the whole

Spark FlatMapGroupsWithStateFunction throws cannot resolve 'named_struct()' due to data type mismatch 'SerializeFromObject"

2018-09-17 Thread Kuttaiah Robin
Hello, I am using FlatMapGroupsWithStateFunction in my Spark streaming application. FlatMapGroupsWithStateFunction idstateUpdateFunction = new FlatMapGroupsWithStateFunction() {.} The SessionUpdate class runs into trouble when the highlighted code is added, which throws the exception below. The same

Re: Metastore problem on Spark2.3 with Hive3.0

2018-09-17 Thread Dongjoon Hyun
Hi, Jerry. There is a JIRA issue for that, https://issues.apache.org/jira/browse/SPARK-24360 . So far, it's in progress for Hive 3.1.0 Metastore for Apache Spark 2.5.0. You can track that issue there. Bests, Dongjoon. On Mon, Sep 17, 2018 at 7:01 PM 白也诗无敌 <445484...@qq.com> wrote: > Hi, guys

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I think that makes sense. The main benefit of deprecating *prior* to 3.0 would be informational - making the community aware of the upcoming transition earlier. But there are other ways to start informing the community between now and 3.0, besides formal deprecation. I have some residual

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Reynold Xin
i'd like to second that. if we want to communicate timeline, we can add to the release notes saying py2 will be deprecated in 3.0, and removed in a 3.x release. -- excuse the brevity and lower case due to wrist injury On Mon, Sep 17, 2018 at 4:24 PM Matei Zaharia wrote: > That’s a good point

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
That’s a good point — I’d say there’s just a risk of creating a perception issue. First, some users might feel that this means they have to migrate now, which is before Python itself drops support; they might also be surprised that we did this in a minor release (e.g. might we drop Python 2

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is less clear. They only support Py3 on Windows, but there is no reference to any policy about Py2 on their roadmap or the

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't change the code at all; it's just a notification that we will eventually cease supporting Py2. Wouldn't users prefer to get that notification sooner rather than later? On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia wrote:

why display this error

2018-09-17 Thread hager
I run this code using
spark-submit --jars spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar kafka2.py localhost:9092 test
import sys
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from uuid import

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating it. Since it is not EOL yet, it might make sense to only deprecate it once it’s EOL (which is still over a year from now). Supporting Python 2+3 seems less burdensome than supporting, say, multiple Scala versions in

Re: Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread naresh Goud
You can use the statement below for multiple topics:
val dfStatus = spark.readStream
  .format("kafka")
  .option("subscribe", "utility-status, utility-critical")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .load()
On Mon,
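A minimal PySpark-flavoured sketch of the same idea, assuming an existing SparkSession is passed in and that the Kafka source package is on the classpath. The helper names `subscribe_value` and `read_topics` are hypothetical and not part of Spark; the topic names and broker address are the ones quoted in the reply above.

```python
def subscribe_value(topics):
    """Join topic names into the comma-separated string that the Kafka
    source's "subscribe" option expects (hypothetical helper)."""
    cleaned = [t.strip() for t in topics if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one topic name is required")
    return ",".join(cleaned)


def read_topics(spark, topics, bootstrap_servers="localhost:9092"):
    """Build one streaming DataFrame over several Kafka topics.

    `spark` is assumed to be an existing SparkSession; nothing is read
    until a streaming query is started on the returned DataFrame.
    """
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", subscribe_value(topics))
        .option("startingOffsets", "earliest")
        .load()
    )


# Usage, assuming a running session named `spark`:
#   df = read_topics(spark, ["utility-status", "utility-critical"])
```

Rows produced by the Kafka source carry a `topic` column, so per-topic handling can be split off afterwards with a filter on that column, and each branch can live in its own function or class outside main.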

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
If we're going to do that, then we need to do it right now, since 2.4.0 is already in release candidates. On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson wrote: > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem > like a ways off but even now there may be some spark

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem like a ways off but even now there may be some Spark versions supporting Py2 past the point where Py2 is no longer receiving security patches. On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra wrote: > We could also deprecate

Subscribe Multiple Topics Structured Streaming

2018-09-17 Thread sivaprakash
Hi, I have integrated Spark Streaming with Kafka, in which I'm listening to 2 topics:
def main(args: Array[String]): Unit = {
  val schema = StructType(
    List(
      StructField("gatewayId", StringType, true),
      StructField("userId", StringType, true)
    )
  )
  val spark =

Re: What is the best way for Spark to read HDF5@scale?

2018-09-17 Thread Saurav Sinha
Here is the solution: sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt") How did I find out nn1home:8020? Just search for the file core-site.xml and look for the XML element fs.defaultFS. On Fri, Sep 14, 2018 at 7:57 PM kathleen li wrote: > Hi, > Any Spark-connector for HDF5? > > The
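The lookup described in that reply can be sketched in a few lines of Python, assuming the standard Hadoop layout of `<property>` entries with `<name>` and `<value>` children. The function name `default_fs` and the sample XML are illustrative only; the sample simply reuses the address quoted in the message.

```python
import xml.etree.ElementTree as ET


def default_fs(core_site_xml):
    """Return the fs.defaultFS value from the text of a core-site.xml,
    or None when the property is absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "fs.defaultFS":
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical file contents, built around the address from the message.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn1home:8020</value>
  </property>
</configuration>
"""

print(default_fs(SAMPLE))  # prints hdfs://nn1home:8020
```

For a real cluster, the file's text would be read from wherever core-site.xml lives in the Hadoop configuration directory, which varies by installation.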

Re: Best practices on how to multiple spark sessions

2018-09-17 Thread Venkat Ramakrishnan
Umesh, I found the following write-up, which covers architecture and memory considerations in detail. There are updates on memory, but it would be a good start for you: https://0x0fff.com/spark-architecture/ Any additional source(s) of info are welcome from others too. - Venkat. On Sun, Sep