Re: Structured streaming coding question

2017-09-19 Thread Burak Yavuz
Hey Kant, That won't work either. Your second query may fail, and as long as your first query is running, you will not know. Put this as the last line instead: spark.streams.awaitAnyTermination() On Tue, Sep 19, 2017 at 10:11 PM, kant kodali wrote: > Looks like my problem was the order of awai
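A minimal Scala sketch of the pattern Burak describes, reusing the query and sink names from the thread (KafkaSink, outputDS1 and outputDS2 come from kant's code, so this only compiles in that context):

    import org.apache.spark.sql.streaming.Trigger

    // start both queries first; only then block, so a failure in either one surfaces
    val query1 = outputDS1.writeStream
      .trigger(Trigger.ProcessingTime(1000))
      .foreach(new KafkaSink("hello1"))
      .start()
    val query2 = outputDS2.writeStream
      .trigger(Trigger.ProcessingTime(1000))
      .foreach(new KafkaSink("hello2"))
      .start()

    // waits until any of the running queries terminates (or fails)
    spark.streams.awaitAnyTermination()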

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi, Cody's right. subscribe - Topic subscription strategy that accepts topic names as a comma-separated string, e.g. topic1,topic2,topic3 [1] subscribepattern - Topic subscription strategy that uses Java’s java.util.regex.Pattern for the topic subscription regex pattern of topics to subscribe to
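For reference, a short Scala sketch of both options against the Kafka source (broker address and topic names are placeholders, and an existing SparkSession named spark is assumed):

    // explicit comma-separated topic names
    val byNames = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1,topic2,topic3")
      .load()

    // or a regex pattern matching the topics to subscribe to
    val byPattern = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribePattern", "topic.*")
      .load()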

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason. *Doesn't work * outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new KafkaSink("hello1")).start().awaitTermination() outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new KafkaSink

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, Ah, right! Start the queries and once they're running, awaitTermination them. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spar

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, What's the code in readFromKafka to read from hello2 and hello1? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me a

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason. Doesn't work On Tue, Sep 19, 2017 at 1:54 PM, kant kodali wrote: > Hi All, > > I have the following pseudo code (I could paste the real code however it > is pretty long and involves Database calls inside dataset.map

RE: Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Sudha KS
To set Spark2 as default, refer https://www.cloudera.com/documentation/spark2/latest/topics/spark2_admin.html#default_tools -Original Message- From: Gaurav1809 [mailto:gauravhpan...@gmail.com] Sent: Wednesday, September 20, 2017 9:16 AM To: user@spark.apache.org Subject: Cloudera - How t
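If I recall correctly, the Spark 2 parcel also installs its own command-line tools alongside the Spark 1.6 ones, so Spark 2 can be used immediately without changing the defaults (the linked page covers making them the default):

    spark2-shell
    spark2-submit
    pyspark2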

Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Gaurav1809
Hello all, I downloaded CDH and it comes with Spark 1.6. As per the step-by-step guide, I added Spark 2 to the services list. Now I can see both Spark 1.6 & Spark 2, but when I do spark-shell in a terminal window, it starts with Spark 1.6 only. How do I switch to Spark 2? What all _HOMEs or param

unsubscribe

2017-09-19 Thread shshann

unsubscribe

2017-09-19 Thread Brindha Sengottaiyan

Structured streaming coding question

2017-09-19 Thread kant kodali
Hi All, I have the following pseudo code (I could paste the real code, however it is pretty long and involves Database calls inside dataset.map operation and so on), so I am just trying to simplify my question. I would like to know if there is something wrong with the following pseudo code? DataSet i

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Cody Koeninger
You should be able to pass a comma-separated string of topics to subscribe. subscribePattern isn't necessary On Tue, Sep 19, 2017 at 2:54 PM, kant kodali wrote: > got it! Sorry. > > On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: >> >> Hi, >> >> Use subscribepattern >> >> You haven't

Re: SVD computation limit

2017-09-19 Thread Vadim Semenov
This may also be related to https://issues.apache.org/jira/browse/SPARK-22033 On Tue, Sep 19, 2017 at 3:40 PM, Mark Bittmann wrote: > I've run into this before. The EigenValueDecomposition creates a Java > Array with 2*k*n elements. The Java Array is indexed with a native integer > type, so 2*k*

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
got it! Sorry. On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: > Hi, > > Use subscribepattern > > You haven't googled well enough --> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html :) > > Pozdrawiam, > Jacek Laskowski > > ht

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi, Use subscribepattern You haven't googled well enough --> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-

How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
Hi All, I am wondering how to read from multiple kafka topics using structured streaming (code below)? I googled prior to asking this question and I see responses related to DStreams but not structured streams. Is it possible to read multiple topics using the same spark structured stream? sparkSe

Re: SVD computation limit

2017-09-19 Thread Mark Bittmann
I've run into this before. The EigenValueDecomposition creates a Java Array with 2*k*n elements. The Java Array is indexed with a native integer type, so 2*k*n cannot exceed Integer.MAX_VALUE values. The array is created here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/a
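A quick check of that limit with the values from the question (a sketch, not from the thread):

    // 2*k*n must fit in a Java int; Int.MaxValue is 2147483647
    val k = 49865L
    val n = 191077L
    val required = 2L * k * n            // 19,056,109,210 elements, roughly 9x over the limit
    val fits = required <= Int.MaxValue  // false
    val maxK = Int.MaxValue / (2L * n)   // ~5619, the largest k that fits for this n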

Re: Question on partitionColumn for a JDBC read using a timestamp from MySql

2017-09-19 Thread lucas.g...@gmail.com
I ended up doing this for the time being. It works, but I *think* a timestamp seems like a rational partitionColumn, and I'm wondering if there's a more built-in way: > df = spark.read.jdbc(url=os.environ["JDBCURL"], table="schema.table", predicates=predicates) >
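For completeness, a Scala sketch of the predicates approach (column name, date, and table are illustrative, not from the thread); as far as I know, partitionColumn in Spark 2.2 has to be numeric, which is why explicit predicates are used here to partition on a timestamp:

    import java.util.Properties
    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter

    // one predicate per hour; each predicate becomes one partition of the read
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    val start = LocalDateTime.of(2017, 9, 19, 0, 0)
    val predicates = (0 until 24).map { h =>
      val lo = start.plusHours(h).format(fmt)
      val hi = start.plusHours(h + 1).format(fmt)
      s"created_at >= '$lo' AND created_at < '$hi'"
    }.toArray

    val df = spark.read.jdbc(sys.env("JDBCURL"), "schema.table", predicates, new Properties())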

Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Have you tried to cache? Maybe after the collect() and before the map? > On Sep 19, 2017, at 7:20 AM, Daniel O' Shaughnessy > wrote: > > Thanks for your response Jean. > > I managed to figure this out in the end but it's an extremely slow solution > and not tenable for my use-case: > >

SVD computation limit

2017-09-19 Thread Alexander Ovcharenko
Hello guys, while trying to compute SVD using the computeSVD() function, I am getting the following warning, followed by the exception below: 17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and n=191077, please check necessity IllegalArgumentException: u'requirement failed: k = 49865 and/or n

Re: Nested RDD operation

2017-09-19 Thread ayan guha
How big is the list of fruits in your example? Can you broadcast it? On Tue, 19 Sep 2017 at 9:21 pm, Daniel O' Shaughnessy < danieljamesda...@gmail.com> wrote: > Thanks for your response Jean. > > I managed to figure this out in the end but it's an extremely slow > solution and not tenable for my
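A minimal sketch of the broadcast approach (names and sample data are illustrative, assuming the fruit list is small enough to ship to every executor):

    // broadcast the small lookup list once, then reference it inside map
    // instead of nesting RDD operations
    val fruits = Set("apple", "banana", "orange")
    val fruitsBc = spark.sparkContext.broadcast(fruits)
    val eventsRdd = spark.sparkContext.parallelize(Seq("apple, pear", "kiwi, banana"))
    val matched = eventsRdd.map { event =>
      event.split(",").map(_.trim).filter(fruitsBc.value.contains)
    }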

Uses of avg hash probe metric in HashAggregateExec?

2017-09-19 Thread Jacek Laskowski
Hi, I've just noticed the new "avg hash probe" metric for HashAggregateExec operator [1]. My understanding (after briefly going through the code in the change and around) is as follows: Average hash map probe per lookup (i.e. `numProbes` / `numKeyLookups`) NOTE: `numProbes` and `numKeyLookups`

Re: Nested RDD operation

2017-09-19 Thread Daniel O' Shaughnessy
Thanks for your response Jean. I managed to figure this out in the end but it's an extremely slow solution and not tenable for my use-case: val rddX = dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim.replaceAll("[\\[\\]\"]", "")).toList) //val oneRow = Converted(ev

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-19 Thread HARSH TAKKAR
Thanks Cody, It worked for me by keeping the number of executors (each with 1 core) equal to the number of Kafka partitions. On Mon, Sep 18, 2017 at 8:47 PM Cody Koeninger wrote: > Have you searched in jira, e.g. > > https://issues.apache.org/jira/browse/SPARK-19185 > > On Mon, Sep 18, 2017 at 1:56 AM, HARSH
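A sketch of that sizing expressed as Spark conf (10 stands in for the topic's partition count; the same settings can equally be passed to spark-submit):

    import org.apache.spark.SparkConf

    // one core per executor so a cached Kafka consumer is never used by two
    // tasks concurrently; as many executors as the topic has partitions
    val conf = new SparkConf()
      .set("spark.executor.instances", "10")
      .set("spark.executor.cores", "1")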

Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-19 Thread Aakash Basu
Hi, I have a CSV dataset with a column that holds the store open and close timings per day, but the data is highly variant, as follows - Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm Mon-Sat 10am-8pm, Sun Closed Mon-Sat 10am-8pm, Sun 10am-6pm Mon-Friday 9-8 / Saturday 10-7 / Sund
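A rough Scala sketch of one way to start splitting such a column (not a full solution; it assumes segments are separated by "," or "/" and that each segment starts with a day or day-range token, which the real data may well violate):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val hoursDf = Seq(
      "Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm",
      "Mon-Sat 10am-8pm, Sun Closed"
    ).toDF("hours")

    // one row per day segment, then split each segment into a days part and a times part
    val exploded = hoursDf
      .withColumn("segment", explode(split(col("hours"), "[,/]")))
      .withColumn("segment", trim(col("segment")))
      .withColumn("days", regexp_extract(col("segment"), "^([A-Za-z-]+)", 1))
      .withColumn("times", trim(regexp_replace(col("segment"), "^([A-Za-z-]+)", "")))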

Spark Executor - jaas.conf with useTicketCache=true

2017-09-19 Thread Hugo Reinwald
Hi All, Apologies for cross-posting; I posted this on the Kafka user group as well. Is there any way that I can use the Kerberos ticket cache of the Spark executor for reading from secured Kafka? I am not 100% sure that the executor would do a kinit, but I presume it must, to be able to run code / read HDFS etc. as the user
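In case it helps, a sketch of the jaas.conf section I believe the question refers to (whether the executor actually has a usable ticket cache is exactly the open question here); such a file is typically shipped with --files and pointed to via -Djava.security.auth.login.config in spark.executor.extraJavaOptions:

    KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true;
    };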