Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi, Cody's right. subscribe is a topic subscription strategy that accepts topic names as a comma-separated string, e.g. topic1,topic2,topic3 [1]. subscribepattern is a topic subscription strategy that uses Java's java.util.regex.Pattern as the regex for the topics to subscribe to.
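A minimal sketch of the two options (the broker address and topic names are made-up placeholders; spark is assumed to be an existing SparkSession):

    // Explicit topic names as a comma-separated string.
    val namedTopics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1,topic2,topic3")
      .load()

    // Or a Java regex matching the topic names.
    val patternTopics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribePattern", "topic.*")
      .load()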

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason. *Doesn't work*: outputDS1.writeStream().trigger(Trigger.processingTime(1000)).foreach(new KafkaSink("hello1")).start().awaitTermination() outputDS2.writeStream().trigger(Trigger.processingTime(1000)).foreach(new

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, Ah, right! Start the queries and, once they're all running, call awaitTermination on them. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2
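A minimal sketch of that ordering, reusing outputDS1/outputDS2 and the KafkaSink ForeachWriter from the pseudo code earlier in the thread (both are assumed from that code, not shown here):

    import org.apache.spark.sql.streaming.Trigger

    // Start both queries first; neither blocks the other from starting.
    val query1 = outputDS1.writeStream
      .trigger(Trigger.ProcessingTime(1000))
      .foreach(new KafkaSink("hello1"))
      .start()
    val query2 = outputDS2.writeStream
      .trigger(Trigger.ProcessingTime(1000))
      .foreach(new KafkaSink("hello2"))
      .start()

    // Only block once both queries are running.
    query1.awaitTermination()
    query2.awaitTermination()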

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, What's the code in readFromKafka to read from hello2 and hello1? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason. Doesn't work On Tue, Sep 19, 2017 at 1:54 PM, kant kodali wrote: > Hi All, > > I have the following pseudo code (I could paste the real code however it > is pretty long and involves Database

RE: Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Sudha KS
To set Spark2 as default, refer to https://www.cloudera.com/documentation/spark2/latest/topics/spark2_admin.html#default_tools -Original Message- From: Gaurav1809 [mailto:gauravhpan...@gmail.com] Sent: Wednesday, September 20, 2017 9:16 AM To: user@spark.apache.org Subject: Cloudera - How

Cloudera - How to switch to the newly added Spark service (Spark2) from Spark 1.6 in CDH 5.12

2017-09-19 Thread Gaurav1809
Hello all, I downloaded CDH and it comes with Spark 1.6. As per the step-by-step guide, I added Spark 2 to the services list. Now I can see both Spark 1.6 & Spark 2, but when I run spark-shell in a terminal window, it starts with Spark 1.6 only. How do I switch to Spark 2? What all _HOMEs or

unsubscribe

2017-09-19 Thread shshann

Structured streaming coding question

2017-09-19 Thread kant kodali
Hi All, I have the following pseudo code (I could paste the real code however it is pretty long and involves database calls inside a dataset.map operation and so on), so I am just trying to simplify my question. I would like to know if there is something wrong with the following pseudo code? DataSet
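A rough sketch of the pipeline shape being described, with a placeholder standing in for the real per-record database call (the broker address, topic name, and lookup function are assumptions, not the poster's actual code):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("sketch").getOrCreate()
    import spark.implicits._

    val inputDS = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "hello1")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // Stand-in for the per-record database call mentioned above.
    def enrichFromDatabase(v: String): String = v

    val outputDS = inputDS.map(enrichFromDatabase _)

    val query = outputDS.writeStream
      .trigger(Trigger.ProcessingTime(1000))
      .format("console")   // the real code uses a custom ForeachWriter
      .start()

    query.awaitTermination()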

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Cody Koeninger
You should be able to pass a comma-separated string of topics to subscribe. subscribePattern isn't necessary. On Tue, Sep 19, 2017 at 2:54 PM, kant kodali wrote: > got it! Sorry. > > On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: >> >> Hi, >> >>

Re: SVD computation limit

2017-09-19 Thread Vadim Semenov
This may also be related to https://issues.apache.org/jira/browse/SPARK-22033 On Tue, Sep 19, 2017 at 3:40 PM, Mark Bittmann wrote: > I've run into this before. The EigenValueDecomposition creates a Java > Array with 2*k*n elements. The Java Array is indexed with a native

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
got it! Sorry. On Tue, Sep 19, 2017 at 12:52 PM, Jacek Laskowski wrote: > Hi, > > Use subscribepattern > > You haven't googled well enough --> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html :) > > Pozdrawiam, > Jacek

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
Hi, Use subscribepattern You haven't googled well enough --> https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+)

How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
Hi All, I am wondering how to read from multiple Kafka topics using structured streaming (code below)? I googled prior to asking this question and I see responses related to DStreams but not structured streams. Is it possible to read multiple topics using the same Spark structured stream?

Re: SVD computation limit

2017-09-19 Thread Mark Bittmann
I've run into this before. The EigenValueDecomposition creates a Java Array with 2*k*n elements. The Java Array is indexed with a native integer type, so 2*k*n cannot exceed Integer.MAX_VALUE values. The array is created here: https://github.com/apache/spark/blob/master/mllib/src/
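Plugging in the sizes from the original question (k=49865, n=191077) shows why this overflows a Java array index; a quick check:

    // 2*k*n must fit in a signed 32-bit Int to index a Java array.
    val k = 49865L
    val n = 191077L
    val required = 2L * k * n            // 19,056,109,210 elements
    println(required > Int.MaxValue)     // true: far beyond 2,147,483,647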

Re: Question on partitionColumn for a JDBC read using a timestamp from MySql

2017-09-19 Thread lucas.g...@gmail.com
I ended up doing this for the time being. It works, but I *think* that timestamp seems like a rational partitionColumn and I'm wondering if there's a more built-in way: df = spark.read.jdbc(url=os.environ["JDBCURL"], table="schema.table", predicates=predicates)
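For comparison, a sketch of the same predicates-based read in Scala (the timestamp column name, date ranges, and environment variable names are invented placeholders; spark is an existing SparkSession):

    import java.util.Properties

    // Each predicate becomes the WHERE clause of one partition's query.
    val predicates = Array(
      "created_at >= '2017-01-01' AND created_at < '2017-04-01'",
      "created_at >= '2017-04-01' AND created_at < '2017-07-01'",
      "created_at >= '2017-07-01' AND created_at < '2017-10-01'"
    )

    val props = new Properties()
    props.setProperty("user", sys.env("JDBC_USER"))
    props.setProperty("password", sys.env("JDBC_PASSWORD"))

    val df = spark.read.jdbc(sys.env("JDBCURL"), "schema.table", predicates, props)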

Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Have you tried to cache? Maybe after the collect() and before the map? > On Sep 19, 2017, at 7:20 AM, Daniel O' Shaughnessy > wrote: > > Thanks for your response Jean. > > I managed to figure this out in the end but it's an extremely slow solution > and not

SVD computation limit

2017-09-19 Thread Alexander Ovcharenko
Hello guys, While trying to compute SVD using the computeSVD() function, I am getting the following warning with the follow-up exception: 17/09/14 12:29:02 WARN RowMatrix: computing svd with k=49865 and n=191077, please check necessity IllegalArgumentException: u'requirement failed: k = 49865 and/or

Re: Nested RDD operation

2017-09-19 Thread ayan guha
How big is the list of fruits in your example? Can you broadcast it? On Tue, 19 Sep 2017 at 9:21 pm, Daniel O' Shaughnessy < danieljamesda...@gmail.com> wrote: > Thanks for your response Jean. > > I managed to figure this out in the end but it's an extremely slow > solution and not tenable for
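A minimal sketch of the broadcast approach (the fruit list and its contents are invented for illustration; dfWithSchema and the event_name column come from the thread):

    // Ship the small lookup set to every executor once instead of
    // touching it from inside a nested RDD operation.
    val fruits = Set("apple", "banana", "orange")
    val fruitsBc = spark.sparkContext.broadcast(fruits)

    val matched = dfWithSchema.select("event_name").rdd
      .map(_.getString(0))
      .filter(name => fruitsBc.value.contains(name))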

Uses of avg hash probe metric in HashAggregateExec?

2017-09-19 Thread Jacek Laskowski
Hi, I've just noticed the new "avg hash probe" metric for the HashAggregateExec operator [1]. My understanding (after briefly going through the code in the change and around it) is as follows: Average hash map probe per lookup (i.e. `numProbes` / `numKeyLookups`) NOTE: `numProbes` and `numKeyLookups`

Re: Nested RDD operation

2017-09-19 Thread Daniel O' Shaughnessy
Thanks for your response Jean. I managed to figure this out in the end but it's an extremely slow solution and not tenable for my use-case: val rddX = dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim replaceAll ("[\\[\\]\"]", "")).toList) //val oneRow =

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-19 Thread HARSH TAKKAR
Thanks Cody, It worked for me by keeping the number of executors, each having 1 core, equal to the number of Kafka partitions. On Mon, Sep 18, 2017 at 8:47 PM Cody Koeninger wrote: > Have you searched in jira, e.g. > > https://issues.apache.org/jira/browse/SPARK-19185 > > On Mon, Sep 18,
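A sketch of the configuration that workaround implies (the partition count of 12 is made up; match it to the actual topic):

    import org.apache.spark.SparkConf

    // One core per executor, and as many executors as Kafka partitions,
    // so a cached Kafka consumer is never shared by two tasks at once.
    val conf = new SparkConf()
      .set("spark.executor.instances", "12")
      .set("spark.executor.cores", "1")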

Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-19 Thread Aakash Basu
Hi, I have a CSV dataset which has a column with all the details of store open and close timings per date, but the data is highly variant, for example:
Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm
Mon-Sat 10am-8pm, Sun Closed
Mon-Sat 10am-8pm, Sun 10am-6pm
Mon-Friday 9-8 / Saturday 10-7 /
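As a hedged starting point (the column name "hours" and the DataFrame df are assumptions, not from the original data), the cell could first be split into its day-range/time chunks; the highly variant formats would still need per-pattern handling afterwards:

    import org.apache.spark.sql.functions.{col, split}

    // Break "Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm" into an array column,
    // one element per day-range/time pair.
    val parts = df.withColumn("hours_parts", split(col("hours"), ",\\s*"))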

Spark Executor - jaas.conf with useTicketCache=true

2017-09-19 Thread Hugo Reinwald
Hi All, Apologies for cross posting; I posted this on the Kafka user group. Is there any way that I can use the Kerberos ticket cache of the Spark executor for reading from secured Kafka? I am not 100% sure that the executor would do a kinit, but I presume so, to be able to run code / read HDFS etc. as the user
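For reference, a generic jaas.conf entry of the kind being discussed looks roughly like this (a minimal sketch, not taken from the poster's setup):

    KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true
      doNotPrompt=true;
    };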