[Beginner][StructuredStreaming] Console sink is not working as expected

2018-05-22 Thread karthikjay
I have the following code to read and process Kafka data using Structured Streaming object ETLTest { case class record(value: String, topic: String) def main(args: Array[String]): Unit = { run(); } def run(): Unit = {

RE: spark sql in-clause problem

2018-05-22 Thread Shiva Prashanth Vallabhaneni
Assuming the list of values in the “IN” clause is small, you could try using sparkSqlContext.sql(select * from mytable where key = 1 and ( (X,Y) = (1,2) OR (X,Y) = (3,4) ) Another solution could be to load the possible values for X & Y into a table and then using this table in the sub-query;

Re: testing frameworks

2018-05-22 Thread umargeek
Hi Steve, you can try out pytest-spark plugin if your writing programs using pyspark ,please find below link for reference. https://github.com/malexer/pytest-spark Thanks, Umar -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Alternative for numpy in Spark Mlib

2018-05-22 Thread umargeek
Hi Folks, I am planning to rewrite one of my python module written for entropy calculation using numpy into Spark Mlib so that it can be processed in distributed manner. Can you please advise on the possibilities of the same approach or any alternatives. Thanks, Umar -- Sent from:

spark sql in-clause problem

2018-05-22 Thread onmstester onmstester
I'm reading from this table in cassandra: Table mytable ( Integer Key, Integer X, Interger Y Using: sparkSqlContext.sql(select * from mytable where key = 1 and (X,Y) in ((1,2),(3,4))) Encountered error: StructType(StructField((X,IntegerType,true),StructField((Y,IntegerType,true))

How to validate orc vectorization is working within spark application?

2018-05-22 Thread umargeek
Hi Folks, I have enabled below listed configurations within my spark streaming application but I did not gain performance benefit even after setting these parameters ,can you please help me is there a way to validate whether vectorization is working as expeced/enabled correctly ! Note: I am

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-22 Thread Tathagata Das
Just to be clear, these screenshots are about the memory consumption of the driver. So this is nothing to do with streaming aggregation state which are kept in the memory of the executors, not the driver. On Tue, May 22, 2018 at 10:21 AM, Jungtaek Lim wrote: > 1. Could you

Re: Spark driver pod eviction Kubernetes

2018-05-22 Thread Anirudh Ramanathan
I think a pod disruption budget might actually work here. It can select the spark driver pod using a label. Using that with a minAvailable value that's appropriate here could do it. In a more general sense, we do plan on some future work to support driver recovery which should help long running

Re: [EXTERNAL] - Re: testing frameworks

2018-05-22 Thread Joel D
We’ve developed our own version of testing framework consisting of different areas of checking, sometimes providing expected data and comparing with the resultant data from the data object. Cheers. On Tue, May 22, 2018 at 1:48 PM Steve Pruitt wrote: > Something more on

Re: Spark UNEVENLY distributing data

2018-05-22 Thread Saad Mufti
I think TableInputFormat will try to maintain as much locality as possible, assigning one Spark partition per region and trying to assign that partition to a YARN container/executor on the same node (assuming you're using Spark over YARN). So the reason for the uneven distribution could be that

RE: [EXTERNAL] - Re: testing frameworks

2018-05-22 Thread Steve Pruitt
Something more on the lines of integration I believe. Run one or more Spark jobs and verify the output results. If this makes sense. I am very new to the world of Spark. We want to include pipeline testing from the get go. I will check out spark-testing-base. Thanks. From: Holden Karau

Re: Encounter 'Could not find or load main class' error when submitting spark job on kubernetes

2018-05-22 Thread Marcelo Vanzin
On Tue, May 22, 2018 at 12:45 AM, Makoto Hashimoto wrote: > local:///usr/local/oss/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar Is that the path of the jar inside your docker image? The default image puts that in /opt/spark IIRC. -- Marcelo

Spark driver pod eviction Kubernetes

2018-05-22 Thread purna pradeep
Hi, What would be the recommended approach to wait for spark driver pod to complete the currently running job before it gets evicted to new nodes while maintenance on the current node is goingon (kernel upgrade,hardware maintenance etc..) using drain command I don’t think I can use

Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-22 Thread Bowden, Chris
You can delete the write ahead log directory you provided to the sink via the “checkpointLocation” option. From: karthikjay Sent: Tuesday, May 22, 2018 7:24:45 AM To: user@spark.apache.org Subject: [structured-streaming]How to reset Kafka

[structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-22 Thread karthikjay
I have the following readstream in Spark structured streaming reading data from Kafka val kafkaStreamingDF = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("subscribe", "testtopic") .option("failOnDataLoss", "false")

Re: Spark Worker Re-register to Master

2018-05-22 Thread sushil.chaudhary
Can anyone please have a look and put thoughts here.. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

problem with saving RandomForestClassifier model - Saprk Java

2018-05-22 Thread Donni Khan
Hi SPark users, I built Random forest model by using Spark 1.6 with Java. I'm getting the following exception: User class threw exception: java.lang.UnsupportedOperationException: Pipeline write will fail on this Pipeline because it contains a stage which does not implement Writable. Does

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-22 Thread Jungtaek Lim
1. Could you share your Spark version? 2. Could you reduce "spark.sql.ui.retainedExecutions" and see whether it helps? This configuration is available in 2.3.0, and default value is 1000. Thanks, Jungtaek Lim (HeartSaVioR) 2018년 5월 22일 (화) 오후 4:29, weand 님이 작성: > You

Encounter 'Could not find or load main class' error when submitting spark job on kubernetes

2018-05-22 Thread Makoto Hashimoto
Hi, I am trying to run spark job on kubernetes. Using local spark job works fine as follows: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[4] examples/jars/spark-examples_2.11-2.3.0.jar 100 .. 2018-05-20 21:49:02 INFO DAGScheduler:54 - Job 0 finished: reduce

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-22 Thread weand
You can see it even better on this screenshot: TOP Entries Collapsed #2 Sorry for the spam, attached a not so perfect screen in the mail before. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-22 Thread weand
Instances of org.apache.spark.sql.execution.ui.SparkPlanGraphWrapper are not cleaned up, see TOP Entries Collapsed #2: TOP Entries All TOP Entries Collapsed #1