Re: spark-submit in mesos cluster mode --jars option not working

2017-05-16 Thread Timothy Chen
Hi Satya, --jars doesn't work with local files because Mesos cluster mode doesn't upload or stage files automatically. For now you need to put these files in a location that the driver can access. Tim

spark-submit in mesos cluster mode --jars option not working

2017-05-16 Thread Satya Narayan1
Creating a new thread for this. Is anyone able to use --jars with spark-submit in Mesos cluster mode? We have tried a local file, an HDFS file, and a file served from an HTTP server; --jars didn't work with any of these approaches. I saw a couple of similar open questions with no answers.

Re: Spark <--> S3 flakiness

2017-05-16 Thread lucas.g...@gmail.com
Steve, thanks for the reply. Digging through all the documentation now. Much appreciated!

Re: Not able pass 3rd party jars to mesos executors

2017-05-16 Thread Satya Narayan1
Hi, is anyone able to use --jars with spark-submit in Mesos cluster mode? We have tried a local file, an HDFS file, and a file served from an HTTP server; --jars didn't work with any of these approaches. I saw a couple of similar open questions with no answers.

Re: KTable like functionality in structured streaming

2017-05-16 Thread Tathagata Das
DataFrames have the combined functionality of both KTable and KStreams, so I don't quite understand what you mean by querying a KTable. If you mean interactively querying a table, you can run an aggregation streaming query with the memory-format sink and complete output mode to query the result table interactively.
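For readers of the archive, a minimal sketch of the pattern Tathagata describes, assuming a SparkSession `spark` and a streaming DataFrame `ds` with a `key` column (all names hypothetical):

```scala
// Sketch only: an aggregation query written to the memory sink in complete
// output mode; the resulting in-memory table can be queried interactively.
val counts = ds.groupBy("key").count()   // aggregation, analogous to a KTable

val query = counts.writeStream
  .format("memory")                      // keep results in an in-memory table
  .queryName("ktable")                   // name of the table to query
  .outputMode("complete")                // emit the full updated table each trigger
  .start()

spark.sql("SELECT * FROM ktable").show() // interactive query over the latest state
```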

Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead
I'm having some difficulty reliably restoring a streaming job from a checkpoint. When restoring a streaming job constructed from the following snippet, I receive NullPointerExceptions when `map` is called on the restored RDD. lazy val ssc = StreamingContext.getOrCreate(checkpointDir,
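A common cause of such NullPointerExceptions (not confirmed from this truncated message, just a frequent pattern) is defining DStream transformations outside the factory function passed to getOrCreate; on restore, everything including the `map` must be re-created inside it. A sketch with a hypothetical source and checkpoint path:

```scala
// Sketch: all stream construction and transformations must live inside the
// factory function passed to StreamingContext.getOrCreate, so they are
// re-created correctly when restoring from a checkpoint.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoints"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("restorable-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
  lines.map(_.length).print()   // define the map here, not after getOrCreate
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```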

KTable like functionality in structured streaming

2017-05-16 Thread Stephen Fletcher
Are there any plans to add Kafka Streams KTable like functionality in structured streaming for kafka sources? Allowing querying keyed messages using spark sql,maybe calling KTables in the backend

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Michael Armbrust
I mean the actual Kafka client: org.apache.kafka : kafka-clients : 0.10.0.1
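In sbt coordinates (assuming an sbt build; the Maven equivalents use the same group/artifact/version), those dependencies would look like:

```scala
// build.sbt fragment: the plain Kafka client Michael refers to, plus the
// Structured Streaming Kafka source for Spark 2.1.
libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka-clients" % "0.10.0.1",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0"
)
```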

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread kant kodali
Hi Michael, Thanks for the catch. I assume you meant *spark-streaming-kafka-0-10_2.11-2.1.0.jar*. I added this on all Spark machines under SPARK_HOME/jars, but the same error persists. Is that the right jar, or is there anything else I need to add? Thanks!

Re: s3 bucket access/read file

2017-05-16 Thread jazzed
How did you solve the problem with V4? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/s3-bucket-access-read-file-tp23536p28688.html

Re: Restful API Spark Application

2017-05-16 Thread Debasish Das
You can run l

Cannot create parquet with snappy output for hive external table

2017-05-16 Thread Dhimant
Hi group, I am not able to load data into an external Hive table which is partitioned. Trace: 1. create external table test(id int, name string) stored as parquet location 'hdfs://testcluster/user/abc/test' tblproperties ('PARQUET.COMPRESS'='SNAPPY'); 2. Spark code: val spark =
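The snippet is cut off, but one detail worth noting: `PARQUET.COMPRESS` is a Hive table property, while Spark's own Parquet writer is controlled by a Spark conf. A hedged sketch of writing Snappy-compressed Parquet into the table location (paths and columns hypothetical):

```scala
// Sketch: control Parquet compression on the Spark side when writing into
// the external table's HDFS location.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("snappy-parquet").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

val df = spark.range(10).selectExpr("id", "cast(id as string) as name")
df.write
  .mode("append")
  .parquet("hdfs://testcluster/user/abc/test")
```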

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Michael Armbrust
Looks like you are missing the Kafka dependency.

Documentation on "Automatic file coalescing for native data sources"?

2017-05-16 Thread Daniel Siegmann
When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes, where it simply says ( http://spark.apache.org/releases/spark-release-2-0-0.html): "Automatic file
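For what it's worth, the packing of small files into partitions appears to be influenced by the file-source options introduced in Spark 2.0 (a sketch, assuming a SparkSession `spark`; the path is hypothetical):

```scala
// Knobs that influence how small files are packed into read partitions by
// the file-based data sources (Spark 2.0+).
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L) // 128 MB per partition
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)     // 4 MB per-file open cost

val df = spark.read.parquet("s3a://bucket/many-small-files/")   // hypothetical path
```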

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread kant kodali
Looks like I am getting the following runtime exception. I am using Spark 2.1.0 and the following jars *spark-sql_2.11-2.1.0.jar* *spark-sql-kafka-0-10_2.11-2.1.0.jar* *spark-streaming_2.11-2.1.0.jar* Exception in thread "stream execution thread for [id = fcfe1fa6-dab3-4769-9e15-e074af622cc1,

Spark Streaming 2.1 recovery

2017-05-16 Thread Dominik Safaric
Hi, currently I am exploring Spark’s fault-tolerance capabilities in terms of fault recovery. Namely, I run a Spark 2.1 standalone cluster with a master and four worker nodes. The application pulls data using the Kafka direct stream API from a Kafka topic over a (sliding) window of time, and

Re: How does preprocessing fit into Spark MLlib pipeline

2017-05-16 Thread Adrian Stern
Hey Liang and Yan, Been super busy, and just now getting back to this problem. I've been thinking a bit more about it, and it still feels like using the group by functionality even in a SQL transform is incorrect, and doesn't follow the transform pattern. It doesn't seem to fit the extractor
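One possible approach (a sketch, not the list's confirmed answer) to keep a group-by inside an ML Pipeline is a custom Transformer whose `transform` performs the aggregation; the column names here are hypothetical:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.StructType

// Sketch: a Transformer is not limited to per-row feature extraction; its
// transform can run any DataFrame operation, including a groupBy.
class GroupBySum(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("groupBySum"))

  override def transform(df: Dataset[_]): DataFrame =
    df.groupBy("userId").agg(sum("amount").as("totalAmount")) // hypothetical columns

  override def transformSchema(schema: StructType): StructType =
    new StructType()
      .add("userId", schema("userId").dataType)
      .add("totalAmount", schema("amount").dataType) // approximate: sum may widen the type

  override def copy(extra: ParamMap): GroupBySum = defaultCopy(extra)
}
```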

How to replay stream between 2 offsets?

2017-05-16 Thread ranjitreddy
I'm using Spark (2.1.1) Streaming as a consumer for Kafka messages. I have a data pipeline where I continuously (hourly) save offsets to HBase, so that I can replay the stream from a certain offset in case of errors in data transformations. I'm able to start the stream at a certain offset, but don't
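Since the message is cut off, here is one hedged pattern for a bounded replay between two saved offsets: a batch read over explicit OffsetRanges rather than a stream. Topic, partition, and offset values are hypothetical; assumes a SparkContext `sc` and the spark-streaming-kafka-0-10 artifact:

```scala
// Sketch: replay topic "events", partition 0, from offset 1000 (inclusive)
// to offset 2000 (exclusive) as a batch RDD.
import scala.collection.JavaConverters._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "replay")

val ranges = Array(OffsetRange("events", 0, fromOffset = 1000L, untilOffset = 2000L))
val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams.asJava, ranges, PreferConsistent)
rdd.map(_.value).foreach(println)
```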

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Shixiong(Ryan) Zhu
The default "startingOffsets" is "latest". If you don't push any data after starting the query, it won't fetch anything. You can set it to "earliest" like ".option("startingOffsets", "earliest")" to start the stream from the beginning.
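In context, the option Ryan describes sits on the source definition (a sketch; the servers and topic name are hypothetical):

```scala
// Sketch: a structured-streaming Kafka source reading from the beginning.
val ds = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical
  .option("subscribe", "mytopic")                      // hypothetical
  .option("startingOffsets", "earliest")               // default is "latest"
  .load()
```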

Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran
On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. Please don't, not without a committer specially written to work against S3 in the

RE: Spark SQL DataFrame to Kafka Topic

2017-05-16 Thread Revin Chalil
Thanks Michael, that worked, appreciate your help.

Spark streaming app leaking memory?

2017-05-16 Thread Srikanth
Hi, I have a Spark streaming(Spark 2.1.0) app where I see these logs in executor. Does this point to some memory leak? 17/05/16 15:11:13 WARN Executor: Managed memory leak detected; size = 67108864 bytes, TID = 7752 17/05/16 15:11:13 WARN Executor: Managed memory leak detected; size = 67108864

Re: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-16 Thread Steve Loughran
On 10 May 2017, at 13:40, Mendelson, Assaf wrote: > Hi all, When running spark I get the following warning: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java
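The warning is harmless unless native compression codecs are actually needed; a common workaround (not from this thread, just widely used) is to silence that logger in conf/log4j.properties:

```
# conf/log4j.properties
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
```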

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread kant kodali
This isn't Structured Streaming, right?

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Didac Gil
From what I know, you would have to iterate on each RDD. When you are reading from the stream, Spark actually collects the data as a mini-RDD for each period of time. I hope this helps. ds.foreachRDD { rdd => val newNames = Seq("Field1", "Field2", "Field3") val mydataDF = rdd.toDF(newNames: _*)
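A hedged completion of the truncated snippet (assuming `rdd` holds tuples so `toDF` applies; the field names are the hypothetical ones from the mail):

```scala
// Sketch: convert each micro-batch RDD to a DataFrame and print it.
import org.apache.spark.sql.SparkSession

ds.foreachRDD { rdd =>
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._
  val newNames = Seq("Field1", "Field2", "Field3")
  val mydataDF = rdd.toDF(newNames: _*)
  mydataDF.show()
}
```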

How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread kant kodali
Hi All, I have the following code. val ds = sparkSession.readStream() .format("kafka") .option("kafka.bootstrap.servers", bootstrapServers) .option("subscribe", topicName) .option("checkpointLocation", hdfsCheckPointDir)
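Since the snippet stops before the sink: in Structured Streaming, printing to the console is done with a `writeStream` in console format rather than foreachRDD (a sketch continuing from `ds`; note that `checkpointLocation` is normally a write-side option):

```scala
// Sketch: print a structured-streaming Kafka source to the console (Spark 2.1).
val query = ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```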

Re: Is GraphX really deprecated?

2017-05-16 Thread Sergey Zhemzhitsky
GraphFrames seems promising, but a lot of its algorithms still involve GraphX in one way or another, or run on top of GraphX, according to the GitHub repo ( https://github.com/graphframes/graphframes/tree/master/src/main/scala/org/graphframes/lib ), and in the case of RDDs and semistructured data