Re: Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-11 Thread Gourav Sengupta
Saisai, thanks a ton :) Regards, Gourav On Mon, Sep 11, 2017 at 11:36 PM, Xiaoye Sun wrote: > Hi Jerry, > > This solves my problem, thanks. > > On Sun, Sep 10, 2017 at 8:19 PM Saisai Shao wrote: > >> I guess you're using Capacity Scheduler

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Gourav Sengupta
Hi Matthew, I have read close to 3 TB of data in Parquet format without any issues in EMR. A few questions: 1. What is the EMR version that you are using? 2. How many partitions do you have? 3. How many fields do you have in the table? Are you reading all of them? 4. Is there a way that you can

Re: Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
I can actually compile the following code with any one of these jars, but none of them seem to print the messages to the console; however, when I use kafka-console-consumer with the same hello topic I can see messages. When I run my Spark code it just hangs here forever even when I continue producing

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
Which version of spark? On Mon, Sep 11, 2017 at 8:27 PM, 张万新 wrote: > Thanks for reply, but using this method I got an exception: > > "Exception in thread "main" > org.apache.spark.sql.streaming.StreamingQueryException: > nondeterministic expressions are only allowed in

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread 张万新
Thanks for reply, but using this method I got an exception: "Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window" Can you give more advice? Michael Armbrust

unable to read from Kafka (very strange)

2017-09-11 Thread kant kodali
Hi All, I started using spark 2.2.0 very recently and now I can't even get the json data from Kafka out to console. I have no clue what's happening. This was working for me when I was using 2.1.1 Here is my code StreamingQuery query = sparkSession.readStream() .format("kafka")

Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
Hi All, Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0? kafka-clients-0.10.0.1.jar spark-streaming-kafka-0-10_2.11-2.2.0.jar 1) Above two are the only Kafka related jars or am I missing something? 2) What is the difference between the above two jars? 3) If I have

Re: Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-11 Thread Xiaoye Sun
Hi Jerry, This solves my problem.  thanks On Sun, Sep 10, 2017 at 8:19 PM Saisai Shao wrote: > I guess you're using Capacity Scheduler with DefaultResourceCalculator, > which doesn't count cpu cores into resource calculation, this "1" you saw > is actually meaningless.
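For reference, a minimal sketch of the capacity-scheduler.xml change usually suggested in this situation, assuming you want vcores counted when sizing containers (the property and class names are standard YARN settings; the file location depends on your distribution):

    <!-- capacity-scheduler.xml: replace DefaultResourceCalculator so that
         vcores are included in the resource calculation -->
    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>

The ResourceManager needs to pick up the new configuration (typically via a restart) before the reported vcore counts become meaningful.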

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-11 Thread Matthew Anthony
any other feedback on this? On 9/8/17 11:00 AM, Neil Jonkers wrote: Can you provide a code sample please? On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote: Hi all - since upgrading to 2.2.0, we've noticed a significant

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
I agree, Java tends to be pretty verbose unfortunately. You can check the "alternate" approach that should be more compact and readable. Should be something like: df.select(to_json(struct(col("*"))).alias("value")) Of course to_json, struct and col are from the

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-11 Thread Michael Armbrust
The following will convert the whole row to JSON. import org.apache.spark.sql.functions.* df.select(to_json(struct(col("*")))) On Sat, Sep 9, 2017 at 6:27 PM, kant kodali wrote: > Thanks Ryan! In this case, I will have Dataset so is there a way to > convert Row to Json
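A self-contained Java sketch of this approach, since the thread is asking for Java; the Person bean and column values are made up purely for illustration:

    import static org.apache.spark.sql.functions.*;

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RowToJson {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("row-to-json")
            .getOrCreate();

        // Hypothetical input with two columns, only used to demonstrate the expression.
        Dataset<Row> df = spark.createDataFrame(
            Arrays.asList(new Person("alice", 30), new Person("bob", 25)), Person.class);

        // Pack every column into a struct and serialize that struct as a JSON string.
        Dataset<Row> json = df.select(to_json(struct(col("*"))).alias("value"));
        json.show(false);

        spark.stop();
      }

      // Simple bean used only to build the example DataFrame.
      public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
      }
    }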

Re: Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread Michael Armbrust
Checkpoints record what has been processed for a specific query, and as such only need to be defined when writing (which is how you "start" a query). You can use the DataFrame created with readStream to start multiple queries, so it wouldn't really make sense to have a single checkpoint there.
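A minimal Java sketch of that point, assuming a Kafka source with hypothetical broker, topic and path names: one DataFrame built with readStream feeds two independent queries, and each query carries its own checkpointLocation.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class CheckpointPerQuery {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("checkpoint-per-query")
            .getOrCreate();

        // Single source definition; no checkpoint is set here.
        Dataset<Row> source = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "hello")
            .load();

        // Query 1: write to the console with its own checkpoint directory.
        StreamingQuery toConsole = source.writeStream()
            .format("console")
            .option("checkpointLocation", "/tmp/checkpoints/console-query")
            .start();

        // Query 2: write to Parquet with a different checkpoint directory.
        StreamingQuery toParquet = source.selectExpr("CAST(value AS STRING) AS value")
            .writeStream()
            .format("parquet")
            .option("path", "/tmp/output/kafka-values")
            .option("checkpointLocation", "/tmp/checkpoints/parquet-query")
            .start();

        spark.streams().awaitAnyTermination();
      }
    }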

Efficient Spark-Submit planning

2017-09-11 Thread Aakash Basu
Hi, Can someone please clarify how we should effectively calculate the parameters to be passed via spark-submit? Parameters as in - Cores, NumExecutors, DriverMemory, etc. Is there any generic calculation which can be done over most kinds of clusters with different sizes from

Re: [SS]How to add a column with custom system time?

2017-09-11 Thread Michael Armbrust
import org.apache.spark.sql.functions._ df.withColumn("window", window(current_timestamp(), "15 minutes")) On Mon, Sep 11, 2017 at 3:03 AM, 张万新 wrote: > Hi, > > In structured streaming how can I add a column to a dataset with current > system time aligned with 15
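A small Java sketch showing the same expression end to end; the window column is a struct whose start field is the current time aligned down to the 15-minute boundary (the input DataFrame here is a throwaway batch one, since only the new columns matter):

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AlignedTimeColumn {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("aligned-time-column")
            .getOrCreate();

        Dataset<Row> df = spark.range(3).toDF();

        Dataset<Row> result = df
            // window(...) yields a struct<start, end> column
            .withColumn("window", window(current_timestamp(), "15 minutes"))
            // the aligned system time itself is the window's start field
            .withColumn("aligned_time", col("window.start"));

        result.show(false);
        spark.stop();
      }
    }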

Re: How to convert Row to JSON in Java?

2017-09-11 Thread kant kodali
getValuesMap is not very Java friendly. I have to do something like this String[] fieldNames = row.schema().fieldNames(); Seq fieldNamesSeq = JavaConverters.asScalaIteratorConverter(Arrays.asList(fieldNames).iterator()).asScala().toSeq(); String json = row.getValuesMap(fieldNamesSeq).toString();

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Burak, thanks for the resources. I was thinking that the trigger interval and the sliding window were the same thing, but now I am confused. I didn't know there was a .trigger() method, since the official Programming Guide

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Burak Yavuz
Hi Eduardo, What you have written will output counts "as fast as possible" for windows of 5-minute length with a sliding interval of 1 minute. So a record at 10:13 would be included in the counts for 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17, and 10:13-10:18.
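A minimal Java sketch of the distinction, assuming Spark 2.2's rate source as a stand-in for the Kafka source used in the gist: the groupBy defines the 5-minute window sliding by 1 minute, while the trigger controls how often results are emitted.

    import static org.apache.spark.sql.functions.*;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.Trigger;

    public class SlidingWindowCounts {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("sliding-window-counts")
            .getOrCreate();

        // Stand-in source that emits a "timestamp" column; in the thread the real source is Kafka.
        Dataset<Row> events = spark.readStream()
            .format("rate")
            .option("rowsPerSecond", "5")
            .load();

        // Count records per 5-minute window, sliding every 1 minute.
        Dataset<Row> counts = events
            .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
            .count();

        // Without a trigger, results are produced as fast as possible;
        // this trigger asks for one emission per minute instead.
        counts.writeStream()
            .outputMode("complete")
            .format("console")
            .trigger(Trigger.ProcessingTime("1 minute"))
            .start()
            .awaitTermination();
      }
    }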

[Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Hi, I'm trying to use Spark 2.1.1 structured streaming to *count the number of records* from Kafka *for each time window* with the code in this GitHub gist. I expected that, *once each minute* (the slide duration), it would

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-11 Thread Cody Koeninger
If you want an "easy" but not particularly performant way to do it, each org.apache.kafka.clients.consumer.ConsumerRecord has a topic. The topic is going to be the same for the entire partition as long as you haven't shuffled, hence the examples on how to deal with it at a partition level. On
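A Java sketch of that record-level approach with the spark-streaming-kafka-0-10 direct stream; broker address, group id and topic names are placeholders:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class TopicPerRecord {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("topic-per-record");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "topic-per-record-example");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("topicA", "topicB"), kafkaParams));

        // Each ConsumerRecord carries its topic. Within one Kafka partition the topic
        // is constant until a shuffle, so handling it per partition avoids per-record branching.
        stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
          while (records.hasNext()) {
            ConsumerRecord<String, String> record = records.next();
            System.out.println(record.topic() + " -> " + record.value());
          }
        }));

        jssc.start();
        jssc.awaitTermination();
      }
    }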

Re:

2017-09-11 Thread Gourav Sengupta
Hi, Why SPARK 1.5? Let me guess, are you using JAVA as well? The world has moved on mate. Regards, Gourav On Fri, Sep 8, 2017 at 8:36 AM, PICARD Damien wrote: > Hi ! > > > > I’m facing a Classloader problem using Spark 1.5.1 > > > > I use javax.validation and

Re: CSV write to S3 failing silently with partial completion

2017-09-11 Thread Gourav Sengupta
Hi, Can you please let me know the following: 1. Why are you using JAVA? 2. The way you are creating the SPARK cluster 3. The way you are initiating SPARK session or context 4. Are you able to query the data that is written to S3 using a SPARK dataframe and validate that the number of rows in the

Bayesian network with Spark

2017-09-11 Thread Md. Rezaul Karim
Hi All, I am planning to use a Bayesian network to integrate and infer the links between miRNA and proteins based on their expression. Is there any implementation of Bayesian networks in Spark that I can adapt to feed my data? Regards, Md. Rezaul

Re: Quick Start Guide Syntax Error (Python)

2017-09-11 Thread larister
Hi, I just ran into this as well. I actually fixed it by using `alias` instead. Did you submit a PR?

Re: Why do checkpoints work the way they do?

2017-09-11 Thread Dmitry Naumenko
+1 from me on this question. If there are any constraints on restoring a checkpoint in Structured Streaming, they should be documented. 2017-08-31 9:20 GMT+03:00 张万新: > So are there any documents demonstrating under what conditions my > application can recover from the same

How does spark work?

2017-09-11 Thread 陈卓
Hi, I'm a newbie. My Spark cluster has 5 machines with 16G of memory each, but my data may be more than 900G; the source may be HDFS or MongoDB. I want to know how this 900G of data can fit into the cluster's memory when I only have 80G in total. How does Spark work?

[SS]How to add a column with custom system time?

2017-09-11 Thread 张万新
Hi, In structured streaming how can I add a column to a dataset with current system time aligned with 15 minutes? Thanks.

Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread kant kodali
Hi All, I was wondering if we need to checkpoint both read and write streams when reading from Kafka and inserting into a target store? for example sparkSession.readStream().option("checkpointLocation", "hdfsPath").load() vs dataSet.writeStream().option("checkpointLocation", "hdfsPath")

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
Hi Ayan, yup that works very well, however I believe Kant's other mail "Queries with streaming sources must be executed with writeStream.start()" adds more context. I think he is trying to leverage structured streaming, and applying the RDD conversion to a streaming dataset is breaking the

ClassNotFoundException while unmarshalling a remote RDD on Spark 1.5.1

2017-09-11 Thread PICARD Damien
Hi ! I'm facing a Classloader problem using Spark 1.5.1 I use javax.validation and hibernate validation annotations on some of my beans : @NotBlank @Valid private String attribute1 ; @Valid private String attribute2 ; When Spark tries to unmarshall these beans (after a remote RDD),