Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Paul
You would set the Kafka topic as your data source and write a custom output to Cassandra; everything could be contained within your stream. -Paul Sent from my iPhone > On Sep 8, 2017, at 2:52 PM, kant kodali wrote: > > How can I use one SparkSession
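A minimal sketch of Paul's suggestion: one SparkSession reading from Kafka and writing each micro-batch out to Cassandra. Note the foreachBatch API used here only exists from Spark 2.4 on (on 2.2, the version in these threads, the equivalent was a custom ForeachWriter), and it assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath; broker, topic, keyspace, and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

# Kafka as the single streaming source.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "events")                     # hypothetical topic
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

def write_to_cassandra(batch_df, batch_id):
    # Reuse the batch-mode Cassandra writer for each micro-batch.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="ks", table="events")  # hypothetical keyspace/table
     .mode("append")
     .save())

query = stream.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```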

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-08 Thread Dan Dong
Hi, Alonso. Thanks! I've read about this but did not quite understand it. Picking out the topic name of a Kafka message seems like a simple task, but the example code looks complicated, with redundant info. Why do we need offsetRanges here, and is there an easier way to achieve this? Cheers, Dan
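For what it's worth, with the Structured Streaming Kafka source the topic name arrives as an ordinary column, so no offset bookkeeping is needed to recover it; the offsetRanges pattern comes from the older DStream API, where each RDD is cast to HasOffsetRanges for topic/partition metadata. A sketch, assuming an existing SparkSession named spark and hypothetical broker/topic names:

```python
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
      .option("subscribe", "topicA,topicB")              # two topics, one stream
      .load())

# Each row carries its topic, partition, and offset alongside key/value:
per_topic = df.selectExpr("topic", "CAST(value AS STRING) AS value")
```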

Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-09-08 Thread Karthik Palaniappan
For posterity, I found the root cause and filed a JIRA: https://issues.apache.org/jira/browse/SPARK-21960. I plan to open a pull request with the minor fix. From: Karthik Palaniappan Sent: Friday, September 1, 2017 9:49 AM To: Akhil Das Cc: user@spark.apache.org;

Multiple vcores per container when running Spark applications in Yarn cluster mode

2017-09-08 Thread Xiaoye Sun
Hi, I am using Spark 1.6.1 and YARN 2.7.4. I want to submit a Spark application to a YARN cluster. However, I found that the number of vcores assigned to a container/executor is always 1, even if I set spark.executor.cores=2. I also found that the number of tasks an executor runs concurrently is 2.
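One commonly suggested explanation, not confirmed in this thread: YARN's CapacityScheduler uses DefaultResourceCalculator by default, which accounts only for memory, so every container reports 1 vcore regardless of spark.executor.cores, while the executor still runs the requested number of concurrent tasks (matching the "2 tasks" observation). Switching the scheduler to DominantResourceCalculator makes vcores show up; a sketch of the capacity-scheduler.xml change:

```xml
<!-- capacity-scheduler.xml: account for CPU as well as memory when sizing
     containers. A commonly suggested fix, not confirmed in this thread. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```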

Re: CSV write to S3 failing silently with partial completion

2017-09-08 Thread Steve Loughran
On 7 Sep 2017, at 18:36, Mcclintic, Abbi wrote: Thanks all – a couple of notes below. Generally all our partitions are of equal size (i.e. on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread kant kodali
How can I use one SparkSession to talk to both Kafka and Cassandra, say? On Fri, Sep 8, 2017 at 3:46 AM, Arkadiusz Bicz wrote: > You don't need multiple spark sessions to have more than one stream > working, but from maintenance and reliability perspective it is

SPARK CSV ISSUE

2017-09-08 Thread Gourav Sengupta
Hi, According to this thread, https://issues.apache.org/jira/browse/SPARK-11374, Spark will not resolve the issue of the skip-header option when the table is defined in Hive. But I am unable to see a Spark SQL option for setting up an external partitioned table. Does that mean in case I have to
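One possible workaround, sketched under the assumption that the header only needs skipping at load time: read the CSV with the DataFrame reader, which does honor the header option, and persist the result as a partitioned external table by giving saveAsTable an explicit path. Paths, database, table, and column names are hypothetical, and the table ends up in Spark's default format (Parquet) rather than CSV.

```python
df = (spark.read
      .option("header", "true")         # drops the header row at read time
      .option("inferSchema", "true")
      .csv("s3://bucket/raw/"))         # hypothetical source location

(df.write
 .partitionBy("dt")                     # hypothetical partition column
 .option("path", "s3://bucket/table/")  # explicit path makes the table external
 .saveAsTable("mydb.events"))           # hypothetical database.table
```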

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Matthew Anthony
The code is as simple as calling `data = spark.read.parquet(address)`. I can't give you the actual address I'm reading from for security reasons. Is there something else I can provide? We're using standard EMR images with Hive and Spark installed. On 9/8/17 11:00 AM, Neil Jonkers wrote: Can

Re: [Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Neil Jonkers
Can you provide a code sample please? On Fri, Sep 8, 2017 at 5:44 PM, Matthew Anthony wrote: > Hi all - > > > since upgrading to 2.2.0, we've noticed a significant increase in > read.parquet(...) ops. The parquet files are being read from S3. Upon entry > at the interactive

[Spark Core] excessive read/load times on parquet files in 2.2 vs 2.0

2017-09-08 Thread Matthew Anthony
Hi all - since upgrading to 2.2.0, we've noticed a significant increase in the time taken by read.parquet(...) ops. The parquet files are being read from S3. Upon entry at the interactive terminal (pyspark in this case), the terminal will sit "idle" for several minutes (as many as 10) before returning:
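Not a confirmed diagnosis for this thread, but two common contributors to a long pause before read.parquet(...) returns against S3 are object-store directory listing during partition discovery and per-file footer reads during schema inference. One quick experiment is to supply the schema up front so footers are not consulted at load time; path and schema are hypothetical.

```python
from pyspark.sql.types import StructType, LongType, StringType

# Hypothetical schema standing in for the real table layout.
schema = (StructType()
          .add("id", LongType())
          .add("payload", StringType()))

# With an explicit schema, Spark skips schema inference over the file footers.
data = spark.read.schema(schema).parquet("s3://bucket/path/")  # hypothetical path
```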

CVE-2017-12612 Unsafe deserialization in Apache Spark launcher API

2017-09-08 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Versions of Apache Spark from 1.6.0 until 2.1.1 Description: In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe deserialization of data received by its socket. This makes applications launched

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Sunita Arvind
Thanks for your response, Praneeth. We did consider Kafka; however, cost was the only hold-back factor, as we might need a larger cluster, and the existing cluster is on premise while my app is in the cloud, so the same cluster cannot be used. But I agree, it does sound like a good alternative. Regards, Sunita

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Arkadiusz Bicz
You don't need multiple Spark sessions to have more than one stream working, but from a maintenance and reliability perspective it is not a good idea. On Thu, Sep 7, 2017 at 2:40 AM, kant kodali wrote: > Hi All, > > I am wondering if it is ok to have multiple sparksession's in
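A minimal sketch of the first point: one SparkSession driving two independent streaming queries at once. Broker and topic names are hypothetical, and the Kafka source assumes the spark-sql-kafka package is on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-stream").getOrCreate()

def kafka_stream(topic):
    # One helper, many streams -- all off the same session.
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
            .option("subscribe", topic)
            .load())

q1 = kafka_stream("orders").writeStream.format("console").start()  # hypothetical topic
q2 = kafka_stream("clicks").writeStream.format("console").start()  # hypothetical topic

spark.streams.awaitAnyTermination()  # both queries share the one session
```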

Re: graphframe out of memory

2017-09-08 Thread Imran Rajjad
No, I did not. I thought Spark would take care of that itself, since I had put in the arguments. On Thu, Sep 7, 2017 at 9:26 PM, Lukas Bradley wrote: > Did you also increase the size of the heap of the Java app that is > starting Spark? > >
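A common pitfall worth noting here, though not a confirmed diagnosis for this thread: the driver heap is fixed when its JVM starts, so spark.driver.memory set from inside the application has no effect in client mode and must be passed at launch time, while executor memory can still be set in-app because executors start later. A sketch with hypothetical values:

```python
# The reliable way to grow the driver heap is at launch, e.g.:
#   spark-submit --driver-memory 8g my_graph_job.py   (hypothetical values)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-job")             # hypothetical name
         .config("spark.executor.memory", "8g")  # effective: executors not yet started
         .getOrCreate())

# Shows what the driver JVM actually received at launch:
print(spark.sparkContext.getConf().get("spark.driver.memory", "(default)"))
```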

Wish you give our product a wonderful name

2017-09-08 Thread Jone Zhang
We have built an ML platform based on open-source frameworks like Hadoop, Spark, and TensorFlow. Now we need to give our product a wonderful name, and we are eager for everyone's advice. Any suggestions will be greatly appreciated. Thanks.

[no subject]

2017-09-08 Thread PICARD Damien
Hi! I'm facing a classloader problem using Spark 1.5.1. I use javax.validation and Hibernate Validator annotations on some of my beans: @NotBlank @Valid private String attribute1; @Valid private String attribute2; When Spark tries to unmarshal these beans (after a remote RDD),
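Not a confirmed fix for this thread, but conflicts like this, where Spark's bundled jars shadow the application's javax.validation / Hibernate Validator versions, are often worked around by letting user jars take precedence on the executors; the flag below does exist (marked experimental) in Spark 1.5. A sketch:

```python
from pyspark import SparkConf, SparkContext

# Let application jars win over Spark's bundled ones on the executors.
# The driver-side twin, spark.driver.userClassPathFirst, has to be passed
# on spark-submit instead, since the driver JVM is already running here.
conf = (SparkConf()
        .setAppName("bean-unmarshalling")  # hypothetical app name
        .set("spark.executor.userClassPathFirst", "true"))
sc = SparkContext(conf=conf)
```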

Part-time job

2017-09-08 Thread Uğur Sopaoğlu
Hi all, I have been working with Spark for about 8 months, but I have not fully mastered it through self-study alone. So I want to take a part-time job on a project; I believe it will both contribute to my own development and benefit others. I *do not have any salary* expectations. Can you help

Re: Spark Dataframe returning null columns when schema is specified

2017-09-08 Thread Praneeth Gayam
What is the desired behaviour when a field is null for only a few records? You cannot avoid nulls in this case. But if all rows are guaranteed to be uniform (either all-null or all-non-null), you can *take* the first row of the DF and *drop* the columns with null fields. On Fri, Sep 8, 2017 at
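A sketch of that take-and-drop approach, assuming df is the DataFrame in question, it is non-empty, and the all-or-nothing null guarantee actually holds:

```python
first = df.first()                                        # one representative row
null_cols = [c for c in df.columns if first[c] is None]   # columns null in that row
df_clean = df.drop(*null_cols)                            # drop them everywhere
```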

Re: Chaining Spark Streaming Jobs

2017-09-08 Thread Praneeth Gayam
With a file stream you will have to deal with the following (see the sketch below): 1. The file(s) must not be changed once created, so if the files are being continuously appended, the new data will not be read. Refer 2.
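A minimal file-source sketch illustrating point 1: the source picks up files only when they first appear under the directory, so data appended to an existing file is never re-read, and files should be written elsewhere and moved in atomically. Path and schema are hypothetical, and an existing SparkSession named spark is assumed.

```python
from pyspark.sql.types import StructType, StringType

# Hypothetical schema; file sources require one to be given explicitly.
schema = StructType().add("id", StringType()).add("payload", StringType())

stream = (spark.readStream
          .schema(schema)
          .json("/data/incoming/"))  # hypothetical landing directory

query = stream.writeStream.format("console").start()
query.awaitTermination()
```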