Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, great work, Dongjoon! Dongjoon Hyun wrote on Tue, Jan 15, 2019 at 3:47 PM: > We are happy to announce the availability of Spark 2.2.3! > > Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 > maintenance branch of Spark. We strongly recommend all 2.2.x users

[ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-14 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.2.3! Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade to this stable release. To download Spark 2.2.3, head over to the download page: http

Monthly Apache Spark Newsletter

2018-11-20 Thread Ankur Gupta
Hey Guys, Just launched a monthly Apache Spark Newsletter. https://newsletterspot.com/apache-spark/ Cheers, Ankur

CVE-2018-17190: Unsecured Apache Spark standalone executes user code

2018-11-18 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: All versions of Apache Spark Description: Spark's standalone resource manager accepts code to execute on a 'master' host, which then runs that code on 'worker' hosts. The master itself does not, by design, execute user code

Re: Testing Apache Spark applications

2018-11-15 Thread Lars Albertsson
My previous answers to this question can be found in the archives, along with some other responses: http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html https://www.mail-archive.com/user%40spark.apache.org/msg48032.html I have made a couple of presentations

Re: Testing Apache Spark applications

2018-11-15 Thread Vitaliy Pisarev
Hard to answer in a succinct manner, but I'll give it a shot. Cucumber is a tool for writing *Behaviour* Driven Tests (closely related to behaviour driven development, BDD). It is not a mere *technical* approach to testing but a mindset, a way of working, and a different (different, whether it is

Re: Testing Apache Spark applications

2018-11-15 Thread ☼ R Nair
Sparklens from Qubole is a good tool. Other tests need to be handled by the developer. Best, Ravi. On Thu, Nov 15, 2018, 12:45 PM, wrote: > Hi all, > > How are you testing your Spark applications? > We are writing features by using Cucumber. This is testing the behaviours. > Is this called functional

Testing Apache Spark applications

2018-11-15 Thread Omer.Ozsakarya
Hi all, How are you testing your Spark applications? We are writing features using Cucumber. This tests the behaviours. Is this called a functional test or an integration test? We are also planning to write unit tests. For instance, we have a class like the one below. It has one method. This method
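
A minimal sketch of such a unit test, using ScalaTest and a local-mode SparkSession; the Cleaner object and trimNames method are hypothetical stand-ins for the class described above:

    import org.apache.spark.sql.{DataFrame, SparkSession, functions => F}
    import org.scalatest.FunSuite

    // Hypothetical transformation under test: trims a string column.
    object Cleaner {
      def trimNames(df: DataFrame): DataFrame =
        df.withColumn("name", F.trim(F.col("name")))
    }

    class CleanerSuite extends FunSuite {
      test("trimNames removes surrounding whitespace") {
        val spark = SparkSession.builder().master("local[2]").appName("unit-test").getOrCreate()
        import spark.implicits._
        val out = Cleaner.trimNames(Seq("  bob  ").toDF("name")).as[String].collect()
        assert(out.sameElements(Array("bob")))
        spark.stop()
      }
    }

A test like this exercises one method in isolation (a unit test); the Cucumber features described above, which drive the application's behaviour end to end, are closer to functional/acceptance tests.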

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-09 Thread purna pradeep
size? On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin wrote: > +user@ > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Xiao Li
he release manager, Wenchen! Bests, Dongjoon. On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan wrote:

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote:

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Li Gao
On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote:

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Swapnil Shinde
nchen Fan wrote: > + user list > On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote:

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Stavros Kontopoulos
-- Forwarded message - From: Wenchen Fan. Date: Thu, Nov 8, 2018 at 10:55 PM. Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0. To: Spark dev list

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Jules Damji
hen Fan wrote: > + user list > On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: resend > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > -- Forwarded message -

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Dongjoon Hyun
at 11:02 PM Wenchen Fan wrote: > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0 > To: Spark dev list

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list. On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > S

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Marcelo Vanzin
+user@ > -- Forwarded message - > From: Wenchen Fan > Date: Thu, Nov 8, 2018 at 10:55 PM > Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0 > To: Spark dev list > Hi all, > Apache Spark 2.4.0 is the

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham
stage 0.0 (TID 40)

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
shdown": "true". The test table in Hive points to hdfs://test/ and is partitioned on date. val sqlStr = s"select * from test where date > 20181001"; val logs = spark.sql(sqlStr). With the Hive query I don't see filter pushdown happening. I tried sett

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
tr). With the Hive query I don't see filter pushdown happening. I tried setting these configs in both hive-site.xml and also via spark.sqlContext.setConf: "hive.optimize.ppd":"true", "hive.optimize.ppd.storage":"true"

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
orc files, I tried running a Hive query on the same dataset, but I was not able to push the filter predicate. Where should I set the configs below? "hive.optimize.ppd":"true", "hive.optimize.ppd.storage":"true" Sug

Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
ut I was not able to push the filter predicate. Where should I set the configs below? "hive.optimize.ppd":"true", "hive.optimize.ppd.storage":"true" Suggest what is the best way to read orc files from
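
For Spark's own ORC reader (as opposed to querying through the Hive table), predicate pushdown is controlled by spark.sql.orc.filterPushdown rather than the hive.optimize.ppd settings; a minimal sketch, assuming the hdfs://test/ layout mentioned above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-pushdown").getOrCreate()

    // ORC predicate pushdown has been off by default in some 2.x releases;
    // enable it explicitly.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    // Reading the partitioned layout directly lets the partition filter prune
    // directories; remaining predicates can be pushed into the ORC reader.
    val logs = spark.read.orc("hdfs://test/").where("date > 20181001")
    logs.explain(true) // look for PushedFilters / PartitionFilters in the plan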

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-30 Thread Jörn Franke
n alang wrote: > Hello - is there a "performance" difference when using Java or Scala for Apache Spark? > I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but wrt performance - I think

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-30 Thread akshay naidu
How about Python? Java vs Scala vs Python vs R: which is better? On Sat, Oct 27, 2018 at 3:34 AM karan alang wrote: > Hello - is there a "performance" difference when using Java or Scala for Apache Spark? > I understand, there are other obvious differences (less

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Gourav Sengupta
Scala. Now if you want to do data science, Java is probably not the best tool yet... On Oct 26, 2018, at 6:04 PM, karan alang wrote: > Hello - is there a "performance" difference when using Java or Scala for Apache Spark?

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread kant kodali
of your team to Scala. Now if you want to do data science, Java is probably not the best tool yet... On Oct 26, 2018, at 6:04 PM, karan alang wrote: > Hello - is there a "performance" difference when using Java or Scala for Apache Spark? > I unders

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
... On Oct 26, 2018, at 6:04 PM, karan alang wrote: > Hello - is there a "performance" difference when using Java or Scala for Apache Spark? > I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.)

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread Battini Lakshman
On Oct 27, 2018 3:34 AM, "karan alang" wrote: Hello - is there a "performance" difference when using Java or Scala for Apache Spark? I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but wrt performance - I think the

java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread karan alang
Hello - is there a "performance" difference when using Java or Scala for Apache Spark? I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but wrt performance, I think there would not be much of a difference since both of them are

CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines

2018-10-24 Thread Sean Owen
Severity: Low Vendor: The Apache Software Foundation Versions Affected: 1.3.x release branch and later, including master Description: Spark's Apache Maven-based build includes a convenience script, 'build/mvn', that downloads and runs a zinc server to speed up compilation. This server will

Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important > It also seems interesting. > I was in a meeting; I will also watch it. > From: Gourav Sengupta > Date: 24 October 2018 Wednesday 13:39

Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Omer.Ozsakarya
Thank you, Gourav. Today I saw the article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important It also seems interesting. I was in a meeting; I will also watch it. From: Gourav Sengupta. Date: 24 October 2018 Wednesday 13:39

Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
To: "Ozsakarya, Omer". Cc: Spark Forum. Subject: Re: Triggering sql on Was S3 via Apache Spark. > This is interesting: you asked and then (almost) answered the questions as well. > Regards, Gourav > On Tue, 23 O

Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Omer.Ozsakarya
Thank you very much. From: Gourav Sengupta. Date: 24 October 2018 Wednesday 11:20. To: "Ozsakarya, Omer". Cc: Spark Forum. Subject: Re: Triggering sql on Was S3 via Apache Spark. This is interesting: you asked and then (almost) answered the questions as well. Regards, Gourav. On Tue, 2

Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
This is interesting: you asked and then (almost) answered the questions as well. Regards, Gourav. On Tue, 23 Oct 2018, 13:23, wrote: > Hi guys, > We are using Apache Spark on a local machine. > I need to implement the scenario below.

Re: Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Jörn Franke
ver-on-prem-put-into-s3-bucket > Thanks, > Divya > On Tue, 23 Oct 2018 at 15:53, wrote: > Hi guys, > We are using Apache Spark on a local machine. > I need to implement the scenario b

Re: Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Divya Gehlot
-ftp-server-on-prem-put-into-s3-bucket Thanks, Divya. On Tue, 23 Oct 2018 at 15:53, wrote: > Hi guys, > We are using Apache Spark on a local machine. > I need to implement the scenario below. > In the initial load: > 1. CRM applic

Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Omer.Ozsakarya
Hi guys, We are using Apache Spark on a local machine. I need to implement the scenario below. In the initial load: 1. The CRM application will send a file to a folder. This file contains the customer information of all customers. This file is in a folder on the local server. File name

Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi, Just a small plug: the Triangle Apache Spark Meetup (TASM) covers Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back in July 2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/

[ANNOUNCE] Announcing Apache Spark 2.3.2

2018-09-26 Thread Saisai Shao
We are happy to announce the availability of Spark 2.3.2! Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release. To download Spark 2.3.2, head over to the download page: http

Apache Spark and Airflow connection

2018-09-24 Thread Uğur Sopaoğlu
I have a Docker-based cluster. In my cluster, I try to schedule Spark jobs by using Airflow. Airflow and Spark are running separately in *different containers*. However, I cannot run a Spark job using Airflow. Below is my Airflow script: from airflow import DAG from

Padova Apache Spark Meetup

2018-09-05 Thread Matteo Durighetto
Hello, we are creating a new meetup of enthusiastic Apache Spark users in Italy, in Padova: https://www.meetup.com/Padova-Apache-Spark-Meetup/ Is it possible to add the meetup link to the web page https://spark.apache.org/community.html? Moreover, is it possible to announce future

CVE-2018-11770: Apache Spark standalone master, Mesos REST APIs not controlled by authentication

2018-08-13 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Spark versions from 1.3.0, running standalone master with REST API enabled, or running Mesos master with cluster mode enabled Description: From version 1.3.0 onward, Spark's standalone master exposes a REST API for job

Data quality measurement for streaming data with apache spark

2018-08-01 Thread Uttam
Hello, I have a very general question about Apache Spark. I want to know if it is possible (and where to start, if possible) to implement a data quality measurement prototype for streaming data using Apache Spark. Let's say I want to work on Timeliness or Completeness as a data quality metric

Apache Spark Cluster

2018-07-23 Thread Uğur Sopaoğlu
We are trying to create a cluster which consists of 4 machines. The cluster will be used by multiple users. How can we configure it so that users can submit jobs from their personal computers, and is there any free tool you can suggest to support this procedure? -- Uğur Sopaoğlu

CVE-2018-8024 Apache Spark XSS vulnerability in UI

2018-07-11 Thread Sean Owen
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Spark versions through 2.1.2 Spark 2.2.0 through 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's possible for a malicious user to construct a URL pointing

CVE-2018-1334 Apache Spark local privilege escalation vulnerability

2018-07-11 Thread Sean Owen
Severity: High Vendor: The Apache Software Foundation Versions affected: Spark versions through 2.1.2 Spark 2.2.0 to 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when using PySpark or SparkR, it's possible for a different local user

[ANNOUNCE] Apache Spark 2.2.2

2018-07-10 Thread Tom Graves
We are happy to announce the availability of Spark 2.2.2! Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade to this stable release. The release notes are available at  http://spark.apache.org

[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3! Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. The release notes are available at http://spark.apache.org/releases

Apache Spark use case: correlate data strings from file

2018-06-20 Thread darkdrake
with these data strings. I'm trying to understand if Apache Spark can fit my use case. The only input data will be these strings from this file. Can I correlate these events, and how? Is there a GUI to do it? Any hints and advice will be appreciated. Best regards, Simone

[ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-11 Thread Marcelo Vanzin
We are happy to announce the availability of Spark 2.3.1! Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release. To download Spark 2.3.1, head over to the download page: http

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
though after the checkpoint dir is deleted. > I don't know how Spark does this without the checkpoint's metadata.

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I met the same issue. I tried deleting the checkpoint dir before the job, but Spark still seems to read the correct offset even after the checkpoint dir is deleted. I don't know how Spark does this without the checkpoint's metadata.

Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably want "spark-shell" to be recognized as a command in your environment. Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell". Have you tried "./spark-shell" in the current path to see if it works? Thank You, Irving Duran. On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:

Apache Spark is not working as expected

2018-05-30 Thread remil
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ spark-shell
spark-shell: command not found
(attachment: Spark.odt <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9314/Spark.odt>)

Apache spark on windows without shortnames enabled

2018-04-15 Thread ashwini
Hi, We use Apache Spark 2.2.0 in our stack. Like other software, ours gets installed by default under "C:\Program Files\". We have a restriction that we cannot ask our customers to enable short names on their machines. From our experience, Spark does not handle the absolute

Apache spark -2.1.0 question in Spark SQL

2018-04-03 Thread anbu
- some manipulation happens here, and finally it returns an array of rows: return res[Row] }. Could someone please help me understand what is causing the issue here? I have tested it; the import spark.implicits is not working. How do I fix this error, or else help me with a different approach here

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events. The processing is in two steps: 1. flatMapGroupsWithState (keyed by user id), which stores the state of the user and emits events every minute until an expire event is received. 2. The next step
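
A minimal sketch of step 1 on the 2.2 API; the Event, SessionState, and SessionUpdate classes are hypothetical, and events is assumed to be a streaming Dataset[Event] already read from the source (with spark.implicits._ in scope):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(userId: String, ts: java.sql.Timestamp)
    case class SessionState(count: Long)
    case class SessionUpdate(userId: String, count: Long, expired: Boolean)

    val updates = events
      .groupByKey(_.userId)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
        (userId: String, batch: Iterator[Event], state: GroupState[SessionState]) =>
          if (state.hasTimedOut) {
            // Expire path: emit a final record and drop the state.
            val last = state.get
            state.remove()
            Iterator(SessionUpdate(userId, last.count, expired = true))
          } else {
            // Regular path: fold new events into the state, emit an interim update.
            val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + batch.size)
            state.update(updated)
            state.setTimeoutDuration("1 minute")
            Iterator(SessionUpdate(userId, updated.count, expired = false))
          }
      }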

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using Spark Structured Streaming 2.2.1 with flatMapGroupsWithState and a groupBy count operator. In the StreamExecution logs I see two entries for stateOperators: "stateOperators" : [ { "numRowsTotal" : 1617339, "numRowsUpdated" : 9647 }, { "numRowsTotal" :

Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi: I am working on Spark Structured Streaming (2.2.1) with Kafka and want 100 executors to stay alive. I set spark.executor.instances to 100. The process starts running with 100 executors, but after some time only a few remain, which causes a backlog of events from Kafka. I thought I saw a
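
The thread is truncated here, but one common cause of this symptom (an assumption, not a confirmed diagnosis) is dynamic allocation releasing idle executors; a sketch of pinning the executor count:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-stream")
      .config("spark.executor.instances", "100")
      // With dynamic allocation enabled, idle executors are released after a
      // timeout, which looks like executors dying; disable it for a fixed pool.
      .config("spark.dynamicAllocation.enabled", "false")
      .getOrCreate()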

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a realtime application using Spark Structured Streaming (v2.2.1). The application reads data from Kafka and, if there is a failure, I would like to ignore the checkpoint. Is there any configuration to just read from the last Kafka offset after a failure and ignore any offset
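
There is no option to keep a checkpoint but skip its offsets; a common workaround (a sketch with hypothetical broker, topic, and path names, assuming an existing SparkSession named spark) is to start the query with a fresh checkpoint directory so the Kafka source's startingOffsets option takes effect:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      // startingOffsets is consulted only on the first start of a query;
      // restarts against an existing checkpoint resume from checkpointed offsets.
      .option("startingOffsets", "latest")
      .load()

    stream.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/ckpt-v2") // a new dir means fresh offsets
      .start()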

Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
> version etc., please let me know. Thanks. Here is the exception stack trace:
java.util.concurrent.TimeoutException: Cannot fetch record for offset <offset#> in 120000 milliseconds
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
onds
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:219)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
  at org.apache.spark.sql.ka

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
e look at the UI if you have not already; it can provide a lot of information.

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with Spark Structured Streaming (2.2.1) reading data from Kafka (0.11). I need to aggregate data ingested every minute, and I am using spark-shell at the moment. The message ingestion rate is approx 500k/second. During some trigger intervals (1 minute), especially when

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
Thanks, Richard. I am hoping that the Spark team will at some point provide more detailed documentation. On Sunday, February 11, 2018 2:17 AM, Richard Qiao wrote: Can't find a good source for documents, but the source code

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
Can't find a good source for documents, but the source code "org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec. This is explaining
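
Those derived rates are also exposed on the StreamingQueryProgress object, so they can be read programmatically; a small sketch, assuming a running StreamingQuery named query:

    val p = query.lastProgress // org.apache.spark.sql.streaming.StreamingQueryProgress
    println(s"input rows/sec:     ${p.inputRowsPerSecond}")     // numRecords / inputTimeSec
    println(s"processed rows/sec: ${p.processedRowsPerSecond}") // numRecords / processingTimeSec
    p.stateOperators.foreach { s =>
      println(s"state rows: total=${s.numRowsTotal}, updated=${s.numRowsUpdated}")
    }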

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi: I am working with Spark 2.2.0 and am looking at the query status console output. My application reads from Kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts. The output is sent to the console sink. I see the following output (with my questions

Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts: I am trying to use a stateful UDF with Spark Structured Streaming that needs to update its state periodically. Here is the scenario: 1. I have a UDF with a variable with a default value (e.g., 1). This value is applied to a column (e.g., subtract the variable from the column value

Free access to Index Conf for Apache Spark community attendees

2018-02-08 Thread xwu0226
Free access to Index Conf for Apache Spark session attendees. For info go to: https://www.meetup.com/SF-Big-Analytic IBM is hosting a developer conference. Essentially, the conference is 'By Developers, for Developers', based on open technologies. This will be held Feb 20 - 22nd in Moscone West

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Hi Jacek: Thanks for your response. I am just trying to understand the fundamentals of watermarking and how it behaves in aggregation vs non-aggregation scenarios. On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski wrote: Hi, What would you expect? The data is

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi, What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding, at least. Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how the watermark works in cases where the event time is earlier than the processing timestamp. On Friday, February 2, 2018 8:47 AM, M Singh wrote: Hi Vishnu/Jacek: Thanks for your responses. Jacek - at the moment, the current time

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.sca

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishnu/Jacek: Thanks for your responses. Jacek - at the moment, the current time for my use case is processing time. Vishnu - the Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can dedup using a watermark. So I believe

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
5)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread Tathagata Das
Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the plan output? On Wed, Jan 31, 2018 at 3:35 PM, M Singh wrote: > Hi Folks: > > I have to add a column to a structured *streaming* dataframe but when I

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks: I have to add a column to a structured *streaming* dataframe, but when I do that (using select or withColumn) I get an exception. I can add a column to a non-streaming structured dataframe. I could not find any documentation on how to do this in the following doc

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans, The watermark in Spark is used to decide when to clear the state, so if the event is delayed beyond the point when the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark Yes, this state is
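
A minimal sketch of watermarked deduplication as described in the programming guide; the column names are hypothetical, and events is assumed to be a streaming DataFrame with an eventTime timestamp column:

    val deduped = events
      .withWatermark("eventTime", "10 minutes") // state older than the watermark is cleared
      .dropDuplicates("userId", "eventTime")    // duplicates arriving past the watermark are dropped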

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Dongjoon Hyun
Hi, Nicolas. Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901 (feature parity for ORC with Parquet). For your questions, the following three are related. 1. spark.sql.orc.impl="native". By default, the `native` ORC implementation (based on the latest ORC 1.4.1
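
A sketch of opting into the new reader on Spark 2.3 (the path is hypothetical, and spark is an existing SparkSession):

    // Select the ORC 1.4-based reader and its vectorized path (Spark 2.3+).
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

    val df = spark.read.format("orc").load("/path/to/orc")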

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Nicolas Paris
Hi, Thanks for this work. Will this affect both: 1) spark.read.format("orc").load("...") and 2) spark.sql("select ... from my_orc_table_in_hive")? On Jan 10, 2018 at 20:14, Dongjoon Hyun wrote: > Hi, All. > Vectorized ORC Reader is now suppor

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
Hi, I'm curious how you would implement the requirement "by a certain amount of time" without a watermark. How would you know what's current and compute the lag? Let's forget about the watermark for a moment and see if it pops up as an inevitable feature :) "I am trying to filter out records which are

Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi: I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time. Is the watermark API applicable to this scenario (i.e., filtering lagging records), or is it only applicable with aggregation? I could not get a clear understanding from the

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
Thanks, TD. When is 2.3 scheduled for release? On Thursday, January 25, 2018 11:32 PM, Tathagata Das wrote: Hello Mans, The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new

Re: Apache Spark - Custom structured streaming data source

2018-01-25 Thread Tathagata Das
Hello Mans, The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this point of time, it's hard to make any concrete suggestion. You can take a

Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at some of the methods available in the SparkSession, but these are internal to the sql package: private[sql] def

Re: good materiala to learn apache spark

2018-01-18 Thread Marco Mistroni
Jacek Laskowski, on this mailing list, wrote a book which is available online. HTH. On Jan 18, 2018 6:16 AM, "Manuel Sopena Ballesteros" <manuel...@garvan.org.au> wrote: > Dear Spark community, > I would like to learn more about Apache Spark. I have a Horto

good materiala to learn apache spark

2018-01-17 Thread Manuel Sopena Ballesteros
Dear Spark community, I would like to learn more about Apache Spark. I have a Hortonworks HDP platform and have run a few Spark jobs in a cluster, but now I need to know in more depth how Spark works. My main interest is the sysadmin and operational side of Spark and its ecosystem

Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All. The Vectorized ORC Reader is now supported in Apache Spark 2.3. https://issues.apache.org/jira/browse/SPARK-16060 It has been a long journey. From now on, Spark can read ORC files faster without a feature penalty. Thank you for all your support, especially Wenchen Fan. It's done by two

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
Saisai Shao; Raj Adyanthaya; spark users. Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0. My current best guess is that Spark does not fully support Hadoop 3.x, because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There are also likely to be transitive dependency conflicts which will need to be resolved. On Mon, Jan

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread akshay naidu
Yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and later', but my confusion is because Spark was released on 1st Dec and the Hadoop 3 stable version was released on 13th Dec. And to my similar question on stackoverflow.com

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Saisai Shao
AFAIK, there's no large-scale test for Hadoop 3.0 in the community, so it is not clear whether it is supported or not (or has some issues). I think on the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it supports Hadoop 2.7+ (2.8, ...), but not 3.0 (IIUC). Thanks, Jerry

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Raj Adyanthaya
Hi Akshay On the Spark Download page when you select Spark 2.2.1 it gives you an option to select package type. In that, there is an option to select "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it does support Hadoop 3.0. http://spark.apache.org/downloads.html

Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-06 Thread akshay naidu
Hello users, I need to know whether we can run the latest Spark on the latest Hadoop version, i.e., Spark 2.2.1 released on 1st Dec and Hadoop 3.0.0 released on 13th Dec. Thanks.

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
Hi Jacek: The javadoc mentions that we can only consume data from the dataframe in the addBatch method. So, if I would like to save the data to a new sink, then I believe I will need to collect the data and then save it. This is the reason I am asking about how to control the size of

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
Hi, > If the data is very large then a collect may result in OOM. That's a general concern in any part of Spark, incl. Spark Structured Streaming. Why would you collect in addBatch? It's on the driver side and, like anything on the driver, it's a single JVM (and usually not fault tolerant). > Do
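
For reference, a skeleton of the internal Spark 2.x Sink contract under discussion; the collect-and-rewrap pattern below mirrors what the built-in console sink does, and that collect() is exactly the step that risks OOM for large batches (the destination path is hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class MySink(path: String) extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Materialize the micro-batch on the driver, then re-wrap it as a
        // plain batch DataFrame before handing it to the batch writer API.
        val batch = data.sparkSession.createDataFrame(
          data.sparkSession.sparkContext.parallelize(data.collect()), data.schema)
        batch.write.mode("append").parquet(s"$path/batch_id=$batchId")
      }
    }

Batch size itself is easier to bound at the source, e.g. with the Kafka reader's maxOffsetsPerTrigger option, than inside addBatch.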
