Re: [ANNOUNCE] Announcing Apache Spark 2.3.3
Yay! Good job, Takeshi!

On Mon, 18 Feb 2019, 14:47, Takeshi Yamamuro wrote:
> We are happy to announce the availability of Spark 2.3.3!
[ANNOUNCE] Announcing Apache Spark 2.3.3
We are happy to announce the availability of Spark 2.3.3!

Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend that all 2.3.x users upgrade to this stable release.

To download Spark 2.3.3, head over to the download page: http://spark.apache.org/downloads.html

To view the release notes: https://spark.apache.org/releases/spark-release-2-3-3.html

We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you.

Best,
Takeshi
Re: CVE-2018-11760: Apache Spark local privilege escalation vulnerability
I received some questions about the exact change that fixed the issue, and the PMC decided to post the information in JIRA to make it easier for the community to track. The relevant details are all on https://issues.apache.org/jira/browse/SPARK-26802

On Mon, Jan 28, 2019 at 1:08 PM Imran Rashid wrote:
> Severity: Important
CVE-2018-11760: Apache Spark local privilege escalation vulnerability
Severity: Important

Vendor: The Apache Software Foundation

Versions affected:
All Spark 1.x, Spark 2.0.x, and Spark 2.1.x versions
Spark 2.2.0 to 2.2.2
Spark 2.3.0 to 2.3.1

Description:
When using PySpark, it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. This affects versions 1.x, 2.0.x, 2.1.x, 2.2.0 to 2.2.2, and 2.3.0 to 2.3.1.

Mitigation:
1.x, 2.0.x, 2.1.x, and 2.2.x users should upgrade to 2.2.3 or newer.
2.3.x users should upgrade to 2.3.2 or newer.
Otherwise, affected users should avoid using PySpark in multi-user environments.

Credit:
This issue was reported by Luca Canali and Jose Carlos Luna Duran from CERN.

References:
https://spark.apache.org/security.html
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Thanks, Dongjoon!

On Wed, Jan 16, 2019 at 5:23 PM Hyukjin Kwon wrote:
> Nice!
>
> On Wednesday, January 16, 2019 at 11:55 AM, Jiaan Geng wrote:
>> Glad to hear this.

--
---
Takeshi Yamamuro
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Nice!

On Wednesday, January 16, 2019 at 11:55 AM, Jiaan Geng wrote:
> Glad to hear this.
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Glad to hear this.
Re: [ANNOUNCE] Announcing Apache Spark 2.2.3
Congrats! Great work, Dongjoon.

On Tuesday, January 15, 2019 at 3:47 PM, Dongjoon Hyun wrote:
> We are happy to announce the availability of Spark 2.2.3!

--
Best Regards
Jeff Zhang
[ANNOUNCE] Announcing Apache Spark 2.2.3
We are happy to announce the availability of Spark 2.2.3!

Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend that all 2.2.x users upgrade to this stable release.

To download Spark 2.2.3, head over to the download page: http://spark.apache.org/downloads.html

To view the release notes: https://spark.apache.org/releases/spark-release-2-2-3.html

We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you.

Bests,
Dongjoon.
Monthly Apache Spark Newsletter
Hey guys,

Just launched a monthly Apache Spark newsletter: https://newsletterspot.com/apache-spark/

Cheers,
Ankur
CVE-2018-17190: Unsecured Apache Spark standalone executes user code
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
All versions of Apache Spark

Description:
Spark's standalone resource manager accepts code to execute on a 'master' host, which then runs that code on 'worker' hosts. The master itself does not, by design, execute user code. A specially-crafted request to the master can, however, cause the master to execute code too. Note that this does not affect standalone clusters with authentication enabled. While the master host typically has less outbound access to other resources than a worker, the execution of code on the master is nevertheless unexpected.

Mitigation:
Enable authentication on any Spark standalone cluster that is not otherwise secured from unwanted access, for example by network-level restrictions. Use spark.authenticate and the related security properties described at https://spark.apache.org/docs/latest/security.html

Credit:
Andre Protas, Apple Information Security

References:
https://spark.apache.org/security.html
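For concreteness, a minimal sketch of what the mitigation looks like on the application side, in Scala: the configuration keys come from the security page linked above, the secret value is a placeholder, and the same secret must also be configured on the master and worker daemons (for example via spark-defaults.conf or SPARK_MASTER_OPTS / SPARK_WORKER_OPTS).

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  // Shared-secret authentication for a standalone cluster application.
  // The secret is a placeholder; every daemon and application must use
  // the same value.
  val conf = new SparkConf()
    .setAppName("authenticated-app")
    .set("spark.authenticate", "true")
    .set("spark.authenticate.secret", "replace-with-a-long-random-secret")

  val spark = SparkSession.builder().config(conf).getOrCreate()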
Re: Testing Apache Spark applications
My previous answers to this question can be found in the archives, along with some other responses:
http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html

I have made a couple of presentations on the subject. Slides and video are linked on this page: http://www.mapflat.com/presentations/

You can find more material in this list of resources: http://www.mapflat.com/lands/resources/reading-list

Happy testing!

Regards,
Lars Albertsson
Data engineering entrepreneur
www.mimeria.com, www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109

On Thu, Nov 15, 2018 at 6:45 PM wrote:
> Hi all,
>
> How are you testing your Spark applications?
Re: Testing Apache Spark applications
Hard to answer in a succinct manner, but I'll give it a shot.

Cucumber is a tool for writing *behaviour*-driven tests (closely related to behaviour-driven development, BDD). It is not a mere *technical* approach to testing but a mindset, a way of working, and a different (different, whether it is better is a matter of controversy) way to structure communication between product and R&D. I will not elaborate more, as there is plenty of material out there if you want to educate yourself. Just bear in mind that BDD is riddled with misconceptions; more often than not I see people just using Cucumber but not doing actual BDD.

Regarding unit testing, I do not consider the code you showed to be a good candidate for unit testing. There is very little procedural logic there, and there is a good chance that if you go about unit testing it you will end up with lots and lots of mocks overly bound to the implementation details of the code under test, rendering the tests unmaintainable and brittle.

I would argue that unit tests are more appropriate for code that is algorithmic in nature, that has no or very few dependencies, and where you have an absolute oracle of truth regarding your expectations of it. I think that in your situation, going for integration tests (on small-scale data) and regression tests would give you the most ROI.

On Thu, Nov 15, 2018 at 8:43 PM ☼ R Nair wrote:
> Sparklens from Qubole is a good source. Other tests are to be handled by the developer.
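Picking up on that point, one way to make the posted doJob more amenable to testing is to split the pure transformation from the I/O. The sketch below is hypothetical (the cleansing rule and table name are invented for illustration, in Scala): the DataFrame-to-DataFrame part can be unit tested with a local SparkSession and a handful of in-memory rows, while the database writes stay in a thin wrapper covered by integration tests rather than mocks.

  import org.apache.spark.sql.{DataFrame, SparkSession}

  // Pure transformation: no side effects, easy to unit test.
  object CustomerTransformations {
    def cleanse(input: DataFrame): DataFrame =
      input.filter("customerid IS NOT NULL").dropDuplicates("customerid")
  }

  // Thin I/O wrapper: inserts/updates/deletes live here and are covered
  // by integration tests on small-scale data.
  class CustomerOperations(spark: SparkSession) {
    def doJob(inputDataFrame: DataFrame): Unit = {
      val cleansed = CustomerTransformations.cleanse(inputDataFrame)
      cleansed.write.mode("append").saveAsTable("customers")
    }
  }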
Re: Testing Apache Spark applications
Sparklens from Qubole is a good source. Other tests are to be handled by the developer.

Best,
Ravi

On Thu, Nov 15, 2018, 12:45 PM wrote:
> Hi all,
>
> How are you testing your Spark applications?
Testing Apache Spark applications
Hi all,

How are you testing your Spark applications?

We are writing features by using Cucumber. This tests the behaviours. Is this called a functional test or an integration test?

We are also planning to write unit tests.

For instance, we have a class like the one below. It has one method, and this method implements several things: DataFrame operations, saving a DataFrame into a database table, and insert/update/delete statements.

Our classes generally contain 2 or 3 methods, and these methods cover a lot of tasks in the same function definition (like the function below). So I am not sure how I can write unit tests for these classes and methods. Do you have any suggestions?

class CustomerOperations {

  def doJob(inputDataFrame: DataFrame) = {
    // definitions (value/variable)
    // spark context, session etc. definition

    // filtering, cleansing on inputDataFrame and save results on a new dataframe
    // insert new dataframe to a database table
    // several insert/update/delete statements on the database tables
  }
}
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Thanks, this is great news!

Can you please let me know if dynamic resource allocation is available in Spark 2.4? I'm using Spark 2.3.2 on Kubernetes. Do I still need to provide executor memory options as part of the spark-submit command, or will Spark manage the required executor memory based on the size of the Spark job?

On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin wrote:
> +user@
>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line.
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Try to clear your browsing data or use a different web browser.

Enjoy it,
Xiao

On Thu, Nov 8, 2018 at 4:15 PM Reynold Xin wrote:
> Do you have a cached copy? I see it here: http://spark.apache.org/downloads.html
>
> On Thu, Nov 8, 2018 at 4:12 PM Li Gao wrote:
>> This is wonderful! I noticed the official Spark download site does not have 2.4 download links yet.
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Do you have a cached copy? I see it here: http://spark.apache.org/downloads.html

On Thu, Nov 8, 2018 at 4:12 PM Li Gao wrote:
> This is wonderful! I noticed the official Spark download site does not have 2.4 download links yet.
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
This is wonderful! I noticed the official Spark download site does not have 2.4 download links yet.

On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde wrote:
> Great news.. thank you very much!
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Great news.. thank you very much!

On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:
> Awesome!
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Awesome!

On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji wrote:
> Indeed!
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Indeed!

Sent from my iPhone
Pardon the dumb thumb typos :)

On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun wrote:
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
Finally, thank you all. Especially, thanks to the release manager, Wenchen!

Bests,
Dongjoon.

On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan wrote:
> + user list
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
+ user list

On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote:
> resend
>
> Forwarded message:
> From: Wenchen Fan
> Date: Thu, Nov 8, 2018 at 10:55 PM
> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
> To: Spark dev list
>
> Hi all,
>
> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds Barrier Execution Mode for better integration with deep learning frameworks, introduces 30+ built-in and higher-order functions to make it easier to work with complex data types, improves the Kubernetes integration, and adds experimental Scala 2.12 support. Other major updates include the built-in Avro data source, the image data source, flexible streaming sinks, elimination of the 2GB block-size limitation during transfer, and Pandas UDF improvements. In addition, this release continues to focus on usability, stability, and polish while resolving around 1100 tickets.
>
> We'd like to thank our contributors and users for their contributions and early feedback to this release. This release would not have been possible without you.
>
> To download Spark 2.4.0, head over to the download page: http://spark.apache.org/downloads.html
>
> To view the release notes: https://spark.apache.org/releases/spark-release-2-4-0.html
>
> Thanks,
> Wenchen
>
> PS: If you see any issues with the release notes, webpage or published artifacts, please contact me directly off-list.
Re: [ANNOUNCE] Announcing Apache Spark 2.4.0
+user@

> Forwarded message:
> From: Wenchen Fan
> Date: Thu, Nov 8, 2018 at 10:55 PM
> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
> To: Spark dev list
>
> Hi all,
>
> Apache Spark 2.4.0 is the fifth release in the 2.x line.

--
Marcelo
Re: Apache Spark orc read performance when reading large number of small files
When I run

  spark.read.orc("hdfs://test").filter("conv_date = 20181025").count

with spark.sql.orc.filterPushdown=true, I see the following in the executor logs, so predicate pushdown is happening:

  18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date) leaf-1 = (EQUALS conv_date 20181025) expr = (and (not leaf-0) leaf-1)

But when I run the Hive query in Spark:

  spark.sql("select * from test where conv_date = 20181025").count

I see these logs instead:

  18/11/01 17:37:57 INFO HadoopRDD: Input split: hdfs://test/test1.orc:0+34568
  18/11/01 17:37:57 INFO OrcRawRecordMerger: min key = null, max key = null
  18/11/01 17:37:57 INFO ReaderImpl: Reading ORC rows from hdfs://test/test1.orc with {include: [true, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], offset: 0, length: 9223372036854775807}
  18/11/01 17:37:57 INFO Executor: Finished task 224.0 in stage 0.0 (TID 33). 1662 bytes result sent to driver
  18/11/01 17:37:57 INFO CoarseGrainedExecutorBackend: Got assigned task 40
  18/11/01 17:37:57 INFO Executor: Running task 956.0 in stage 0.0 (TID 40)
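The OrcRawRecordMerger and ReaderImpl lines suggest the Hive table is being read through Hive's ORC SerDe rather than Spark's native ORC reader, and spark.sql.orc.filterPushdown only applies to the native path. One experiment worth trying, hedged because spark.sql.hive.convertMetastoreOrc exists in Spark 2.x but its default and behaviour differ between versions, is to ask Spark to convert metastore ORC tables to its own data source:

  // Hedged sketch, not a guaranteed fix: read Hive metastore ORC tables
  // with Spark's native ORC data source so that filter pushdown applies.
  spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
  spark.conf.set("spark.sql.orc.filterPushdown", "true")

  spark.sql("select * from test where conv_date = 20181025").count()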
Re: Apache Spark orc read performance when reading large number of small files
A lot of small files is very inefficient in itself, and predicate pushdown will not help you much there unless you merge them into one large file (one large file can be processed much more efficiently). How did you validate that predicate pushdown did not work in Hive? Your Hive version is also very old; consider upgrading to at least Hive 2.x.

On 31.10.2018 at 20:35, gpatcham wrote:
> spark version 2.2.0
> Hive version 1.1.0
>
> There are a lot of small files.
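To make the merge suggestion concrete, here is a minimal compaction sketch in Scala; the paths and the target partition count are placeholders, and the partition count should be sized so each output file is roughly an HDFS block or larger:

  import org.apache.spark.sql.SparkSession

  // Read the many small ORC files and rewrite them as a few larger ones.
  val spark = SparkSession.builder().appName("orc-compaction").getOrCreate()

  spark.read
    .orc("hdfs://test/date=201810")
    .repartition(8) // placeholder: derive from total input size / target file size
    .write
    .mode("overwrite")
    .orc("hdfs://test-compacted/date=201810")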
Re: Apache Spark orc read performance when reading large number of small files
spark version 2.2.0
Hive version 1.1.0

There are a lot of small files.

Spark code:

  "spark.sql.orc.enabled": "true",
  "spark.sql.orc.filterPushdown": "true"

  val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003")

Hive:

  "spark.sql.orc.enabled": "true",
  "spark.sql.orc.filterPushdown": "true"

The test table in Hive points to hdfs://test/ and is partitioned on date.

  val sqlStr = s"select * from test where date > 20181001"
  val logs = spark.sql(sqlStr)

With the Hive query I don't see filter pushdown happening. I tried setting these configs in both hive-site.xml and also via spark.sqlContext.setConf:

  "hive.optimize.ppd": "true",
  "hive.optimize.ppd.storage": "true"
Re: Apache Spark orc read performance when reading large number of small files
How large are they? A lot of (small) files will cause a significant delay in progressing; try to merge as much as possible into one file. Can you please share the full source code in Hive and Spark, as well as the versions you are using?

On 31.10.2018 at 18:23, gpatcham wrote:
> When reading a large number of ORC files from HDFS under a directory, Spark doesn't launch any tasks for some amount of time.
Apache Spark orc read performance when reading large number of small files
When reading a large number of ORC files from HDFS under a directory, Spark doesn't launch any tasks for some amount of time, and I don't see any tasks running during that time. I'm using the command below to read the ORC files, with these spark.sql configs.

What is Spark doing under the hood when spark.read.orc is issued?

  spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")

  "spark.sql.orc.enabled": "true",
  "spark.sql.orc.filterPushdown": "true"

Also, instead of directly reading the ORC files, I tried running a Hive query on the same dataset, but I was not able to push the filter predicate down. Where should I set the configs below?

  "hive.optimize.ppd": "true",
  "hive.optimize.ppd.storage": "true"

What is the best way to read ORC files from HDFS, and what tuning parameters would you suggest?
Re: java vs scala for Apache Spark - is there a performance difference ?
Older versions of Spark indeed had lower performance with Python and R, due to the need to convert between JVM datatypes and Python/R datatypes. This changed with the integration of Apache Arrow (in Spark 2.3, I think). However, what you do after the conversion in those languages can still be slower than, for instance, in Java if you do not use Spark-only functions. It could also be faster (e.g. if you use a Python module implemented natively in C and no translation into C datatypes is needed).

Scala has, in certain cases, a more elegant syntax than Java (if you do not use lambdas). Sometimes this elegant syntax can lead to (unintentionally) inefficient code for which there is a better way to express the same thing (e.g. implicit conversions, use of collection methods, etc.). However, there are better ways, and you just have to spot these issues in the source code and address them if needed.

So a blanket comparison between those languages does not really make sense; it always depends.

On 30.10.2018 at 07:00, akshay naidu wrote:
> How about Python? Java vs Scala vs Python vs R: which is better?
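A tiny, hypothetical illustration (plain Scala collections, not Spark APIs) of the "elegant but unintentionally inefficient" pattern mentioned above:

  // Both expressions compute the same sum, but the first materializes an
  // intermediate vector per step, while the second fuses map and filter
  // into a single lazy pass over the data.
  val xs = (1 to 10000).toVector

  val chained  = xs.map(_ * 2).filter(_ % 3 == 0).sum          // builds two intermediate vectors
  val streamed = xs.iterator.map(_ * 2).filter(_ % 3 == 0).sum // one pass, no intermediates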
Re: java vs scala for Apache Spark - is there a performance difference ?
How about Python? Java vs Scala vs Python vs R: which is better?

On Sat, Oct 27, 2018 at 3:34 AM karan alang wrote:
> Hello - is there a "performance" difference when using Java or Scala for Apache Spark?
Re: java vs scala for Apache Spark - is there a performance difference ?
I genuinely do not think that Scala for Spark requires us to be experts in Scala. There is in fact a tutorial called "Just Enough Scala for Spark" which, even with my IQ, does not take more than 40 minutes to go through. Also, the syntax of Scala is almost always similar to that of Python.

Data processing is much more amenable to functional thinking, and therefore Scala suits it best; also, Spark is written in Scala.

Regards,
Gourav

On Mon, Oct 29, 2018 at 11:33 PM kant kodali wrote:
> When most people compare two different programming languages, 99% of the time it all seems to boil down to syntactic sugar.
Re: java vs scala for Apache Spark - is there a performance difference ?
When most people compare two different programming languages, 99% of the time it all seems to boil down to syntactic sugar.

Performance-wise, I doubt Scala is ever faster than Java, given that Scala likes the heap more than Java does. I have also written some pointless micro-benchmarking code (random string generation, hash computations, etc.) in Java, Scala, and Golang, and Java outperformed both Scala and Golang on many occasions.

Now that Java 11 has been released, things seem to get even better, given that the startup time is also very low.

I am happy to change my view as long as I can see some code and benchmarks!

On Mon, Oct 29, 2018 at 1:58 PM Jean Georges Perrin wrote:
> I did not see anything, but I am curious whether you find something.
Re: java vs scala for Apache Spark - is there a performance difference ?
I did not see anything, but I am curious whether you find something.

I think one of the big benefits of using Java for data engineering in the context of Spark is that you do not have to train a lot of your team in Scala. If you want to do data science, though, Java is probably not the best tool yet...

On Oct 26, 2018, at 6:04 PM, karan alang wrote:
> Hello - is there a "performance" difference when using Java or Scala for Apache Spark?
java vs scala for Apache Spark - is there a performance difference ?
Hello - is there a "performance" difference when using Java or Scala for Apache Spark?

I understand there are other obvious differences (less code with Scala, easier to focus on logic, etc.), but with regard to performance, I think there would not be much of a difference since both of them are JVM-based. Please let me know if this is not the case.

Thanks!
CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
The 1.3.x release branch and later, including master

Description:
Spark's Apache Maven-based build includes a convenience script, 'build/mvn', which downloads and runs a zinc server to speed up compilation. By default, this server accepts connections from external hosts. A specially-crafted request to the zinc server could cause it to reveal information in files readable to the developer account running the build. Note that this issue does not affect end users of Spark, only developers building Spark from source code.

Mitigation:
Spark users are not affected, as zinc is only a part of the build process. Spark developers may simply use a local Maven installation's 'mvn' command to build, and avoid running build/mvn and zinc. Spark developers building actively-developed branches (2.2.x, 2.3.x, 2.4.x, master) may update their branches to receive mitigations already patched onto the build/mvn script. Spark developers running zinc separately may include "-server 127.0.0.1" on its command line, and consider additional flags like "-idle-timeout 30m" to achieve similar mitigation.

Credit:
Andre Protas, Apple Information Security

References:
https://spark.apache.org/security.html
Re: Triggering SQL on AWS S3 via Apache Spark
I do not think security and governance have become important only now; they always were. Hortonworks and Cloudera have fantastic security implementations, and hence I mentioned updates via Hive.

Regards,
Gourav

On Wed, 24 Oct 2018, 17:32, <omer.ozsaka...@sony.com> wrote:
> Thank you Gourav,
>
> Today I saw the article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
>
> It seems also interesting.
Re: Triggering SQL on AWS S3 via Apache Spark
Thank you Gourav,

Today I saw the article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important

It seems also interesting. I was in a meeting; I will also watch it.

From: Gourav Sengupta
Date: 24 October 2018, Wednesday, 13:39
To: "Ozsakarya, Omer"
Cc: Spark Forum
Subject: Re: Triggering SQL on AWS S3 via Apache Spark

> Also try to read about SCD, and note that Hive may be a very good alternative for running updates on data.
>
> Regards,
> Gourav
Re: Triggering SQL on AWS S3 via Apache Spark
Also try to read about SCD (slowly changing dimensions), and note that Hive may be a very good alternative for running updates on data.

Regards,
Gourav

On Wed, 24 Oct 2018, 14:53, <omer.ozsaka...@sony.com> wrote:
> Thank you very much 😊
Re: Triggering sql on AWS S3 via Apache Spark
Thank you very much 😊

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
Cc: Spark Forum <user@spark.apache.org>
Subject: Re: Triggering sql on AWS S3 via Apache Spark

This is interesting: you asked and then answered the questions (almost) as well.

Regards,
Gourav
Re: Triggering sql on AWS S3 via Apache Spark
This is interesting: you asked and then answered the questions (almost) as well.

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:
> Hi guys,
> We are using Apache Spark on a local machine. I need to implement the scenario below.
Re: Triggering sql on AWS S3 via Apache Spark
Why not directly access the S3 file from Spark? You need to configure the IAM roles so that the machine running the S3 code is allowed to access the bucket.

On 24.10.2018 at 06:40, Divya Gehlot wrote:
> Hi Omer,
> Here are a couple of solutions which you can implement for your use case:
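For readers landing on this thread: a minimal PySpark sketch of the direct-access approach described above, assuming the s3a connector with access keys (on EC2 an IAM role is picked up automatically; on-premise you would typically supply credentials). The bucket name is taken from the question; everything else here is an illustrative assumption, and the hadoop-aws/aws-sdk jars matching your Hadoop version must be on the classpath.

from pyspark.sql import SparkSession

# Sketch: read a TSV straight from S3 over s3a. Credentials and paths are
# illustrative assumptions, not a definitive setup.
spark = (SparkSession.builder
         .appName("s3-direct-read")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .getOrCreate())

customers = (spark.read
             .option("sep", "\t")
             .option("header", "true")
             .csv("s3a://customer.history.data/"))
customers.show(5)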
Re: Triggering sql on AWS S3 via Apache Spark
Hi Omer,

Here are a couple of solutions which you can implement for your use case:

Option 1: you can mount the S3 bucket as a local file system. Here are the details: https://cloud.netapp.com/blog/amazon-s3-as-a-file-system

Option 2: you can use AWS Glue for your use case. Here are the details: https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

Option 3: store the file in the local file system and later push it to the S3 bucket. Here are the details: https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket

Thanks,
Divya
Triggering sql on AWS S3 via Apache Spark
Hi guys,

We are using Apache Spark on a local machine. I need to implement the scenario below.

In the initial load:
1. The CRM application will send a file to a folder on the local server. This file contains customer information for all customers. The file name is customer.tsv; it contains customerid, country, birty_month, activation_date, etc.
2. I need to read the contents of customer.tsv.
3. I will add current-timestamp info to the file.
4. I will transfer customer.tsv to the S3 bucket customer.history.data.

In the daily loads:
1. The CRM application will send a new file, daily_customer.tsv, which contains the updated/deleted/inserted customer information. It contains customerid, cdc_field, country, birty_month, activation_date, etc. The cdc field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.
2. I need to read the contents of daily_customer.tsv.
3. I will add current-timestamp info to the file.
4. I will transfer daily_customer.tsv to the S3 bucket customer.daily.data.
5. I need to merge the two buckets, customer.history.data and customer.daily.data (see the sketch after this message):
   - Both buckets have timestamp fields, so I need to query all records whose timestamp is the latest timestamp.
   - I can use row_number() over (partition by customer_id order by timestamp_field desc) as version_number.
   - Then I can put the records whose version is one into the final bucket: customer.dimension.data.

I am running Spark on premise.
- Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on a local Spark cluster?
- Is this approach efficient? Will the queries transfer all the historical data from AWS S3 to the local cluster?
- How can I implement this scenario in a more effective way, e.g. just transferring daily data to AWS S3 and then running the queries on AWS?
- For instance, Athena can query on AWS, but it is just a query engine. As far as I know, I cannot call it by using an SDK and cannot write the results to a bucket/folder.

Thanks in advance,
Ömer
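A minimal PySpark sketch of the merge step in point 5 above, assuming both buckets hold TSV files with a load timestamp column named load_ts (an illustrative name; substitute your own) and that filtering out Customer-is-Deleted records via cdc_field happens separately:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-merge").getOrCreate()

history = spark.read.option("sep", "\t").option("header", "true").csv("s3a://customer.history.data/")
daily = spark.read.option("sep", "\t").option("header", "true").csv("s3a://customer.daily.data/")

# Keep only the newest record per customer across both sources.
# unionByName is Spark 2.3+; plain union() works if columns already align.
merged = history.unionByName(daily.drop("cdc_field"))
w = Window.partitionBy("customerid").orderBy(F.col("load_ts").desc())
latest = (merged
          .withColumn("version_number", F.row_number().over(w))
          .where(F.col("version_number") == 1)
          .drop("version_number"))

latest.write.mode("overwrite").option("sep", "\t").csv("s3a://customer.dimension.data/")

Note that running this on an on-premise cluster does pull the full history from S3 on every run, which is the efficiency concern raised in the question.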
Triangle Apache Spark Meetup
Hi,

Just a small plug for the Triangle Apache Spark Meetup (TASM), which covers Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back in July 2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/

Can you add our meetup to http://spark.apache.org/community.html ?

jg
[ANNOUNCE] Announcing Apache Spark 2.3.2
We are happy to announce the availability of Spark 2.3.2!

Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release.

To download Spark 2.3.2, head over to the download page: http://spark.apache.org/downloads.html

To view the release notes: https://spark.apache.org/releases/spark-release-2-3-2.html

We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you.

Best regards,
Saisai
Apache Spark and Airflow connection
I have a Docker-based cluster. In my cluster, I try to schedule Spark jobs by using Airflow. Airflow and Spark are running separately, in different containers. However, I cannot run a Spark job using Airflow. Below is my Airflow script:

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 7, 31)
}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    dag=dag)

I also configured spark_default in the Airflow UI: [image: Screenshot from 2018-09-24 12-00-46.png]

However, it produces the following error:

[Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

I think Airflow tries to run the Spark job on its own. How can I configure it so that it runs the Spark code on the Spark master?

--
Uğur Sopaoğlu
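This error usually means the spark-submit binary is not available inside the Airflow container: SparkSubmitOperator shells out to spark-submit on the Airflow host and only points --master at the remote cluster via the connection. A hedged sketch under that assumption (host names and paths are illustrative):

# Assumption: a Spark distribution is installed in the Airflow image so that
# 'spark-submit' resolves on PATH (e.g. unpack Spark under /opt/spark and add
# /opt/spark/bin to PATH in the Dockerfile). The 'spark_default' connection
# then points at the master, e.g. host spark://spark-master, port 7077.
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG('spark_example_new',
          start_date=datetime(2018, 7, 31),
          schedule_interval="@once")

operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',   # resolves to spark://spark-master:7077 (assumed)
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    dag=dag)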
Padova Apache Spark Meetup
Hello,

we are creating a new meetup of enthusiastic Apache Spark users in Italy, at Padova: https://www.meetup.com/Padova-Apache-Spark-Meetup/

Is it possible to add the meetup link to the web page https://spark.apache.org/community.html ? Moreover, is it possible to announce future events on this mailing list?

Kind Regards

Matteo Durighetto
e-mail: m.durighe...@miriade.it
supporto kandula:
database: support...@miriade.it
business intelligence: support...@miriade.it
infrastructure: supp...@miriade.it
M I R I A D E - P L A Y T H E C H A N G E
Via Castelletto 11, 36016 Thiene VI
Tel. 0445030111 - Fax 0445030100
Website: http://www.miriade.it/
CVE-2018-11770: Apache Spark standalone master, Mesos REST APIs not controlled by authentication
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
Spark versions from 1.3.0, running standalone master with REST API enabled, or running Mesos master with cluster mode enabled

Description:
From version 1.3.0 onward, Spark's standalone master exposes a REST API for job submission, in addition to the submission mechanism used by spark-submit. In standalone, the config property 'spark.authenticate.secret' establishes a shared secret for authenticating requests to submit jobs via spark-submit. However, the REST API does not use this or any other authentication mechanism, and this is not adequately documented. In this case, a user would be able to run a driver program without authenticating, but not launch executors, using the REST API. This REST API is also used by Mesos, when set up to run in cluster mode (i.e., when also running MesosClusterDispatcher), for job submission.

Future versions of Spark will improve documentation on these points, and prohibit setting 'spark.authenticate.secret' when running the REST APIs, to make this clear. Future versions will also disable the REST API by default in the standalone master by changing the default value of 'spark.master.rest.enabled' to 'false'.

Mitigation:
For standalone masters, disable the REST API by setting 'spark.master.rest.enabled' to 'false' if it is unused, and/or ensure that all network access to the REST API (port 6066 by default) is restricted to hosts that are trusted to submit jobs. Mesos users can stop the MesosClusterDispatcher, though that will prevent them from running jobs in cluster mode. Alternatively, they can ensure access to the MesosRestSubmissionServer (port 7077 by default) is restricted to trusted hosts.

Credit:
Imran Rashid, Cloudera
Fengwei Zhang, Alibaba Cloud Security Team

Reference:
https://spark.apache.org/security.html
Data quality measurement for streaming data with apache spark
Hello,

I have a very general question about Apache Spark. I want to know if it is possible (and where to start, if it is) to implement a data-quality measurement prototype for streaming data using Apache Spark. Say I want to work on timeliness or completeness as data-quality metrics: has similar work already been done using Spark? Are there other frameworks better designed for this use case?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
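It is possible with Structured Streaming; a sketch of the idea under stated assumptions: compute per-window metrics such as completeness (null counts) and timeliness (event-time lag) with an ordinary streaming aggregation. The built-in rate source stands in for a real stream, and all column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-metrics").getOrCreate()

events = (spark.readStream.format("rate").load()      # stand-in source
          .withColumnRenamed("timestamp", "event_time"))

metrics = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"))
    .agg(F.count("*").alias("rows"),
         # completeness: how many records are missing the value
         F.sum(F.when(F.col("value").isNull(), 1).otherwise(0)).alias("null_values"),
         # timeliness: worst observed lag between arrival and event time
         F.max(F.unix_timestamp(F.current_timestamp())
               - F.unix_timestamp("event_time")).alias("max_lag_sec")))

query = metrics.writeStream.outputMode("update").format("console").start()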
Apache Spark Cluster
We are trying to create a cluster consisting of 4 machines, to be used by multiple users. How can we configure it so that a user can submit jobs from a personal computer, and is there any free tool you can suggest to support this procedure?

--
Uğur Sopaoğlu
CVE-2018-8024 Apache Spark XSS vulnerability in UI
Severity: Medium Vendor: The Apache Software Foundation Versions Affected: Spark versions through 2.1.2 Spark 2.2.0 through 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's possible for a malicious user to construct a URL pointing to a Spark cluster's UI's job and stage info pages, and if a user can be tricked into accessing the URL, can be used to cause script to execute and expose information from the user's view of the Spark UI. While some browsers like recent versions of Chrome and Safari are able to block this type of attack, current versions of Firefox (and possibly others) do not. Mitigation: 1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer 2.2.x users should upgrade to 2.2.2 or newer 2.3.x users should upgrade to 2.3.1 or newer Credit: Spencer Gietzen, Rhino Security Labs References: https://spark.apache.org/security.html
CVE-2018-1334 Apache Spark local privilege escalation vulnerability
Severity: High Vendor: The Apache Software Foundation Versions affected: Spark versions through 2.1.2 Spark 2.2.0 to 2.2.1 Spark 2.3.0 Description: In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when using PySpark or SparkR, it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. Mitigation: 1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer 2.2.x users should upgrade to 2.2.2 or newer 2.3.x users should upgrade to 2.3.1 or newer Otherwise, affected users should avoid using PySpark and SparkR in multi-user environments. Credit: Nehmé Tohmé, Cloudera, Inc. References: https://spark.apache.org/security.html
[ANNOUNCE] Apache Spark 2.2.2
We are happy to announce the availability of Spark 2.2.2! Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2 maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade to this stable release. The release notes are available at http://spark.apache.org/releases/spark-release-2-2-2.html To download Apache Spark 2.2.2 visit http://spark.apache.org/downloads.html. This version of Spark is also available on Maven and PyPI. We would like to acknowledge all community members for contributing patches to this release.
[ANNOUNCE] Apache Spark 2.1.3
We are happy to announce the availability of Spark 2.1.3!

Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release.

The release notes are available at http://spark.apache.org/releases/spark-release-2-1-3.html

To download Apache Spark 2.1.3 visit http://spark.apache.org/downloads.html. This version of Spark is also available on Maven and PyPI.

We would like to acknowledge all community members for contributing patches to this release. Special thanks to Marcelo Vanzin for making the release; I'm just handling the last few details this time.
Apache Spark use case: correlate data strings from file
Hi,

I'm new to Spark and I'm trying to understand whether it can fit my use case. I have the following scenario. I have a file (it can be a log file, .txt, .csv, .xml or .json; I can produce the data in whatever format I prefer) with some data, e.g.:

Event “X”, City “Y”, Zone “Z”

with different events, cities and zones. This data can be represented by a string (like the one I wrote) in a .txt, or by XML, CSV, or JSON, as I wish. I can also send this data through a TCP socket, if I need to.

What I really want to do is to correlate each single entry with other similar entries by declaring rules. For example, I want to declare some rules on the data flow: if I received event X1 and event X2 in the same city and same zone, I'll want to do something (execute a .bat script, write a log file, etc.). The same thing if I received the same string multiple times, or whatever rule I want to produce with these data strings.

I'm trying to understand if Apache Spark can fit my use case. The only input data will be these strings from this file. Can I correlate these events, and how? Is there a GUI to do it? Any hints and advice will be appreciated.

Best regards,
Simone

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
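Spark can express such rules with Structured Streaming, though there is no built-in GUI. A sketch under stated assumptions: CSV lines of the form "event,city,zone" are dropped into a watched folder, and the rule "X1 and X2 seen in the same city and zone within 5 minutes" becomes a windowed aggregation. All paths and names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-rules").getOrCreate()

lines = (spark.readStream
         .schema("event STRING, city STRING, zone STRING")   # DDL-string schema, Spark 2.3+
         .csv("/data/events"))

matches = (lines
    .where(F.col("event").isin("X1", "X2"))
    .withColumn("ts", F.current_timestamp())                 # arrival time as event time
    .withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "5 minutes"), "city", "zone")
    .agg(F.collect_set("event").alias("events"))
    .where(F.size("events") == 2))                           # both X1 and X2 present

# A custom sink (e.g. a foreach writer) could run the .bat script on matches;
# console output is shown here for simplicity.
query = matches.writeStream.outputMode("update").format("console").start()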
[ANNOUNCE] Announcing Apache Spark 2.3.1
We are happy to announce the availability of Spark 2.3.1! Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3 maintenance branch of Spark. We strongly recommend all 2.3.x users to upgrade to this stable release. To download Spark 2.3.1, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: https://spark.apache.org/releases/spark-release-2-3-1.html We would like to acknowledge all community members for contributing to this release. This release would not have been possible without you. -- Marcelo - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint
If you are using the Kafka direct connect API, it might be committing offsets back to Kafka itself.

On Thu, 7 June 2018 at 4:10, licl wrote:

> I met the same issue and I have tried to delete the checkpoint dir before the job,
> but Spark seems to read the correct offset even after the checkpoint dir is deleted.
> I don't know how Spark does this without the checkpoint's metadata.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint
I met the same issue and I have tried to delete the checkpoint dir before the job, but Spark seems to read the correct offset even after the checkpoint dir is deleted. I don't know how Spark does this without the checkpoint's metadata.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Apache Spark Installation error
You probably need "spark-shell" to be recognized as a command in your environment (i.e., available on your PATH). Maybe try "sudo ln -s /path/to/spark-shell /usr/bin/spark-shell". Have you tried "./spark-shell" from the current path to see if it works?

Thank You,

Irving Duran

On Thu, May 31, 2018 at 9:00 AM Remil Mohanan wrote:

> Hi there,
>
> I am not able to execute the spark-shell command. Can you please help.
>
> Thanks
>
> Remil
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apache Spark is not working as expected
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ spark-shell spark-shell: command not found hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ Spark.odt <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9314/Spark.odt> -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apache spark on windows without shortnames enabled
Hi,

We use Apache Spark 2.2.0 in our stack. Our software, like other software, is installed by default under "C:\Program Files\". We have a restriction that we cannot ask our customers to enable short names on their machines. From our experience, Spark does not handle absolute paths containing whitespace well, either when calling spark-class2.cmd from the command line or in the paths inside spark-env.cmd.

We have tried the following:
1. Using double quotes around the path.
2. Escaping the whitespace.
3. Using relative paths.

None of these has been successful in bringing up Spark. How do you recommend handling this?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apache spark -2.1.0 question in Spark SQL
Please help me with the error below, or suggest a different approach to the data manipulation.

Error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. not enough arguments for method flatMap: (implicit evidence$8: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] unspecified value parameter evidence$8.

Code:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

def getFlattenDataframe(someDataframe: DataFrame, spark: SparkSession): DataFrame = {
  val mymanualschema = new StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", StringType, true),
    StructField("field3", StringType, true),
    StructField("field4", IntegerType, true),
    StructField("field5", DoubleType, true)))
  // This is the line the compiler rejects: flatMap on a DataFrame needs an
  // implicit Encoder[Row], which spark.implicits._ does not provide:
  // val flattenRdd = someDataframe.flatMap { curRow: Row => getRows(curRow) }
  // Going through the RDD API avoids the need for an encoder:
  val flattenRdd = someDataframe.rdd.flatMap { curRow: Row => getRows(curRow) }
  spark.createDataFrame(flattenRdd, mymanualschema)
}

def getRows(curRow: Row): Array[Row] = {
  val somefield = curRow.getAs[String]("field1")
  // ... some manipulation happens here, building an array of rows ...
  res
}

Could someone please help me with what is causing the issue here? I have tested that importing spark.implicits._ does not work. How can I fix this error, or else please help me with a different approach.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apache Spark - Structured Streaming State Management With Watermark
Hi:

I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (keyed by user id), which stores the state of each user and emits events every minute until an expire event is received.
2. An aggregation (group-by count).

I am using output mode Update. I have a few questions:
1. If I don't use a watermark at all: (a) is the state for flatMapGroupsWithState stored forever? (b) is the state for the group-by count stored forever?
2. Is the watermark applicable only to cleaning up group-by aggregates?
3. Can we use the watermark to manage state in flatMapGroupsWithState? If so, how?
4. Can the watermark be used for other state cleanup? Are there any examples of those?

Thanks
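For the aggregation half of this pipeline, the documented pattern is that a watermark declared before the groupBy lets Spark eventually drop old window state; without one, group-by state grows without bound in update mode. For flatMapGroupsWithState itself, cleanup is driven by the timeout passed to it (GroupStateTimeout); with an event-time timeout the watermark supplies the clock, which is the usual way to let the watermark manage that state. A minimal sketch of the watermarked aggregation, with the rate source standing in and illustrative column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("session-counts").getOrCreate()
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time")
          .withColumn("user_id", F.col("value") % 100))

counts = (events
    .withWatermark("event_time", "30 minutes")    # bounds how long window state is kept
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/chk-session-counts")
         .start())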
Apache Spark - Structured Streaming StreamExecution Stats Description
Hi:

I am using Spark Structured Streaming 2.2.1 with flatMapGroupsWithState and a groupBy count operator. In the StreamExecution logs I see two entries for stateOperators:

"stateOperators" : [ {
  "numRowsTotal" : 1617339,
  "numRowsUpdated" : 9647
}, {
  "numRowsTotal" : 1326355,
  "numRowsUpdated" : 1398672
} ],

My questions are:
1. Is there a way to figure out which stats are for flatMapGroupsWithState and which are for the groupBy count? In my case I can guess based on my data, but I want to be definitive about it.
2. For the second entry, how can numRowsTotal (1326355) be less than numRowsUpdated (1398672)?

If there is documentation I can use to understand the debug output, please let me know.

Thanks
Apache Spark Structured Streaming - How to keep executor alive.
Hi:

I am working with Spark Structured Streaming (2.2.1) and Kafka, and I want 100 executors to stay alive. I set spark.executor.instances to 100. The process starts running with 100 executors, but after some time only a few remain, which causes a backlog of events from Kafka. I thought I saw a setting to keep the executors from being killed, but I am not able to find that configuration in the Spark docs. If anyone knows that setting, please let me know.

Thanks
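One likely culprit (an assumption; check your cluster settings) is dynamic allocation, which releases idle executors and is enabled by default on some distributions such as EMR. A sketch of pinning a fixed executor count:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fixed-executors")
         .config("spark.dynamicAllocation.enabled", "false")  # keep executors for the app's lifetime
         .config("spark.executor.instances", "100")
         .getOrCreate())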
Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint
Hi:

I am working on a realtime application using Spark Structured Streaming (v2.2.1). The application reads data from Kafka, and if there is a failure I would like to ignore the checkpoint. Is there any configuration to just read from the last Kafka offset after a failure and ignore any offset checkpoints? Also, I believe the checkpoint also saves state and will continue the aggregations after recovery. Is there any way to ignore checkpointed state? And is there a way to selectively checkpoint only state, or only offsets?

Thanks
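As far as the Kafka source options go, startingOffsets="latest" is consulted only when a query starts with no existing checkpoint, so "ignoring" the checkpoint in practice means pointing the restarted query at a fresh (or cleared) checkpoint location, which also discards any saved aggregation state; to my knowledge there is no supported option to keep offsets but drop state, or vice versa. A sketch with illustrative names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-latest").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("startingOffsets", "latest")   # applies only on the first start of a query
          .load())

query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/chk-run-2")  # new dir = start fresh
         .start())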
Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint directory that you provide. And when you start the query again with the same directory it will just pick up where it left off. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing On Thu, Mar 22, 2018 at 8:06 PM, M Singh wrote: > Hi: > > I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the > last few days, after running the application for 30-60 minutes get > exception from Kafka Consumer included below. > > The structured streaming application is processing 1 minute worth of data > from kafka topic. So I've tried increasing request.timeout.ms from 4 > seconds default to 45000 seconds and receive.buffer.bytes to 1mb but still > get the same exception. > > Is there any spark/kafka configuration that can save the offset and retry > it next time rather than throwing an exception and killing the application. > > I've tried googling but have not found substantial > solution/recommendation. If anyone has any suggestions or a different > version etc, please let me know. > > Thanks > > Here is the exception stack trace. > > java.util.concurrent.TimeoutException: Cannot fetch record for offset > in 12 milliseconds > at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$ > apache$spark$sql$kafka010$CachedKafkaConsumer$$ > fetchData(CachedKafkaConsumer.scala:219) > at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply( > CachedKafkaConsumer.scala:117) > at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply( > CachedKafkaConsumer.scala:106) > at org.apache.spark.util.UninterruptibleThread.runUninterruptibly( > UninterruptibleThread.scala:85) > at org.apache.spark.sql.kafka010.CachedKafkaConsumer. > runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) > at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get( > CachedKafkaConsumer.scala:106) > at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1. > getNext(KafkaSourceRDD.scala:157) > at >
Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception
Hi:

I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR, and for the last few days, after running the application for 30-60 minutes, I get the exception from the Kafka consumer included below.

The structured streaming application is processing 1 minute's worth of data from a Kafka topic. So I've tried increasing request.timeout.ms from the 4 seconds default to 45000 seconds, and receive.buffer.bytes to 1 MB, but I still get the same exception.

Is there any Spark/Kafka configuration that can save the offset and retry it next time, rather than throwing an exception and killing the application?

I've tried googling but have not found a substantial solution/recommendation. If anyone has any suggestions, or a different version, etc., please let me know.

Thanks

Here is the exception stack trace:

java.util.concurrent.TimeoutException: Cannot fetch record for offset in 12 milliseconds
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:219)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
  at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
  at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
  at
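Two Kafka source options sometimes come up in this kind of tuning: kafkaConsumer.pollTimeoutMs raises the executor-side fetch timeout behind the reported exception, and failOnDataLoss governs a related failure mode (offsets no longer available) rather than the timeout itself. A sketch, with values that are illustrative rather than tuned recommendations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-tuning").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .option("kafkaConsumer.pollTimeoutMs", "180000")  # executor-side poll timeout
          .option("failOnDataLoss", "false")                # skip rather than fail on lost offsets
          .load())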
Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer
Hi Vijay:

I am using spark-shell because I am still prototyping the steps involved.

Regarding executors: I have 280 executors, and the UI only shows a few straggler tasks on each trigger. The UI does not show too much time spent on GC. I suspect the delay is because of getting data from Kafka. The number of stragglers is generally fewer than 5 out of 240, but is sometimes higher. I will try to dig more into it and see if changing partitions etc. helps, but I was wondering if anyone else has encountered similar stragglers holding up the processing of a window trigger.

Thanks

On Friday, February 23, 2018 6:07 PM, vijay.bvp wrote:

> Instead of spark-shell, have you tried running it as a job?
Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer
Instead of spark-shell, have you tried running it as a job? How many executors and cores? Can you share the RDD graph and event timeline from the UI? Did you find which of the tasks take more time, and was there any GC? Please look at the UI if you haven't already; it can provide a lot of information.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apache Spark - Structured Streaming reading from Kafka some tasks take much longer
Hi:

I am working with Spark Structured Streaming (2.2.1) reading data from Kafka (0.11). I need to aggregate data ingested every minute, and I am using spark-shell at the moment. The message ingestion rate is approx 500k/second. During some trigger intervals (1 minute), especially when the streaming process is started, all tasks finish in 20 seconds, but during some triggers it takes 90 seconds.

I have tried reducing the number of partitions (to approx 100 from 300) to reduce the consumers for Kafka, but that has not helped. I also tried setting kafkaConsumer.pollTimeoutMs to 30 seconds, but then I see a lot of java.util.concurrent.TimeoutException: Cannot fetch record for offset. So I wanted to see if anyone has any thoughts/recommendations.

Thanks
Re: Apache Spark - Structured Streaming Query Status - field descriptions
Thanks Richard. I am hoping that the Spark team will at some point provide more detailed documentation.

On Sunday, February 11, 2018 2:17 AM, Richard Qiao wrote:

> Can't find a good source of documents, but the source code "org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful to answer some of them.
Re: Apache Spark - Structured Streaming Query Status - field descriptions
Can't find a good source of documents, but the source code "org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful to answer some of them.

For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec.

This explains why the two rows-per-second figures differ: the same record count is divided by two different durations (the time span the input covers versus the time spent processing it), so processedRowsPerSecond exceeds inputRowsPerSecond whenever processing is faster than the trigger interval.

> On Feb 10, 2018, at 8:42 PM, M Singh wrote:
>
> Hi:
>
> I am working with spark 2.2.0 and am looking at the query status console output.
Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time
Just checking if anyone has any pointers for dynamically updating query state in structured streaming.

Thanks

On Thursday, February 8, 2018 2:58 PM, M Singh wrote:

> Hi Spark Experts:
>
> I am trying to use a stateful udf with spark structured streaming that needs to update the state periodically.
Apache Spark - Structured Streaming Query Status - field descriptions
Hi:

I am working with Spark 2.2.0 and am looking at the query status console output.

My application reads from Kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts. The output is sent to the console sink. I see the following output (with my questions inline). Please let me know where I can find detailed descriptions of the query status fields for Spark Structured Streaming.

StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,
  "processedRowsPerSecond" : 583.9563548191554,  // Why is processedRowsPerSecond greater than inputRowsPerSecond? Does this include shuffling/grouping?
  "durationMs" : {
    "addBatch" : 9765,         // Is this the time taken to send output to all console output streams?
    "getBatch" : 3,            // Is this the time taken to get the batch from Kafka?
    "getOffset" : 3,           // Is this the time for getting offsets from Kafka?
    "queryPlanning" : 89,      // The value of this field changes with different triggers, but the query is not changing, so why does this change?
    "triggerExecution" : 9898, // Is this the total time for this trigger?
    "walCommit" : 35           // Is this for checkpointing?
  },
  "stateOperators" : [ {       // What are the two state operators? I am assuming one is flatMapGroupsWithState (the first one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,        // Is this the group-by state operator? If so, I have two group-bys, so why do I see only one?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : { "xyz" : { "2" : 9183, "1" : 9184, "3" : 9184, "0" : 9183 } },
    "endOffset" : { "xyz" : { "2" : 10628, "1" : 10629, "3" : 10629, "0" : 10628 } },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}
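The same progress JSON is also exposed programmatically, which can make it easier to line fields up with the query that produced them; the stateOperators entries appear one per stateful operator, and (as far as the code suggests) in the order the operators appear in the physical plan, so checking query.explain() helps tell them apart. A small sketch using the rate source:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("progress-demo").getOrCreate()
df = spark.readStream.format("rate").load()
query = (df.writeStream.format("console")
         .option("checkpointLocation", "/tmp/chk-progress")
         .start())

time.sleep(15)                       # let a trigger or two complete
print(query.lastProgress)            # most recent trigger, as a dict
for p in query.recentProgress:       # recent history of the same JSON
    print(p["durationMs"], p.get("stateOperators"))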
Apache Spark - Structured Streaming - Updating UDF state dynamically at run time
Hi Spark Experts:

I am trying to use a stateful UDF with Spark Structured Streaming that needs to update its state periodically. Here is the scenario:

1. I have a UDF with a variable with a default value (e.g. 1). This value is applied to a column (e.g. subtract the variable from the column value).
2. The variable is to be updated periodically and asynchronously (e.g. by reading a file every 5 minutes), and new rows will have the new value applied to the column value.

Spark natively supports broadcast variables, but I could not find a way to update a broadcast variable dynamically, or to rebroadcast it, so that the UDF's internal state can be updated while the Structured Streaming application is running. I could read the variable from the file on each invocation of the UDF, but that will not scale, since each invocation would open/read/close the file.

Please let me know if there is any documentation/example that supports this scenario.

Thanks
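One workaround (a sketch, not an official Spark mechanism): let the UDF's closure cache the external value per worker process and re-read it at most once per TTL, so the file is opened every few minutes rather than on every row. Each Python worker refreshes independently; the path and interval are illustrative assumptions.

import time
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

_cache = {"value": 1.0, "loaded_at": 0.0}   # default value 1, as in the question
TTL_SEC = 300                               # refresh at most every 5 minutes

def _current_offset():
    now = time.time()
    if now - _cache["loaded_at"] > TTL_SEC:
        with open("/etc/myapp/offset.txt") as f:   # assumed external source
            _cache["value"] = float(f.read().strip())
        _cache["loaded_at"] = now
    return _cache["value"]

# Subtract the (periodically refreshed) value from a column;
# usage: df.withColumn("adjusted", subtract_offset("some_column"))
subtract_offset = F.udf(lambda x: x - _current_offset(), DoubleType())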
Free access to Index Conf for Apache Spark community attendees
Free access to Index Conf for Apache Spark session attendees. For info go to: https://www.meetup.com/SF-Big-Analytic

IBM is hosting a developer conference; essentially the conference is 'By Developers, for Developers', based on open technologies. It will be held Feb 20 - 22nd in Moscone West. http://indexconf.com

We have a phenomenal list of speakers for the Spark community, as well as participation by the TensorFlow, R, and other communities. https://developer.ibm.com/indexconf/communities/

Register using this promo code to get your free access. Usage instructions:
- Follow this link: https://www.ibm.com/events/wwe/indexconf/indexconf18.nsf/Registration.xsp?open
- Select "Attendee" as your Registration Type
- Enter your Registration Promotion Code: IND18FULL

Restrictions: the promotion code expires February 12th, 11:59 PM Pacific. Government Owned Entities (GOEs) are not eligible. If you have previously registered, please reach out to Jeff Borek (jbo...@us.ibm.com) to take advantage of the new discount code.

Detailed agenda for Spark Community Day, Feb 20th:

2:00 - 2:30 | What the community does in the coming release, Spark 2.3. Sean Li, Apache Spark committer & PMC member, Databricks. There are many great features added to Apache Spark. This talk provides a preview of the new features and updates in the coming release, Spark 2.3.

2:30 - 3:00 | Data Warehouse Features in Spark SQL. Ioana Ursu, IBM, lead contributor on Spark SQL. This talk covers advanced Spark SQL features for data warehousing, such as star-schema optimizations and informational-constraints support. A star schema consists of a fact table referencing a number of dimension tables; fact and dimension tables are in a primary key - foreign key relationship. An informational or statistical constraint can be used by Spark to improve query performance.

3:00 - 3:30 | Building an Enterprise/Cloud Analytics Platform with Jupyter Notebooks and Apache Spark. Frederick Reiss, chief architect of the IBM Spark Technology Center. Data scientists are becoming a necessity for every company in today's data-centric world, and with them comes the requirement to provide a flexible, interactive analytics platform that exposes notebook services at web scale. In this session we will describe our experience and best practices building the platform, in particular how we built the Enterprise Gateway that enables all the notebooks to share the Spark cluster's computational resources.

3:45 - 4:15 | The State of Spark MLlib and New Scalability Features in 2.3. Nick Pentreath, Spark committer & PMC member. This talk will give an overview of Spark's machine learning library, MLlib. The new 2.3 release of Spark brings some exciting scalability enhancements to MLlib, which we will explore in depth, including parallel cross-validation and performance improvements for larger-scale datasets through adding multi-column support to the most widely used Spark transformers.

4:15 - 5:15 | Spark and AI. Nick Pentreath & Fred Reiss. This session will be an open discussion of the role of Spark within the AI landscape, and what the future holds for AI / deep learning on Spark. In recent years specialized systems (such as TensorFlow, Caffe, PyTorch and MXNet) have been dominant in the domain of AI and deep learning. While there are a few deep learning frameworks that are Spark-specific, often these frameworks are separate from Spark, and the ease of integration and feature set exposed varies considerably.
5:15 - 6:00 | HSpark: enabling Spark SQL queries on NoSQL HBase tables. Bo Meng, IBM Spark contributor; Yan Zhou, IBM Hadoop architect. HBase is a NoSQL data source which allows flexible data storage and access mechanisms. Leveraging Spark's highly scalable framework and programming interface, we added SQL capability to HBase and an easy-to-use interface for data scientists and traditional analysts. We will discuss how we implemented HSpark by leveraging the Spark SQL parser, mapping different data types, pushing down predicates to HBase, and improving query performance.

- Xin Wu | @xwu0226 Spark Technology Center

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Hi Jacek:

Thanks for your response. I am just trying to understand the fundamentals of watermarking and how it behaves in aggregation vs. non-aggregation scenarios.

On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski wrote:

> Hi,
> What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding at least.
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Hi,

What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding at least.

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Feb 5, 2018 at 8:11 PM, M Singh wrote:

> Just checking if anyone has more details on how the watermark works in cases where the event time is earlier than the processing timestamp.
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Just checking if anyone has more details on how the watermark works in cases where the event time is earlier than the processing timestamp.

On Friday, February 2, 2018 8:47 AM, M Singh wrote:

Hi Vishnu/Jacek:

Thanks for your responses.

Jacek - At the moment, the current time for my use case is processing time.

Vishnu - The Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can dedup using the watermark. So I believe there are more use cases for the watermark, and that is what I am trying to find. I am hoping that TD can clarify or point me to the documentation.

Thanks

On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath wrote:

Hi Mans,

The watermark in Spark is used to decide when to clear the state, so if the event is delayed beyond the point where the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable for aggregation only. If you have only a map function and don't want to process a record, you could filter based on its event-time field, but I guess you will have to compare it with the processing time, since there is no API for the user to access the watermark.

-Vishnu

On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote:

Hi:

I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time.

Is the watermark API applicable to this scenario (i.e., filtering lagging records), or is it only applicable with aggregation? I could not get a clear understanding from the documentation, which only refers to its usage with aggregation.

Thanks

Mans
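On the original question in this thread: a watermark only bounds state for stateful operators, so for plainly dropping lagging records a filter against the processing clock works without any watermark, which matches Vishnu's suggestion. A sketch (rate source as stand-in; the 10-minute threshold and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lag-filter").getOrCreate()
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time"))

# Drop records whose event time lags processing time by more than 10 minutes.
fresh = events.where(
    F.col("event_time") >= F.current_timestamp() - F.expr("INTERVAL 10 MINUTES"))

query = fresh.writeStream.format("console").start()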
Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame
Hi TD: Just wondering if you have any insight for me or need more info. Thanks

On Thursday, February 1, 2018 7:43 AM, M Singh wrote:

Hi TD: Here is the updated code with the explain output and the full stack trace. Please let me know what the issue could be and what to look for in the explain output.

Updated code:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder.
      config("spark.sql.streaming.checkpointLocation", "./checkpointes").
      appName("StreamingTest").master("local[4]")

    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(
      Array(
        StructField("id", StringType, false),
        StructField("visit", StringType, false)
      ))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp().cast("long"))
    dataframe2.explain(true)
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}

Explain output:

== Parsed Logical Plan ==
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- AnalysisBarrier
   +- Project [id#0, visit#1]
      +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Analyzed Logical Plan ==
id: string, visit: string, cts: bigint
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- Project [id#0, visit#1]
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Optimized Logical Plan ==
Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Physical Plan ==
*(1) Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation FileSource[./data/], [id#0, visit#1]

Here is the exception:

18/02/01 07:39:52 ERROR MicroBatchExecution: Query [id = a0e573f0-e93b-48d9-989c-1aaa73539b58, runId = b5c618cb-30c7-4eff-8f09-ea1d064878ae] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:448)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:134)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
  at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
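One possible workaround worth trying (an unverified sketch, not a confirmed fix from this thread): the optimized plan above shows current_timestamp() folded into the literal 1517499591, so stamping each batch with a non-deterministic UDF may avoid that folding:

  import org.apache.spark.sql.functions.udf

  // Unverified sketch: a "processing time in seconds" UDF; asNondeterministic()
  // (available since Spark 2.3) keeps the optimizer from folding the call
  // into a constant.
  val nowSeconds = udf(() => System.currentTimeMillis / 1000L).asNondeterministic()
  val stamped = dataframeInput.withColumn("cts", nowSeconds())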
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Hi Vishnu/Jacek:

Thanks for your responses.

Jacek - At the moment, the current time for my use case is processing time.

Vishnu - The Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can dedup using the watermark. So I believe there are more use cases for the watermark, and that is what I am trying to find.

I am hoping that TD can clarify or point me to the documentation.

Thanks

On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath wrote:

Hi Mans,

The watermark in Spark is used to decide when to clear the state, so if an event is delayed beyond the point when the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable for aggregation only. If you are having only a map function and don't want to process late records, you could do a filter based on the EventTime field, but I guess you will have to compare it with the processing time since there is no API for the user to access the watermark.

-Vishnu

On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote:

Hi:

I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time.

Is the watermark API applicable to this scenario (i.e., filtering lagging records) or is it only applicable with aggregation? I could not get a clear understanding from the documentation, which only refers to its usage with aggregation.

Thanks

Mans
Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame
Hi TD: Here is the updated code with the explain output and the full stack trace. Please let me know what the issue could be and what to look for in the explain output.

Updated code:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder.
      config("spark.sql.streaming.checkpointLocation", "./checkpointes").
      appName("StreamingTest").master("local[4]")

    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(
      Array(
        StructField("id", StringType, false),
        StructField("visit", StringType, false)
      ))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp().cast("long"))
    dataframe2.explain(true)
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}

Explain output:

== Parsed Logical Plan ==
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- AnalysisBarrier
   +- Project [id#0, visit#1]
      +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Analyzed Logical Plan ==
id: string, visit: string, cts: bigint
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- Project [id#0, visit#1]
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Optimized Logical Plan ==
Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep -> , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Physical Plan ==
*(1) Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation FileSource[./data/], [id#0, visit#1]

Here is the exception:

18/02/01 07:39:52 ERROR MicroBatchExecution: Query [id = a0e573f0-e93b-48d9-989c-1aaa73539b58, runId = b5c618cb-30c7-4eff-8f09-ea1d064878ae] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:448)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:134)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
  at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfu
Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame
Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the plan output?

On Wed, Jan 31, 2018 at 3:35 PM, M Singh wrote:
> Hi Folks:
>
> I have to add a column to a structured streaming dataframe but when I do
> that (using select or withColumn) I get an exception. I can add a column
> to a non-streaming structured dataframe. I could not find any documentation
> on how to do this in the following doc
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
>
> I am using spark 2.4.0-SNAPSHOT
>
> Please let me know what I could be missing.
>
> Thanks for your help.
>
> (I am also attaching the source code for the structured streaming,
> structured non-streaming classes and input file with this email)
>
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
> to dataType on unresolved object, tree: 'cts
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
>   at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
>
> Here is the input file (in the ./data directory) - note tokens are
> separated by '\t'
>
> 1   v1
> 2   v1
> 2   v2
> 3   v3
> 3   v1
>
> Here is the code with the dataframe (non-streaming) which works:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> object StructuredTest {
>   def main(args: Array[String]): Unit = {
>     val sparkBuilder = SparkSession
>       .builder.
>       appName("StreamingTest").master("local[4]")
>
>     val spark = sparkBuilder.getOrCreate()
>
>     val schema = StructType(
>       Array(
>         StructField("id", StringType, false),
>         StructField("visit", StringType, false)
>       ))
>     var dataframe = spark.read.option("sep", "\t").schema(schema).csv("./data/")
>     var dataframe2 = dataframe.select(expr("*"), current_timestamp().as("cts"))
>     dataframe2.show(false)
>     spark.stop()
>   }
> }
>
> Output of the above code is:
>
> +---+-----+-----------------------+
> |id |visit|cts                    |
> +---+-----+-----------------------+
> |1  |v1   |2018-01-31 15:07:00.758|
> |2  |v1   |2018-01-31 15:07:00.758|
> |2  |v2   |2018-01-31 15:07:00.758|
> |3  |v3   |2018-01-31 15:07:00.758|
> |3  |v1   |2018-01-31 15:07:00.758|
> +---+-----+-----------------------+
>
> Here is the code with structured streaming which throws the exception:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.joda.time._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.streaming._
> import org.apache.log4j._
>
> object StreamingTest {
>   def main(args: Array[String]): Unit = {
>     val sparkBuilder = SparkSession
>       .builder.
>       config("spark.sql.streaming.checkpointLocation", "./checkpointes").
>       appName("StreamingTest").master("local[4]")
>
>     val spark = sparkBuilder.getOrCreate()
>
>     val schema = StructType(
>       Array(
>         StructField("id", StringType, false),
>         StructField("visit", StringType, false)
>       ))
>     var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
>     var dataframe2 = dataframeInput.select("*")
>     dataframe2 = dataframe2.withColumn("cts", current_timestamp())
>     val query = dataframe2.writeStream.option("truncate", "false").format("console").start
>     query.awaitTermination()
>   }
> }
>
> Here is the exception:
>
> 18/01/31 15:10:25 ERROR MicroBatchExecution: Query [id = 0fe655de-9096-4d69-b6a5-c593400d2eba, runId = 2394a402-dd52-49b4-854e-cb46684bf4d8] terminated with error
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
>   at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
>   at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
>
> I've also used snippets from
> (https://docs.databricks.com/spark/latest/structured-streaming/examples.html)
> but still get the same exception:
>
> Here is the code:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.joda.time._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.streaming._
> import org.apache.log4j._
>
> object StreamingTest {
>   def main(args: Array[String]): Unit = {
>     val sparkBuilder = SparkSession
>       .builder.
>       config("spark.sql.streaming.checkpointLocation", "./checkpointes").
>       appName("StreamingTest").master("local[4]")
>
>     val spark = sparkBuilder.getOrCreate
Apache Spark - Exception on adding column to Structured Streaming DataFrame
Hi Folks:

I have to add a column to a structured streaming dataframe but when I do that (using select or withColumn) I get an exception. I can add a column to a non-streaming structured dataframe. I could not find any documentation on how to do this in the following doc [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]

I am using spark 2.4.0-SNAPSHOT

Please let me know what I could be missing.

Thanks for your help.

(I am also attaching the source code for the structured streaming, structured non-streaming classes and input file with this email)

org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)

Here is the input file (in the ./data directory) - note tokens are separated by '\t'

1   v1
2   v1
2   v2
3   v3
3   v1

Here is the code with the dataframe (non-streaming) which works:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object StructuredTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder.
      appName("StreamingTest").master("local[4]")

    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(
      Array(
        StructField("id", StringType, false),
        StructField("visit", StringType, false)
      ))
    var dataframe = spark.read.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframe.select(expr("*"), current_timestamp().as("cts"))
    dataframe2.show(false)
    spark.stop()
  }
}

Output of the above code is:

+---+-----+-----------------------+
|id |visit|cts                    |
+---+-----+-----------------------+
|1  |v1   |2018-01-31 15:07:00.758|
|2  |v1   |2018-01-31 15:07:00.758|
|2  |v2   |2018-01-31 15:07:00.758|
|3  |v3   |2018-01-31 15:07:00.758|
|3  |v1   |2018-01-31 15:07:00.758|
+---+-----+-----------------------+

Here is the code with structured streaming which throws the exception:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder.
      config("spark.sql.streaming.checkpointLocation", "./checkpointes").
      appName("StreamingTest").master("local[4]")

    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(
      Array(
        StructField("id", StringType, false),
        StructField("visit", StringType, false)
      ))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp())
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}

Here is the exception:

18/01/31 15:10:25 ERROR MicroBatchExecution: Query [id = 0fe655de-9096-4d69-b6a5-c593400d2eba, runId = 2394a402-dd52-49b4-854e-cb46684bf4d8] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)

I've also used snippets from (https://docs.databricks.com/spark/latest/structured-streaming/examples.html) but still get the same exception:

Here is the code:

import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession
      .builder.
      config("spark.sql.streaming.checkpointLocation", "./checkpointes").
      appName("StreamingTest").master("local[4]")

    val spark = sparkBuilder.getOrCreate()

    val schema = StructType(
      Array(
        StructField("id", StringType, false),
        StructField("visit", StringType, false)
      ))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select(
      current_timestamp().cast("timestamp").alias("timestamp"),
      expr("*"))
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Hi Mans,

The watermark in Spark is used to decide when to clear the state, so if an event is delayed beyond the point when the state is cleared by Spark, it will be ignored. I recently wrote a blog post on this: http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable for aggregation only. If you are having only a map function and don't want to process late records, you could do a filter based on the EventTime field, but I guess you will have to compare it with the processing time since there is no API for the user to access the watermark.

-Vishnu

On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote:

> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark API applicable to this scenario (i.e., filtering lagging
> records) or is it only applicable with aggregation? I could not get a
> clear understanding from the documentation, which only refers to its usage
> with aggregation.
>
> Thanks
>
> Mans
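A rough sketch of that filter idea (the DataFrame `events`, its `eventTime` column, and the 10-minute threshold are all hypothetical). Since the watermark is not exposed to user code, the comparison is against processing time:

  import org.apache.spark.sql.functions._

  // Keep only rows whose event time is within 10 minutes of the current
  // processing time; unix_timestamp() returns the current time in seconds.
  val recent = events.filter(
    col("eventTime").cast("long") >= unix_timestamp() - 10 * 60)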
Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Hi, Nicolas.

Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901 (Feature parity for ORC with Parquet). For your questions, the following three are related.

1. spark.sql.orc.impl="native"
A new `native` ORC implementation (based on the latest ORC 1.4.1) is added; the old one is the `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
By default, the `native` ORC implementation uses the vectorized reader code path if possible. Please note that vectorization (Parquet/ORC) in Apache Spark is supported only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
Like Parquet, by default, Hive tables are converted into file-based data sources to use the vectorization technique.

Bests,
Dongjoon.

On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris wrote:
> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
> ?
>
> On 10 Jan 2018 at 20:14, Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now on, Spark can read ORC files faster
> > without a feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
> > reader
> > https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
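Putting the three settings above together, a short sketch (the load path and table name are hypothetical; in Spark 2.3 these can be set per session):

  spark.conf.set("spark.sql.orc.impl", "native")
  spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
  spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

  // Both read paths from the question can then use the vectorized reader
  // (for simple data types):
  val df1 = spark.read.format("orc").load("/path/to/orc")    // hypothetical path
  val df2 = spark.sql("SELECT * FROM my_orc_table_in_hive")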
Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Hi

Thanks for this work.

Will this affect both:
1) spark.read.format("orc").load("...")
2) spark.sql("select ... from my_orc_table_in_hive")
?

On 10 Jan 2018 at 20:14, Dongjoon Hyun wrote:
> Hi, All.
>
> Vectorized ORC Reader is now supported in Apache Spark 2.3.
>
> https://issues.apache.org/jira/browse/SPARK-16060
>
> It has been a long journey. From now on, Spark can read ORC files faster
> without a feature penalty.
>
> Thank you for all your support, especially Wenchen Fan.
>
> It's done by two commits.
>
> [SPARK-16060][SQL] Support Vectorized ORC Reader
> https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766
>
> [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
> reader
> https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1
>
> Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC`
> to `Native ORC Vectorized`.
>
> https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
>
> Thank you.
>
> Bests,
> Dongjoon.
Re: Apache Spark - Spark Structured Streaming - Watermark usage
Hi,

I'm curious how you would implement the requirement "by a certain amount of time" without a watermark. How would you know what's current and compute the lag? Let's forget about the watermark for a moment and see if it pops up as an inevitable feature :)

"I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time."

Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Jan 26, 2018 at 7:14 PM, M Singh wrote:
> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark API applicable to this scenario (i.e., filtering lagging
> records) or is it only applicable with aggregation? I could not get a
> clear understanding from the documentation, which only refers to its usage
> with aggregation.
>
> Thanks
>
> Mans
Apache Spark - Spark Structured Streaming - Watermark usage
Hi:

I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time.

Is the watermark API applicable to this scenario (i.e., filtering lagging records) or is it only applicable with aggregation? I could not get a clear understanding from the documentation, which only refers to its usage with aggregation.

Thanks

Mans
Re: Apache Spark - Custom structured streaming data source
Thanks TD. When is 2.3 scheduled for release?

On Thursday, January 25, 2018 11:32 PM, Tathagata Das wrote:

Hello Mans,

The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this point of time, it's hard to make any concrete suggestion. You can take a look at the classes DataSourceV2, DataReader, MicroBatchDataReader in the Spark source code, along with their implementations.

Hope this helps.

TD

On Jan 25, 2018 8:36 PM, "M Singh" wrote:

Hi:

I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved.

I've looked at some of the methods available in the SparkSession but these are internal to the sql package:

private[sql] def internalCreateDataFrame(
    catalystRows: RDD[InternalRow],
    schema: StructType,
    isStreaming: Boolean = false): DataFrame = {
  // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
  // schema differs from the existing schema on any field data type.
  val logicalPlan = LogicalRDD(
    schema.toAttributes,
    catalystRows,
    isStreaming = isStreaming)(self)
  Dataset.ofRows(self, logicalPlan)
}

Please let me know where I can find the appropriate API or documentation.

Thanks

Mans
Re: Apache Spark - Custom structured streaming data source
Hello Mans,

The streaming DataSource APIs are still evolving and are not public yet. Hence there is no official documentation. In fact, there is a new DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this point of time, it's hard to make any concrete suggestion. You can take a look at the classes DataSourceV2, DataReader, MicroBatchDataReader in the Spark source code, along with their implementations.

Hope this helps.

TD

On Jan 25, 2018 8:36 PM, "M Singh" wrote:

Hi:

I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved.

I've looked at some of the methods available in the SparkSession but these are internal to the sql package:

private[sql] def internalCreateDataFrame(
    catalystRows: RDD[InternalRow],
    schema: StructType,
    isStreaming: Boolean = false): DataFrame = {
  // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
  // schema differs from the existing schema on any field data type.
  val logicalPlan = LogicalRDD(
    schema.toAttributes,
    catalystRows,
    isStreaming = isStreaming)(self)
  Dataset.ofRows(self, logicalPlan)
}

Please let me know where I can find the appropriate API or documentation.

Thanks

Mans
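For reference, a rough sketch of a trivial batch source against the 2.3-era DataSourceV2 interfaces (ReadSupport, DataSourceReader, DataReaderFactory, DataReader). As TD notes, this API was still evolving, so the interface and method names below are an assumption tied to that snapshot of the API and may differ in other versions; treat this as illustrative only:

  import java.util.{List => JList}
  import scala.collection.JavaConverters._
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
  import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  // Toy batch source emitting the integers 0..9 from a single partition.
  class SimpleSource extends DataSourceV2 with ReadSupport {
    override def createReader(options: DataSourceOptions): DataSourceReader = new SimpleReader
  }

  class SimpleReader extends DataSourceReader {
    override def readSchema(): StructType =
      StructType(Seq(StructField("value", IntegerType, nullable = false)))
    override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
      Seq[DataReaderFactory[Row]](new SimpleReaderFactory).asJava
  }

  class SimpleReaderFactory extends DataReaderFactory[Row] {
    override def createDataReader(): DataReader[Row] = new DataReader[Row] {
      private var i = -1
      override def next(): Boolean = { i += 1; i < 10 }
      override def get(): Row = Row(i)
      override def close(): Unit = ()
    }
  }

  // Usage sketch: spark.read.format(classOf[SimpleSource].getName).load()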
Apache Spark - Custom structured streaming data source
Hi:

I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved.

I've looked at some of the methods available in the SparkSession but these are internal to the sql package:

private[sql] def internalCreateDataFrame(
    catalystRows: RDD[InternalRow],
    schema: StructType,
    isStreaming: Boolean = false): DataFrame = {
  // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
  // schema differs from the existing schema on any field data type.
  val logicalPlan = LogicalRDD(
    schema.toAttributes,
    catalystRows,
    isStreaming = isStreaming)(self)
  Dataset.ofRows(self, logicalPlan)
}

Please let me know where I can find the appropriate API or documentation.

Thanks

Mans
Re: good material to learn apache spark
Jacek Laskowski on this mailing list wrote a book which is available online. HTH

On Jan 18, 2018 6:16 AM, "Manuel Sopena Ballesteros" <manuel...@garvan.org.au> wrote:

> Dear Spark community,
>
> I would like to learn more about Apache Spark. I have a Hortonworks HDP
> platform and have run a few Spark jobs in a cluster, but now I need to
> know more in depth how Spark works.
>
> My main interest is the sysadmin and operational side of Spark and its
> ecosystem.
>
> Is there any material?
>
> Thank you very much
>
> Manuel Sopena Ballesteros | Big data Engineer
> Garvan Institute of Medical Research
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> T: +61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au
good material to learn apache spark
Dear Spark community,

I would like to learn more about Apache Spark. I have a Hortonworks HDP platform and have run a few Spark jobs in a cluster, but now I need to know more in depth how Spark works.

My main interest is the sysadmin and operational side of Spark and its ecosystem.

Is there any material?

Thank you very much

Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: +61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au
Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster without a feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.
Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0
And Hadoop 3.x is not part of the release and sign-off for 2.2.1. Maybe we could update the website to avoid any confusion with "later".

From: Josh Rosen
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There are also likely to be transitive dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu <akshaynaid...@gmail.com> wrote:

Yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and later', but my confusion is because Spark was released on 1st Dec and the Hadoop 3 stable version was released on 13th Dec. And to my similar question on stackoverflow.com (https://stackoverflow.com/questions/47920005/how-is-hadoop-3-0-0-s-compatibility-with-older-versions-of-hive-pig-sqoop-and), Mr. Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) replied that Spark 2.2.1 doesn't support Hadoop 3. So I am just looking for more clarity on this doubt before moving on to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is not clear whether it is supported or not (or has some issues). I think in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya <raj...@gmail.com>:

Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option to select the package type. In that, there is an option to select "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it does support Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu <akshaynaid...@gmail.com> wrote:

hello Users,
I need to know whether we can run the latest Spark on the latest Hadoop version, i.e., spark-2.2.1 released on 1st Dec and hadoop-3.0.0 released on 13th Dec.
thanks.
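For what it's worth, a quick sketch of how to check which Hadoop version a given Spark build is running against from a Spark shell (the Hadoop VersionInfo class ships with the Hadoop client libraries on Spark's classpath):

  import org.apache.hadoop.util.VersionInfo

  // Prints the Spark version of the running session, e.g. 2.2.1.
  println(s"Spark:  ${spark.version}")
  // Prints the Hadoop version actually on the classpath, e.g. 2.7.3.
  println(s"Hadoop: ${VersionInfo.getVersion}")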