Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, Great work Dongjoon.



Dongjoon Hyun wrote on Tuesday, 15 January 2019 at 3:47 PM:

> We are happy to announce the availability of Spark 2.2.3!
>
> Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
> maintenance branch of Spark. We strongly recommend that all 2.2.x users
> upgrade to this stable release.
>
> To download Spark 2.2.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-2-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Bests,
> Dongjoon.
>


-- 
Best Regards

Jeff Zhang


[ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-14 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.2.3!

Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend that all 2.2.x users
upgrade to this stable release.

To download Spark 2.2.3, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-2-3.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Bests,
Dongjoon.


Monthly Apache Spark Newsletter

2018-11-20 Thread Ankur Gupta
Hey Guys,

Just launched a monthly Apache Spark Newsletter.

https://newsletterspot.com/apache-spark/

Cheers,
Ankur




CVE-2018-17190: Unsecured Apache Spark standalone executes user code

2018-11-18 Thread Sean Owen
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
All versions of Apache Spark

Description:
Spark's standalone resource manager accepts code to execute on a 'master' host,
that then runs that code on 'worker' hosts. The master itself does not, by
design, execute user code. A specially-crafted request to the master can,
however, cause the master to execute code too. Note that this does not affect
standalone clusters with authentication enabled. While the master host
typically has less outbound access to other resources than a worker, the
execution of code on the master is nevertheless unexpected.

Mitigation:
Enable authentication on any Spark standalone cluster that is not otherwise
secured from unwanted access, for example by network-level restrictions. Use
spark.authenticate and related security properties described at
https://spark.apache.org/docs/latest/security.html
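
For illustration only (this sketch is not part of the original advisory): on the
application side, enabling shared-secret authentication can look like the snippet
below. The secret value is a placeholder, and the same properties must also be
configured on the standalone master and workers, e.g. via spark-defaults.conf.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable shared-secret authentication for a standalone cluster.
// "change-me" is a placeholder; the master and workers need the same settings.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "change-me")

val spark = SparkSession.builder()
  .config(conf)
  .appName("authenticated-app")
  .getOrCreate()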

Credit:
Andre Protas, Apple Information Security

References:
https://spark.apache.org/security.html

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Testing Apache Spark applications

2018-11-15 Thread Lars Albertsson
My previous answers to this question can be found in the archives, along
with some other responses:

http://apache-spark-user-list.1001560.n3.nabble.com/testing-frameworks-td32251.html
https://www.mail-archive.com/user%40spark.apache.org/msg48032.html

I have made a couple of presentations on the subject. Slides and video
are linked on this page: http://www.mapflat.com/presentations/

You can find more material in this list of resources:
http://www.mapflat.com/lands/resources/reading-list

Happy testing!

Regards,


Lars Albertsson
Data engineering entrepreneur
www.mimeria.com, www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109


On Thu, Nov 15, 2018 at 6:45 PM  wrote:

> Hi all,
>
>
>
> How are you testing your Spark applications?
>
> We are writing features by using Cucumber. This is testing the behaviours.
> Is this called functional test or integration test?
>
>
>
> We are also planning to write unit tests.
>
>
>
> For instance we have a class like below. It has one method. This method
> implements several things: DataFrame operations, saving a DataFrame
> into a database table, and insert, update, delete statements.
>
>
>
> Our classes generally contain 2 or 3 methods. These methods cover a lot
> of tasks in the same function definition (like the function below).
>
> So I am not sure how I can write unit tests for these classes and methods.
>
> Do you have any suggestion?
>
>
>
> class CustomerOperations
>
>
>
>def doJob(inputDataFrame : DataFrame) = {
>
>// definitions (value/variable)
>
>// spark context, session etc definition
>
>
>
>   //  filtering, cleansing on inputDataframe and save results on a
> new dataframe
>
>   // insert new dataframe to a database table
>
>  //  several insert/update/delete statements on the database tables
>
>
>
> }
>
>
>
>
>


Re: Testing Apache Spark applications

2018-11-15 Thread Vitaliy Pisarev
Hard to answer in a succinct manner but I'll give it a shot.

Cucumber is a tool for writing *Behaviour* Driven Tests (closely related to
behaviour driven development, BDD).
It is not a mere *technical* approach to testing but a mindset, a way of
working, and a different (whether it is better is a matter of
controversy) way to structure communication between product and R&D.

I will not elaborate more, as there is plenty of material out there if you
want to educate yourself. Just bear in mind that BDD is riddled with
misconceptions. More often than not I see people just using Cucumber, but
not doing actual BDD.

Regarding unit testing, I do not consider the code you showed to be a good
candidate for unit testing. There is very little procedural logic there and
there is a good chance that if you go about unit testing it you will end up
with lots and lots of mocks overly bound to the implementation details of
the unit under test, rendering the tests unmaintainable and brittle.

I would argue that unit tests are more appropriate for code that is
algorithmic in nature, that has no or very little dependencies and where
you have an absolute oracle of truth regarding your expectations of it.

I think that in your situation going for integration tests (on small scale
data) and regression tests would give you the most ROI.
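
As a small illustration of that last point - a minimal sketch only, assuming a
local SparkSession and a purely hypothetical cleansing rule (trim names, drop
rows without a name) standing in for whatever the real pipeline does:

import org.apache.spark.sql.SparkSession

// Sketch: run the (hypothetical) cleansing logic on a handful of in-memory rows
// and assert on the result, instead of mocking DataFrame internals.
object SmallDataCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("small-data-check").getOrCreate()
    import spark.implicits._

    val input = Seq((1, "  alice "), (2, null)).toDF("customer_id", "name")

    // hypothetical cleansing rule: drop rows without a name, trim the rest
    val cleansed = input.na.drop(Seq("name")).selectExpr("customer_id", "trim(name) as name")

    val rows = cleansed.collect()
    assert(rows.length == 1 && rows.head.getString(1) == "alice")

    spark.stop()
  }
}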






On Thu, Nov 15, 2018 at 8:43 PM ☼ R Nair  wrote:

> Sparklens from Qubole is a good resource. Other tests need to be handled by
> the developer.
>
> Best,
> Ravi
>
> On Thu, Nov 15, 2018, 12:45 PM 
>> Hi all,
>>
>>
>>
>> How are you testing your Spark applications?
>>
>> We are writing features by using Cucumber. This is testing the
>> behaviours. Is this called functional test or integration test?
>>
>>
>>
>> We are also planning to write unit tests.
>>
>>
>>
>> For instance we have a class like below. It has one method. This method
>> implements several things: DataFrame operations, saving a DataFrame
>> into a database table, and insert, update, delete statements.
>>
>>
>>
>> Our classes generally contain 2 or 3 methods. These methods cover a lot
>> of tasks in the same function definition (like the function below).
>>
>> So I am not sure how I can write unit tests for these classes and methods.
>>
>> Do you have any suggestion?
>>
>>
>>
>> class CustomerOperations
>>
>>
>>
>>def doJob(inputDataFrame : DataFrame) = {
>>
>>// definitions (value/variable)
>>
>>// spark context, session etc definition
>>
>>
>>
>>   //  filtering, cleansing on inputDataframe and save results on
>> a new dataframe
>>
>>   // insert new dataframe to a database table
>>
>>  //  several insert/update/delete statements on the database
>> tables
>>
>>
>>
>> }
>>
>>
>>
>>
>>
>


Re: Testing Apache Spark applications

2018-11-15 Thread ☼ R Nair
Sparklens from Qubole is a good resource. Other tests need to be handled by
the developer.

Best,
Ravi

On Thu, Nov 15, 2018, 12:45 PM  wrote:

> Hi all,
>
>
>
> How are you testing your Spark applications?
>
> We are writing features by using Cucumber. This is testing the behaviours.
> Is this called functional test or integration test?
>
>
>
> We are also planning to write unit tests.
>
>
>
> For instance we have a class like below. It has one method. This method
> implements several things: DataFrame operations, saving a DataFrame
> into a database table, and insert, update, delete statements.
>
>
>
> Our classes generally contain 2 or 3 methods. These methods cover a lot
> of tasks in the same function definition (like the function below).
>
> So I am not sure how I can write unit tests for these classes and methods.
>
> Do you have any suggestion?
>
>
>
> class CustomerOperations
>
>
>
>def doJob(inputDataFrame : DataFrame) = {
>
>// definitions (value/variable)
>
>// spark context, session etc definition
>
>
>
>   //  filtering, cleansing on inputDataframe and save results on a
> new dataframe
>
>   // insert new dataframe to a database table
>
>  //  several insert/update/delete statements on the database tables
>
>
>
> }
>
>
>
>
>


Testing Apache Spark applications

2018-11-15 Thread Omer.Ozsakarya
Hi all,

How are you testing your Spark applications?
We are writing features by using Cucumber. This is testing the behaviours. Is 
this called functional test or integration test?

We are also planning to write unit tests.

For instance we have a class like below. It has one method. This method
implements several things: DataFrame operations, saving a DataFrame into a
database table, and insert, update, delete statements.

Our classes generally contain 2 or 3 methods. These methods cover a lot of
tasks in the same function definition (like the function below).
So I am not sure how I can write unit tests for these classes and methods.
Do you have any suggestion?


class CustomerOperations {

   def doJob(inputDataFrame: DataFrame) = {
     // definitions (value/variable)
     // spark context, session etc. definition

     // filtering, cleansing on inputDataFrame; save results in a new dataframe
     // insert the new dataframe into a database table
     // several insert/update/delete statements on the database tables
   }
}
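
One common way to make a method like doJob unit-testable - sketched below with
hypothetical names, since the real filtering rules and JDBC details are not
shown above - is to separate the pure DataFrame transformation from the
side-effecting writes:

import org.apache.spark.sql.DataFrame

// Sketch: keep the transformation pure so it can be unit tested on small
// in-memory DataFrames, and push all writes behind an injectable function.
class CustomerOperations(writer: DataFrame => Unit) {

  // Pure: depends only on the input, so a test can assert directly on the output.
  def transform(inputDataFrame: DataFrame): DataFrame =
    inputDataFrame
      .na.drop(Seq("customer_id"))     // hypothetical cleansing rule
      .dropDuplicates("customer_id")   // hypothetical cleansing rule

  // Side effects live here; tests can pass a no-op or recording writer.
  def doJob(inputDataFrame: DataFrame): Unit =
    writer(transform(inputDataFrame))
}

A unit test then only needs a local SparkSession and a few rows to exercise
transform, while the insert/update/delete statements are covered separately by
integration tests against a test database.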




Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-09 Thread purna pradeep
Thanks, this is great news.

Can you please let me know if dynamic resource allocation is available in Spark
2.4?

I'm using Spark 2.3.2 on Kubernetes. Do I still need to provide executor
memory options as part of the spark-submit command, or will Spark manage the
required executor memory based on the Spark job size?

On Thu, Nov 8, 2018 at 2:18 PM Marcelo Vanzin 
wrote:

> +user@
>
> >> -- Forwarded message -
> >> From: Wenchen Fan 
> >> Date: Thu, Nov 8, 2018 at 10:55 PM
> >> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
> >> To: Spark dev list 
> >>
> >>
> >> Hi all,
> >>
> >> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
> adds Barrier Execution Mode for better integration with deep learning
> frameworks, introduces 30+ built-in and higher-order functions to deal with
> complex data type easier, improves the K8s integration, along with
> experimental Scala 2.12 support. Other major updates include the built-in
> Avro data source, Image data source, flexible streaming sinks, elimination
> of the 2GB block size limitation during transfer, Pandas UDF improvements.
> In addition, this release continues to focus on usability, stability, and
> polish while resolving around 1100 tickets.
> >>
> >> We'd like to thank our contributors and users for their contributions
> and early feedback to this release. This release would not have been
> possible without you.
> >>
> >> To download Spark 2.4.0, head over to the download page:
> http://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-0.html
> >>
> >> Thanks,
> >> Wenchen
> >>
> >> PS: If you see any issues with the release notes, webpage or published
> artifacts, please contact me directly off-list.
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Xiao Li
Try to clear your browsing data or use a different web browser.

Enjoy it,

Xiao

On Thu, Nov 8, 2018 at 4:15 PM Reynold Xin  wrote:

> Do you have a cached copy? I see it here
>
> http://spark.apache.org/downloads.html
>
>
>
> On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:
>
>> this is wonderful !
>> I noticed the official spark download site does not have 2.4 download
>> links yet.
>>
>> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde > wrote:
>>
>>> Great news.. thank you very much!
>>>
>>> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com wrote:
>>>
>>>> Awesome!
>>>>
>>>> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji 
>>>> wrote:
>>>>
>>>>> Indeed!
>>>>>
>>>>> Sent from my iPhone
>>>>> Pardon the dumb thumb typos :)
>>>>>
>>>>> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>> Finally, thank you all. Especially, thanks to the release manager,
>>>>> Wenchen!
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>> + user list
>>>>>>
>>>>>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>> resend
>>>>>>>
>>>>>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- Forwarded message -
>>>>>>>> From: Wenchen Fan 
>>>>>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>>>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>>>>>> To: Spark dev list 
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This
>>>>>>>> release adds Barrier Execution Mode for better integration with deep
>>>>>>>> learning frameworks, introduces 30+ built-in and higher-order 
>>>>>>>> functions to
>>>>>>>> deal with complex data type easier, improves the K8s integration, along
>>>>>>>> with experimental Scala 2.12 support. Other major updates include the
>>>>>>>> built-in Avro data source, Image data source, flexible streaming sinks,
>>>>>>>> elimination of the 2GB block size limitation during transfer, Pandas 
>>>>>>>> UDF
>>>>>>>> improvements. In addition, this release continues to focus on 
>>>>>>>> usability,
>>>>>>>> stability, and polish while resolving around 1100 tickets.
>>>>>>>>
>>>>>>>> We'd like to thank our contributors and users for their
>>>>>>>> contributions and early feedback to this release. This release would 
>>>>>>>> not
>>>>>>>> have been possible without you.
>>>>>>>>
>>>>>>>> To download Spark 2.4.0, head over to the download page:
>>>>>>>> http://spark.apache.org/downloads.html
>>>>>>>>
>>>>>>>> To view the release notes:
>>>>>>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Wenchen
>>>>>>>>
>>>>>>>> PS: If you see any issues with the release notes, webpage or
>>>>>>>> published artifacts, please contact me directly off-list.
>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>>>



Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
Do you have a cached copy? I see it here

http://spark.apache.org/downloads.html



On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:

> this is wonderful !
> I noticed the official spark download site does not have 2.4 download
> links yet.
>
> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde  wrote:
>
>> Great news.. thank you very much!
>>
>> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com wrote:
>>
>>> Awesome!
>>>
>>> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:
>>>
>>>> Indeed!
>>>>
>>>> Sent from my iPhone
>>>> Pardon the dumb thumb typos :)
>>>>
>>>> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
>>>> wrote:
>>>>
>>>> Finally, thank you all. Especially, thanks to the release manager,
>>>> Wenchen!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan 
>>>> wrote:
>>>>
>>>>> + user list
>>>>>
>>>>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>> resend
>>>>>>
>>>>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Forwarded message -
>>>>>>> From: Wenchen Fan 
>>>>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>>>>> To: Spark dev list 
>>>>>>>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This
>>>>>>> release adds Barrier Execution Mode for better integration with deep
>>>>>>> learning frameworks, introduces 30+ built-in and higher-order functions 
>>>>>>> to
>>>>>>> deal with complex data type easier, improves the K8s integration, along
>>>>>>> with experimental Scala 2.12 support. Other major updates include the
>>>>>>> built-in Avro data source, Image data source, flexible streaming sinks,
>>>>>>> elimination of the 2GB block size limitation during transfer, Pandas UDF
>>>>>>> improvements. In addition, this release continues to focus on usability,
>>>>>>> stability, and polish while resolving around 1100 tickets.
>>>>>>>
>>>>>>> We'd like to thank our contributors and users for their
>>>>>>> contributions and early feedback to this release. This release would not
>>>>>>> have been possible without you.
>>>>>>>
>>>>>>> To download Spark 2.4.0, head over to the download page:
>>>>>>> http://spark.apache.org/downloads.html
>>>>>>>
>>>>>>> To view the release notes:
>>>>>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>> PS: If you see any issues with the release notes, webpage or
>>>>>>> published artifacts, please contact me directly off-list.
>>>>>>>
>>>>>>
>>>
>>>
>>>
>>>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Li Gao
This is wonderful!
I noticed the official Spark download site does not have 2.4 download links
yet.

On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde  wrote:

> Great news.. thank you very much!
>
> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com wrote:
>
>> Awesome!
>>
>> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:
>>
>>> Indeed!
>>>
>>> Sent from my iPhone
>>> Pardon the dumb thumb typos :)
>>>
>>> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
>>> wrote:
>>>
>>> Finally, thank you all. Especially, thanks to the release manager,
>>> Wenchen!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>>>
>>>> + user list
>>>>
>>>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>>>
>>>>> resend
>>>>>
>>>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> -- Forwarded message -
>>>>>> From: Wenchen Fan 
>>>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>>>> To: Spark dev list 
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>>>>>> adds Barrier Execution Mode for better integration with deep learning
>>>>>> frameworks, introduces 30+ built-in and higher-order functions to deal 
>>>>>> with
>>>>>> complex data type easier, improves the K8s integration, along with
>>>>>> experimental Scala 2.12 support. Other major updates include the built-in
>>>>>> Avro data source, Image data source, flexible streaming sinks, 
>>>>>> elimination
>>>>>> of the 2GB block size limitation during transfer, Pandas UDF 
>>>>>> improvements.
>>>>>> In addition, this release continues to focus on usability, stability, and
>>>>>> polish while resolving around 1100 tickets.
>>>>>>
>>>>>> We'd like to thank our contributors and users for their contributions
>>>>>> and early feedback to this release. This release would not have been
>>>>>> possible without you.
>>>>>>
>>>>>> To download Spark 2.4.0, head over to the download page:
>>>>>> http://spark.apache.org/downloads.html
>>>>>>
>>>>>> To view the release notes:
>>>>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>> PS: If you see any issues with the release notes, webpage or
>>>>>> published artifacts, please contact me directly off-list.
>>>>>>
>>>>>
>>
>>
>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Swapnil Shinde
Great news.. thank you very much!

On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com wrote:

> Awesome!
>
> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:
>
>> Indeed!
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
>> wrote:
>>
>> Finally, thank you all. Especially, thanks to the release manager,
>> Wenchen!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>>
>>> + user list
>>>
>>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>>
>>>> resend
>>>>
>>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> -- Forwarded message -
>>>>> From: Wenchen Fan 
>>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>>> To: Spark dev list 
>>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>>>>> adds Barrier Execution Mode for better integration with deep learning
>>>>> frameworks, introduces 30+ built-in and higher-order functions to deal 
>>>>> with
>>>>> complex data type easier, improves the K8s integration, along with
>>>>> experimental Scala 2.12 support. Other major updates include the built-in
>>>>> Avro data source, Image data source, flexible streaming sinks, elimination
>>>>> of the 2GB block size limitation during transfer, Pandas UDF improvements.
>>>>> In addition, this release continues to focus on usability, stability, and
>>>>> polish while resolving around 1100 tickets.
>>>>>
>>>>> We'd like to thank our contributors and users for their contributions
>>>>> and early feedback to this release. This release would not have been
>>>>> possible without you.
>>>>>
>>>>> To download Spark 2.4.0, head over to the download page:
>>>>> http://spark.apache.org/downloads.html
>>>>>
>>>>> To view the release notes:
>>>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>> PS: If you see any issues with the release notes, webpage or published
>>>>> artifacts, please contact me directly off-list.
>>>>>
>>>>
>
>
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Stavros Kontopoulos
Awesome!

On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:

> Indeed!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
> wrote:
>
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>
>> + user list
>>
>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>
>>> resend
>>>
>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>>
>>>>
>>>>
>>>> ------ Forwarded message -
>>>> From: Wenchen Fan 
>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>> To: Spark dev list 
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>>>> adds Barrier Execution Mode for better integration with deep learning
>>>> frameworks, introduces 30+ built-in and higher-order functions to deal with
>>>> complex data type easier, improves the K8s integration, along with
>>>> experimental Scala 2.12 support. Other major updates include the built-in
>>>> Avro data source, Image data source, flexible streaming sinks, elimination
>>>> of the 2GB block size limitation during transfer, Pandas UDF improvements.
>>>> In addition, this release continues to focus on usability, stability, and
>>>> polish while resolving around 1100 tickets.
>>>>
>>>> We'd like to thank our contributors and users for their contributions
>>>> and early feedback to this release. This release would not have been
>>>> possible without you.
>>>>
>>>> To download Spark 2.4.0, head over to the download page:
>>>> http://spark.apache.org/downloads.html
>>>>
>>>> To view the release notes: https://spark.apache.org/
>>>> releases/spark-release-2-4-0.html
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> PS: If you see any issues with the release notes, webpage or published
>>>> artifacts, please contact me directly off-list.
>>>>
>>>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Jules Damji
Indeed! 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun  wrote:
> 
> Finally, thank you all. Especially, thanks to the release manager, Wenchen!
> 
> Bests,
> Dongjoon.
> 
> 
>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>> + user list
>> 
>>> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>>> resend
>>> 
>>>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>>> 
>>>> 
>>>> -- Forwarded message -
>>>> From: Wenchen Fan 
>>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>>> To: Spark dev list 
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds 
>>>> Barrier Execution Mode for better integration with deep learning 
>>>> frameworks, introduces 30+ built-in and higher-order functions to deal 
>>>> with complex data type easier, improves the K8s integration, along with 
>>>> experimental Scala 2.12 support. Other major updates include the built-in 
>>>> Avro data source, Image data source, flexible streaming sinks, elimination 
>>>> of the 2GB block size limitation during transfer, Pandas UDF improvements. 
>>>> In addition, this release continues to focus on usability, stability, and 
>>>> polish while resolving around 1100 tickets.
>>>> 
>>>> We'd like to thank our contributors and users for their contributions and 
>>>> early feedback to this release. This release would not have been possible 
>>>> without you.
>>>> 
>>>> To download Spark 2.4.0, head over to the download page: 
>>>> http://spark.apache.org/downloads.html
>>>> 
>>>> To view the release notes: 
>>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>> 
>>>> Thanks,
>>>> Wenchen
>>>> 
>>>> PS: If you see any issues with the release notes, webpage or published 
>>>> artifacts, please contact me directly off-list.


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Dongjoon Hyun
Finally, thank you all. Especially, thanks to the release manager, Wenchen!

Bests,
Dongjoon.


On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:

> + user list
>
> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>
>> resend
>>
>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Wenchen Fan 
>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>> To: Spark dev list 
>>>
>>>
>>> Hi all,
>>>
>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>>> adds Barrier Execution Mode for better integration with deep learning
>>> frameworks, introduces 30+ built-in and higher-order functions to deal with
>>> complex data type easier, improves the K8s integration, along with
>>> experimental Scala 2.12 support. Other major updates include the built-in
>>> Avro data source, Image data source, flexible streaming sinks, elimination
>>> of the 2GB block size limitation during transfer, Pandas UDF improvements.
>>> In addition, this release continues to focus on usability, stability, and
>>> polish while resolving around 1100 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 2.4.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> PS: If you see any issues with the release notes, webpage or published
>>> artifacts, please contact me directly off-list.
>>>
>>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list

On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:

> resend
>
> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>
>>
>>
>> -- Forwarded message -
>> From: Wenchen Fan 
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list 
>>
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>> adds Barrier Execution Mode for better integration with deep learning
>> frameworks, introduces 30+ built-in and higher-order functions to deal with
>> complex data type easier, improves the K8s integration, along with
>> experimental Scala 2.12 support. Other major updates include the built-in
>> Avro data source, Image data source, flexible streaming sinks, elimination
>> of the 2GB block size limitation during transfer, Pandas UDF improvements.
>> In addition, this release continues to focus on usability, stability, and
>> polish while resolving around 1100 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 2.4.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> Thanks,
>> Wenchen
>>
>> PS: If you see any issues with the release notes, webpage or published
>> artifacts, please contact me directly off-list.
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Marcelo Vanzin
+user@

>> -- Forwarded message -
>> From: Wenchen Fan 
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list 
>>
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release adds 
>> Barrier Execution Mode for better integration with deep learning frameworks, 
>> introduces 30+ built-in and higher-order functions to deal with complex data 
>> type easier, improves the K8s integration, along with experimental Scala 
>> 2.12 support. Other major updates include the built-in Avro data source, 
>> Image data source, flexible streaming sinks, elimination of the 2GB block 
>> size limitation during transfer, Pandas UDF improvements. In addition, this 
>> release continues to focus on usability, stability, and polish while 
>> resolving around 1100 tickets.
>>
>> We'd like to thank our contributors and users for their contributions and 
>> early feedback to this release. This release would not have been possible 
>> without you.
>>
>> To download Spark 2.4.0, head over to the download page: 
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes: 
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> Thanks,
>> Wenchen
>>
>> PS: If you see any issues with the release notes, webpage or published 
>> artifacts, please contact me directly off-list.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham
When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count
with "spark.sql.orc.filterPushdown=true", I see the following in the executor
logs. Predicate pushdown is happening:

18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 =
(IS_NULL conv_date)
leaf-1 = (EQUALS conv_date 20181025)
expr = (and (not leaf-0) leaf-1)


But when I run the Hive query in Spark, I see the logs below:

Hive table: Hive

spark.sql("select * from test where conv_date = 20181025").count

18/11/01 17:37:57 INFO HadoopRDD: Input split: hdfs://test/test1.orc:0+34568
18/11/01 17:37:57 INFO OrcRawRecordMerger: min key = null, max key = null
18/11/01 17:37:57 INFO ReaderImpl: Reading ORC rows from
hdfs://test/test1.orc with {include: [true, false, false, false, true,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false], offset: 0, length: 9223372036854775807}
18/11/01 17:37:57 INFO Executor: Finished task 224.0 in stage 0.0 (TID 33).
1662 bytes result sent to driver
18/11/01 17:37:57 INFO CoarseGrainedExecutorBackend: Got assigned task 40
18/11/01 17:37:57 INFO Executor: Running task 956.0 in stage 0.0 (TID 40)
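
One possible explanation for the difference - offered here as an assumption, not
something confirmed in this thread: the Hive-table path above goes through the
Hive ORC reader (OrcRawRecordMerger/ReaderImpl in the log), which may not apply
spark.sql.orc.filterPushdown, whereas spark.read.orc uses Spark's native ORC
source. If so, converting metastore ORC tables to the native source might help;
a minimal sketch, with an illustrative table name:

import org.apache.spark.sql.SparkSession

// Sketch only: read Hive metastore ORC tables through Spark's native ORC source
// so that ORC predicate pushdown can apply. The table name is illustrative.
val spark = SparkSession.builder()
  .appName("orc-pushdown-check")
  .enableHiveSupport()
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

spark.sql("select * from test where conv_date = 20181025").count()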





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
A lot of small files is very inefficient in itself, and predicate pushdown will
not help you much there unless you merge them into one large file (a single
large file can be processed much more efficiently).

How did you validate that predicate pushdown did not work in Hive? Your Hive
version is also very old - consider upgrading to at least Hive 2.x.
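
A sketch of one way to do that merge - reading the small files and rewriting
them as a few larger ones; the paths and the target partition count are
placeholders, and the right count depends on the data volume:

import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch: compact many small ORC files into a few larger ones.
// Paths and the partition count (16) are placeholders.
val spark = SparkSession.builder().appName("orc-compaction").getOrCreate()

spark.read.orc("hdfs://test/date=201810")
  .repartition(16)
  .write
  .mode(SaveMode.Overwrite)
  .orc("hdfs://test_compacted/date=201810")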

> On 31.10.2018 at 20:35, gpatcham wrote:
> 
> spark version 2.2.0
> Hive version 1.1.0
> 
> There are lot of small files
> 
> Spark code :
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true 
> 
> val logs
> =spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date >
> 20181003")
> 
> Hive:
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true 
> 
> test  table in Hive is pointing to hdfs://test/  and partitioned on date
> 
> val sqlStr = s"select * from test where date > 20181001"
> val logs = spark.sql(sqlStr)
> 
> With Hive query I don't see filter pushdown is  happening. I tried setting
> these configs in both hive-site.xml and also spark.sqlContext.setConf
> 
> "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true" 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
spark version 2.2.0
Hive version 1.1.0

There are a lot of small files.

Spark code :

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true 

val logs
=spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date >
20181003")

Hive:

"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true 

test  table in Hive is pointing to hdfs://test/  and partitioned on date

val sqlStr = s"select * from test where date > 20181001"
val logs = spark.sql(sqlStr)

With the Hive query I don't see filter pushdown happening. I tried setting
these configs both in hive-site.xml and also via spark.sqlContext.setConf:

"hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true" 



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause a significant delay in
processing - try to merge as many as possible into one file.

Can you please share the full source code in Hive and Spark, as well as the
versions you are using?

> On 31.10.2018 at 18:23, gpatcham wrote:
> 
> 
> 
> When reading large number of orc files from HDFS under a directory spark
> doesn't launch any tasks until some amount of time and I don't see any tasks
> running during that time. I'm using below command to read orc and spark.sql
> configs.
> 
> What spark is doing under hoods when spark.read.orc is issued?
> 
> spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true
> 
> Also instead of directly reading orc files I tried running Hive query on
> same dataset. But I was not able to push filter predicate. Where should I
> set the below config's "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true"
> 
> Suggest what is the best way to read orc files from HDFS and tuning
> parameters ?
> 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham



When reading a large number of ORC files from HDFS under a directory, Spark
doesn't launch any tasks for some time, and I don't see any tasks
running during that period. I'm using the command and spark.sql configs below
to read the ORC data.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true

Also, instead of directly reading ORC files, I tried running a Hive query on the
same dataset, but I was not able to push the filter predicate down. Where should
I set the configs below?
"hive.optimize.ppd":"true", "hive.optimize.ppd.storage":"true"

Please suggest the best way to read ORC files from HDFS, and which parameters
should be tuned.
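
For the spark.sql.orc.* options, one place they can be set is on the session
itself, either at build time or with spark.conf.set before the read - a sketch
only, with an illustrative schema and path:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Sketch: session-level ORC settings; the schema and HDFS path are placeholders.
val spark = SparkSession.builder().appName("orc-read").getOrCreate()
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val schema = StructType(Seq(
  StructField("date", IntegerType),
  StructField("conv_date", IntegerType)))

val count = spark.read
  .schema(schema)
  .orc("hdfs://test1")
  .filter("date >= 20181001")
  .count()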




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-30 Thread Jörn Franke
Older versions of Spark did indeed have lower performance for Python and R, due
to the need to convert between JVM datatypes and Python/R datatypes. This changed
in Spark 2.2, I think, with the integration of Apache Arrow. However, what you do
after the conversion in those languages can still be slower than, for instance,
in Java if you do not use Spark-only functions. It could also be faster (e.g. if
you use a Python module implemented natively in C and no translation into C
datatypes is needed).
Scala has in certain cases a more elegant syntax than Java (if you do not use
lambdas). Sometimes this elegant syntax can lead to (unintentionally) inefficient
constructs for which there is a better way to express them (e.g. implicit
conversions, use of collection methods, etc.). However, there are better ways,
and you just have to spot these issues in the source code and address them if
needed.
So a comparison between those languages does not really make sense - it always
depends.

> On 30.10.2018 at 07:00, akshay naidu wrote:
> 
> how about Python. 
> java vs scala vs python vs R
> which is better.
> 
>> On Sat, Oct 27, 2018 at 3:34 AM karan alang  wrote:
>> Hello 
>> - is there a "performance" difference when using Java or Scala for Apache 
>> Spark ?
>> 
>> I understand, there are other obvious differences (less code with scala, 
>> easier to focus on logic etc), 
>> but wrt performance - i think there would not be much of a difference since 
>> both of them are JVM based, 
>> pls. let me know if this is not the case.
>> 
>> thanks!


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-30 Thread akshay naidu
How about Python?
Java vs Scala vs Python vs R -
which is better?

On Sat, Oct 27, 2018 at 3:34 AM karan alang  wrote:

> Hello
> - is there a "performance" difference when using Java or Scala for Apache
> Spark ?
>
> I understand, there are other obvious differences (less code with scala,
> easier to focus on logic etc),
> but wrt performance - i think there would not be much of a difference
> since both of them are JVM based,
> pls. let me know if this is not the case.
>
> thanks!
>


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Gourav Sengupta
I genuinely do not think that using Scala for Spark requires us to be experts in
Scala. There is in fact a tutorial called "Just Enough Scala for Spark"
which even with my IQ does not take more than 40 minutes to go through. Also,
the syntax of Scala is almost always similar to that of Python.

Data processing is much more amenable to functional thinking, and therefore
Scala suits it best; also, Spark itself is written in Scala.

Regards,
Gourav

On Mon, Oct 29, 2018 at 11:33 PM kant kodali  wrote:

> Most people when they compare two different programming languages 99% of
> the time it all seems to boil down to syntax sugar.
>
> Performance I doubt Scala is ever faster than Java given that Scala likes
> Heap more than Java. I had also written some pointless micro-benchmarking
> code like (Random String Generation, hash computations, etc..) on Java,
> Scala and Golang and Java had outperformed both Scala and Golang as well on
> many occasions.
>
> Now that Java 11 had released things seem to get even better given the
> startup time is also very low.
>
> I am happy to change my view as long as I can see some code and benchmarks!
>
>
>
> On Mon, Oct 29, 2018 at 1:58 PM Jean Georges Perrin  wrote:
>
>> did not see anything, but curious if you find something.
>>
>> I think one of the big benefit of using Java, for data engineering in the
>> context of  Spark, is that you do not have to train a lot of your team to
>> Scala. Now if you want to do data science, Java is probably not the best
>> tool yet...
>>
>> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
>>
>> Hello
>> - is there a "performance" difference when using Java or Scala for Apache
>> Spark ?
>>
>> I understand, there are other obvious differences (less code with scala,
>> easier to focus on logic etc),
>> but wrt performance - i think there would not be much of a difference
>> since both of them are JVM based,
>> pls. let me know if this is not the case.
>>
>> thanks!
>>
>>
>>


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread kant kodali
When most people compare two different programming languages, 99% of
the time it all seems to boil down to syntactic sugar.

Performance-wise, I doubt Scala is ever faster than Java, given that Scala uses
the heap more than Java. I had also written some pointless micro-benchmarking
code (random string generation, hash computations, etc.) in Java,
Scala and Golang, and Java outperformed both Scala and Golang on
many occasions.

Now that Java 11 has been released, things seem to get even better, given that
the startup time is also very low.

I am happy to change my view as long as I can see some code and benchmarks!



On Mon, Oct 29, 2018 at 1:58 PM Jean Georges Perrin  wrote:

> did not see anything, but curious if you find something.
>
> I think one of the big benefit of using Java, for data engineering in the
> context of  Spark, is that you do not have to train a lot of your team to
> Scala. Now if you want to do data science, Java is probably not the best
> tool yet...
>
> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
>
> Hello
> - is there a "performance" difference when using Java or Scala for Apache
> Spark ?
>
> I understand, there are other obvious differences (less code with scala,
> easier to focus on logic etc),
> but wrt performance - i think there would not be much of a difference
> since both of them are JVM based,
> pls. let me know if this is not the case.
>
> thanks!
>
>
>


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
I did not see anything, but I am curious whether you find something.

I think one of the big benefits of using Java for data engineering in the
context of Spark is that you do not have to train a lot of your team in
Scala. Now if you want to do data science, Java is probably not the best tool
yet...

> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
> 
> Hello 
> - is there a "performance" difference when using Java or Scala for Apache 
> Spark ?
> 
> I understand, there are other obvious differences (less code with scala, 
> easier to focus on logic etc), 
> but wrt performance - i think there would not be much of a difference since 
> both of them are JVM based, 
> pls. let me know if this is not the case.
> 
> thanks!



Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread Battini Lakshman
On Oct 27, 2018 3:34 AM, "karan alang"  wrote:

Hello
- is there a "performance" difference when using Java or Scala for Apache
Spark ?

I understand, there are other obvious differences (less code with scala,
easier to focus on logic etc),
but wrt performance - i think there would not be much of a difference since
both of them are JVM based,
pls. let me know if this is not the case.

thanks!


java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread karan alang
Hello,
- is there a "performance" difference when using Java or Scala for Apache
Spark?

I understand there are other obvious differences (less code with Scala,
easier to focus on logic, etc.),
but with respect to performance I think there would not be much of a difference
since both of them are JVM based;
please let me know if this is not the case.

Thanks!


CVE-2018-11804: Apache Spark build/mvn runs zinc, and can expose information from build machines

2018-10-24 Thread Sean Owen
Severity: Low

Vendor: The Apache Software Foundation

Versions Affected:
1.3.x release branch and later, including master

Description:
Spark's Apache Maven-based build includes a convenience script, 'build/mvn',
that downloads and runs a zinc server to speed up compilation. This server
will accept connections from external hosts by default. A specially-crafted
request to the zinc server could cause it to reveal information in files
readable to the developer account running the build. Note that this issue
does not affect end users of Spark, only developers building Spark from
source code.

Mitigation:
Spark users are not affected, as zinc is only a part of the build process.
Spark developers may simply use a local Maven installation's 'mvn' command
to build, and avoid running build/mvn and zinc.
Spark developers building actively-developed branches (2.2.x, 2.3.x, 2.4.x,
master) may update their branches to receive mitigations already patched
onto the build/mvn script.
Spark developers running zinc separately may include "-server 127.0.0.1" in
its command line, and consider additional flags like "-idle-timeout 30m" to
achieve similar mitigation.

Credit:
Andre Protas, Apple Information Security

References:
https://spark.apache.org/security.html

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
I do not think security and governance have only now become important; they
always were. Hortonworks and Cloudera have fantastic security implementations,
and hence I mentioned updates via Hive.

Regards,
Gourav

On Wed, 24 Oct 2018, 17:32 ,  wrote:

> Thank you Gourav,
>
> Today I saw the article:
> https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
>
> It seems also interesting.
>
> I was in meeting, I will also watch it.
>
>
>
> *From: *Gourav Sengupta 
> *Date: *24 October 2018 Wednesday 13:39
> *To: *"Ozsakarya, Omer" 
> *Cc: *Spark Forum 
> *Subject: *Re: Triggering sql on Was S3 via Apache Spark
>
>
>
> Also try to read about SCD and the fact that Hive may be a very good
> alternative as well for running updates on data
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Wed, 24 Oct 2018, 14:53 ,  wrote:
>
> Thank you very much 
>
>
>
> *From: *Gourav Sengupta 
> *Date: *24 October 2018 Wednesday 11:20
> *To: *"Ozsakarya, Omer" 
> *Cc: *Spark Forum 
> *Subject: *Re: Triggering sql on Was S3 via Apache Spark
>
>
>
> This is interesting you asked and then answered the questions (almost) as
> well
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Tue, 23 Oct 2018, 13:23 ,  wrote:
>
> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>1. CRM application will send a file to a folder. This file contains
>customer information of all customers. This file is in a folder in the
>local server. File name is: customer.tsv
>
>
>1. Customer.tsv contains customerid, country, birty_month,
>   activation_date etc
>
>
>1. I need to read the contents of customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>1.  CRM application will send a new file which contains the
>updated/deleted/inserted customer information.
>
>   File name is daily_customer.tsv
>
>1. Daily_customer.tsv contains customerid, cdc_field,
>   country, birty_month, activation_date etc
>
> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>1. I need to read the contents of daily_customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer daily_customer.tsv to the S3 bucket:
>customer.daily.data
>4. I need to merge two buckets customer.history.data and
>customer.daily.data.
>
>
>1. Two buckets have timestamp fields. So I need to query all records
>   whose timestamp is the last timestamp.
>   2. I can use row_number() over(partition by customer_id order by
>   timestamp_field desc) as version_number
>   3. Then I can put the records whose version is one, to the final
>   bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>- Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>on a local Spark cluster?
>- Is this approach efficient? Will the queries transfer all historical
>data from AWS S3 to the local cluster?
>- How can I implement this scenario in a more effective way? Like just
>transferring daily data to AWS S3 and then running queries on AWS.
>
>
>- For instance Athena can query on AWS. But it is just a query engine.
>   As I know I can not call it by using an sdk and I can not write the 
> results
>   to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
>
>
>
>
>
>
>
>
>


Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Omer.Ozsakarya
Thank you Gourav,

Today I saw the article: 
https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important
It also seems interesting.
I was in a meeting; I will also watch it.

From: Gourav Sengupta 
Date: 24 October 2018 Wednesday 13:39
To: "Ozsakarya, Omer" 
Cc: Spark Forum 
Subject: Re: Triggering sql on Was S3 via Apache Spark

Also try to read about SCD and the fact that Hive may be a very good 
alternative as well for running updates on data

Regards,
Gourav

On Wed, 24 Oct 2018, 14:53, <omer.ozsaka...@sony.com> wrote:
Thank you very much 

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
Cc: Spark Forum <user@spark.apache.org>
Subject: Re: Triggering sql on Was S3 via Apache Spark

This is interesting you asked and then answered the questions (almost) as well

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:
Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  CRM application will send a file to a folder. This file contains customer 
information of all customers. This file is in a folder in the local server. 
File name is: customer.tsv

 *   Customer.tsv contains customerid, country, birty_month, 
activation_date etc

  1.  I need to read the contents of customer.tsv.
  2.  I will add current timestamp info to the file.
  3.  I will transfer customer.tsv to the S3 bucket: customer.history.data

In the daily loads:

  1.   CRM application will send a new file which contains the 
updated/deleted/inserted customer information.

  File name is daily_customer.tsv

 *   Daily_customer.tsv contains customerid, cdc_field, country, 
birty_month, activation_date etc

Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.

  1.  I need to read the contents of daily_customer.tsv.
  2.  I will add current timestamp info to the file.
  3.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  4.  I need to merge two buckets customer.history.data and customer.daily.data.

 *   Two buckets have timestamp fields. So I need to query all records 
whose timestamp is the last timestamp.
 *   I can use row_number() over(partition by customer_id order by 
timestamp_field desc) as version_number
 *   Then I can put the records whose version is one, to the final bucket: 
customer.dimension.data

I am running Spark on premise.

  *   Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD on a 
local Spark cluster?
  *   Is this approach efficient? Will the queries transfer all historical data 
from AWS S3 to the local cluster?
  *   How can I implement this scenario in a more effective way? Like just 
transferring daily data to AWS S3 and then running queries on AWS.

 *   For instance Athena can query on AWS. But it is just a query engine. 
As I know I can not call it by using an sdk and I can not write the results to 
a bucket/folder.

Thanks in advance,
Ömer






Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
Also try to read about SCD and the fact that Hive may be a very good
alternative as well for running updates on data

Regards,
Gourav

On Wed, 24 Oct 2018, 14:53 ,  wrote:

> Thank you very much 
>
>
>
> *From: *Gourav Sengupta 
> *Date: *24 October 2018 Wednesday 11:20
> *To: *"Ozsakarya, Omer" 
> *Cc: *Spark Forum 
> *Subject: *Re: Triggering sql on Was S3 via Apache Spark
>
>
>
> This is interesting you asked and then answered the questions (almost) as
> well
>
>
>
> Regards,
>
> Gourav
>
>
>
> On Tue, 23 Oct 2018, 13:23 ,  wrote:
>
> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>1. CRM application will send a file to a folder. This file contains
>customer information of all customers. This file is in a folder in the
>local server. File name is: customer.tsv
>
>
>1. Customer.tsv contains customerid, country, birty_month,
>   activation_date etc
>
>
>1. I need to read the contents of customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>1.  CRM application will send a new file which contains the
>updated/deleted/inserted customer information.
>
>   File name is daily_customer.tsv
>
>1. Daily_customer.tsv contains customerid, cdc_field,
>   country, birty_month, activation_date etc
>
> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>1. I need to read the contents of daily_customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer daily_customer.tsv to the S3 bucket:
>customer.daily.data
>4. I need to merge two buckets customer.history.data and
>customer.daily.data.
>
>
>1. Two buckets have timestamp fields. So I need to query all records
>   whose timestamp is the last timestamp.
>   2. I can use row_number() over(partition by customer_id order by
>   timestamp_field desc) as version_number
>   3. Then I can put the records whose version is one, to the final
>   bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>- Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>on a local Spark cluster?
>- Is this approach efficient? Will the queries transfer all historical
>data from AWS S3 to the local cluster?
>- How can I implement this scenario in a more effective way? Like just
>transferring daily data to AWS S3 and then running queries on AWS.
>
>
>- For instance Athena can query on AWS. But it is just a query engine.
>   As I know I can not call it by using an sdk and I can not write the 
> results
>   to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
>
>
>
>
>
>
>
>
>


Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Omer.Ozsakarya
Thank you very much 

From: Gourav Sengupta 
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer" 
Cc: Spark Forum 
Subject: Re: Triggering sql on Was S3 via Apache Spark

This is interesting you asked and then answered the questions (almost) as well

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23 , 
mailto:omer.ozsaka...@sony.com>> wrote:
Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  CRM application will send a file to a folder. This file contains customer 
information of all customers. This file is in a folder in the local server. 
File name is: customer.tsv

 *   Customer.tsv contains customerid, country, birty_month, 
activation_date etc

  1.  I need to read the contents of customer.tsv.
  2.  I will add current timestamp info to the file.
  3.  I will transfer customer.tsv to the S3 bucket: customer.history.data

In the daily loads:

  1.   CRM application will send a new file which contains the 
updated/deleted/inserted customer information.

  File name is daily_customer.tsv

 *   Daily_customer.tsv contains customerid, cdc_field, country, 
birty_month, activation_date etc

Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.

  1.  I need to read the contents of daily_customer.tsv.
  2.  I will add current timestamp info to the file.
  3.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  4.  I need to merge two buckets customer.history.data and customer.daily.data.

 *   Two buckets have timestamp fields. So I need to query all records 
whose timestamp is the last timestamp.
 *   I can use row_number() over(partition by customer_id order by 
timestamp_field desc) as version_number
 *   Then I can put the records whose version is one, to the final bucket: 
customer.dimension.data

I am running Spark on premise.

  *   Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD on a 
local Spark cluster?
  *   Is this approach efficient? Will the queries transfer all historical data 
from AWS S3 to the local cluster?
  *   How can I implement this scenario in a more effective way? Like just 
transferring daily data to AWS S3 and then running queries on AWS.

 *   For instance, Athena can run queries on AWS, but it is just a query 
engine. As far as I know, I cannot call it using an SDK, and I cannot write the 
results to a bucket/folder.

Thanks in advance,
Ömer






Re: Triggering sql on Was S3 via Apache Spark

2018-10-24 Thread Gourav Sengupta
This is interesting you asked and then answered the questions (almost) as
well

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23 ,  wrote:

> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>1. CRM application will send a file to a folder. This file contains
>customer information of all customers. This file is in a folder in the
>local server. File name is: customer.tsv
>   1. Customer.tsv contains customerid, country, birty_month,
>   activation_date etc
>2. I need to read the contents of customer.tsv.
>3. I will add current timestamp info to the file.
>4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>1.  CRM application will send a new file which contains the
>updated/deleted/inserted customer information.
>
>   File name is daily_customer.tsv
>
>1. Daily_customer.tsv contains contains customerid, cdc_field,
>   country, birty_month, activation_date etc
>
> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>1. I need to read the contents of daily_customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer daily_customer.tsv to the S3 bucket:
>customer.daily.data
>4. I need to merge two buckets customer.history.data and
>customer.daily.data.
>   1. Two buckets have timestamp fields. So I need to query all
>   records whose timestamp is the last timestamp.
>   2. I can use row_number() over(partition by customer_id order by
>   timestamp_field desc) as version_number
>   3. Then I can put the records whose version is one, to the final
>   bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>- Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>on a local Spark cluster?
>- Is this approach efficient? Will the queries transfer all historical
>data from AWS S3 to the local cluster?
>- How can I implement this scenario in a more effective way? Like just
>transferring daily data to AWS S3 and then running queries on AWS.
>   - For instance Athena can query on AWS. But it is just a query
>   engine. As I know I can not call it by using an sdk and I can not write 
> the
>   results to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
>
>
>
>
>
>
>
>


Re: Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Jörn Franke
Why not directly access the S3 file from Spark?


You need to configure the IAM roles so that the machine running the S3 code is 
allowed to access the bucket.
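For an on-premises cluster without instance roles, a minimal sketch of reading the bucket directly over s3a could look like the following (the credential keys and bucket path are placeholder assumptions; with IAM roles on EC2/EMR the two credential lines can be dropped, and the hadoop-aws module must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-direct-read").getOrCreate()

// Only needed when the host has no IAM role attached.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

// Read the TSV directly from the bucket and query it with Spark SQL.
val customers = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3a://customer.history.data/customer.tsv")

customers.createOrReplaceTempView("customer_history")
spark.sql("SELECT count(*) FROM customer_history").show()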

> On 24.10.2018 at 06:40, Divya Gehlot wrote:
> 
> Hi Omer ,
> Here are couple of the solutions which you can implement for your use case : 
> Option 1 : 
> you can mount the S3 bucket as local file system 
> Here are the details : 
> https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
> Option 2 :
>  You can use Amazon Glue for your use case 
> here are the details : 
> https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/
> 
> Option 3 :
> Store the file in the local file system and later push it s3 bucket 
> here are the details 
> https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket
> 
> Thanks,
> Divya 
> 
>> On Tue, 23 Oct 2018 at 15:53,  wrote:
>> Hi guys,
>> 
>>  
>> 
>> We are using Apache Spark on a local machine.
>> 
>>  
>> 
>> I need to implement the scenario below.
>> 
>>  
>> 
>> In the initial load:
>> 
>> CRM application will send a file to a folder. This file contains customer 
>> information of all customers. This file is in a folder in the local server. 
>> File name is: customer.tsv
>> Customer.tsv contains customerid, country, birty_month, activation_date etc
>> I need to read the contents of customer.tsv.
>> I will add current timestamp info to the file.
>> I will transfer customer.tsv to the S3 bucket: customer.history.data
>>  
>> 
>> In the daily loads:
>> 
>>  CRM application will send a new file which contains the 
>> updated/deleted/inserted customer information.
>>   File name is daily_customer.tsv
>> 
>> Daily_customer.tsv contains contains customerid, cdc_field, country, 
>> birty_month, activation_date etc
>> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>> 
>> I need to read the contents of daily_customer.tsv.
>> I will add current timestamp info to the file.
>> I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
>> I need to merge two buckets customer.history.data and customer.daily.data.
>> Two buckets have timestamp fields. So I need to query all records whose 
>> timestamp is the last timestamp.
>> I can use row_number() over(partition by customer_id order by 
>> timestamp_field desc) as version_number
>> Then I can put the records whose version is one, to the final bucket: 
>> customer.dimension.data
>>  
>> 
>> I am running Spark on premise.
>> 
>> Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD on a 
>> local Spark cluster?
>> Is this approach efficient? Will the queries transfer all historical data 
>> from AWS S3 to the local cluster?
>> How can I implement this scenario in a more effective way? Like just 
>> transferring daily data to AWS S3 and then running queries on AWS.
>> For instance Athena can query on AWS. But it is just a query engine. As I 
>> know I can not call it by using an sdk and I can not write the results to a 
>> bucket/folder.
>>  
>> 
>> Thanks in advance,
>> 
>> Ömer
>> 
>>  
>> 
>>
>> 
>>  
>> 
>>  


Re: Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Divya Gehlot
Hi Omer ,
Here are couple of the solutions which you can implement for your use case
:
*Option 1 : *
you can mount the S3 bucket as local file system
Here are the details :
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
*Option 2 :*
 You can use Amazon Glue for your use case
here are the details :
https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/

*Option 3 :*
Store the file in the local file system and later push it s3 bucket
here are the details
https://stackoverflow.com/questions/48067979/simplest-way-to-fetch-the-file-from-ftp-server-on-prem-put-into-s3-bucket

Thanks,
Divya

On Tue, 23 Oct 2018 at 15:53,  wrote:

> Hi guys,
>
>
>
> We are using Apache Spark on a local machine.
>
>
>
> I need to implement the scenario below.
>
>
>
> In the initial load:
>
>1. CRM application will send a file to a folder. This file contains
>customer information of all customers. This file is in a folder in the
>local server. File name is: customer.tsv
>   1. Customer.tsv contains customerid, country, birty_month,
>   activation_date etc
>2. I need to read the contents of customer.tsv.
>3. I will add current timestamp info to the file.
>4. I will transfer customer.tsv to the S3 bucket: customer.history.data
>
>
>
> In the daily loads:
>
>1.  CRM application will send a new file which contains the
>updated/deleted/inserted customer information.
>
>   File name is daily_customer.tsv
>
>1. Daily_customer.tsv contains contains customerid, cdc_field,
>   country, birty_month, activation_date etc
>
> Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.
>
>1. I need to read the contents of daily_customer.tsv.
>2. I will add current timestamp info to the file.
>3. I will transfer daily_customer.tsv to the S3 bucket:
>customer.daily.data
>4. I need to merge two buckets customer.history.data and
>customer.daily.data.
>   1. Two buckets have timestamp fields. So I need to query all
>   records whose timestamp is the last timestamp.
>   2. I can use row_number() over(partition by customer_id order by
>   timestamp_field desc) as version_number
>   3. Then I can put the records whose version is one, to the final
>   bucket: customer.dimension.data
>
>
>
> I am running Spark on premise.
>
>- Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD
>on a local Spark cluster?
>- Is this approach efficient? Will the queries transfer all historical
>data from AWS S3 to the local cluster?
>- How can I implement this scenario in a more effective way? Like just
>transferring daily data to AWS S3 and then running queries on AWS.
>   - For instance Athena can query on AWS. But it is just a query
>   engine. As I know I can not call it by using an sdk and I can not write 
> the
>   results to a bucket/folder.
>
>
>
> Thanks in advance,
>
> Ömer
>
>
>
>
>
>
>
>
>


Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Omer.Ozsakarya
Hi guys,

We are using Apache Spark on a local machine.

I need to implement the scenario below.

In the initial load:

  1.  CRM application will send a file to a folder. This file contains customer 
information of all customers. This file is in a folder in the local server. 
File name is: customer.tsv
 *   Customer.tsv contains customerid, country, birty_month, 
activation_date etc
  2.  I need to read the contents of customer.tsv.
  3.  I will add current timestamp info to the file.
  4.  I will transfer customer.tsv to the S3 bucket: customer.history.data

In the daily loads:

  1.   CRM application will send a new file which contains the 
updated/deleted/inserted customer information.

  File name is daily_customer.tsv

 *   Daily_customer.tsv contains customerid, cdc_field, country, 
birty_month, activation_date etc

Cdc field can be New-Customer, Customer-is-Updated, Customer-is-Deleted.

  1.  I need to read the contents of daily_customer.tsv.
  2.  I will add current timestamp info to the file.
  3.  I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  4.  I need to merge two buckets customer.history.data and customer.daily.data.
 *   Both buckets have timestamp fields, so I need to query all records 
whose timestamp is the latest one.
 *   I can use row_number() over(partition by customer_id order by 
timestamp_field desc) as version_number
 *   Then I can put the records whose version is one into the final bucket: 
customer.dimension.data (see the sketch below)
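A minimal sketch of this merge step with Spark SQL window functions, assuming both buckets are written as TSV with the same columns plus the added load_timestamp (bucket paths and column names here are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("customer-merge").getOrCreate()

// s3a:// access assumes the hadoop-aws module and AWS credentials are configured.
val history = spark.read.option("sep", "\t").option("header", "true")
  .csv("s3a://customer.history.data/")
val daily = spark.read.option("sep", "\t").option("header", "true")
  .csv("s3a://customer.daily.data/")

// Keep only the latest record per customer, based on the load timestamp.
val w = Window.partitionBy("customerid").orderBy(col("load_timestamp").desc)
val latest = history.union(daily)
  .withColumn("version_number", row_number().over(w))
  .filter(col("version_number") === 1)
  .drop("version_number")

// Optionally drop customers whose latest change is a delete before writing the dimension.
latest.filter(col("cdc_field").isNull || col("cdc_field") =!= "Customer-is-Deleted")
  .write.mode("overwrite")
  .option("sep", "\t").option("header", "true")
  .csv("s3a://customer.dimension.data/")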

I am running Spark on premise.

  *   Can I query on AWS S3 buckets by using Spark Sql / Dataframe or RDD on a 
local Spark cluster?
  *   Is this approach efficient? Will the queries transfer all historical data 
from AWS S3 to the local cluster?
  *   How can I implement this scenario in a more effective way? Like just 
transferring daily data to AWS S3 and then running queries on AWS.
 *   For instance, Athena can run queries on AWS, but it is just a query 
engine. As far as I know, I cannot call it using an SDK, and I cannot write the 
results to a bucket/folder.

Thanks in advance,
Ömer






Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi,


Just a small plug: the Triangle Apache Spark Meetup (TASM) covers Raleigh, 
Durham, and Chapel Hill in North Carolina, USA. The group started back in July 
2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/

Can you add our meetup to http://spark.apache.org/community.html ?

jg




[ANNOUNCE] Announcing Apache Spark 2.3.2

2018-09-26 Thread Saisai Shao
We are happy to announce the availability of Spark 2.3.2!

Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.

To download Spark 2.3.2, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-2.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


Best regards
Saisai


Apache Spark and Airflow connection

2018-09-24 Thread Uğur Sopaoğlu
I have a Docker-based cluster. In my cluster, I try to schedule Spark jobs
by using Airflow. Airflow and Spark are running separately in *different
containers*. However, I cannot run a Spark job by using Airflow.

Below is my Airflow script:

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 31)}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

operator = SparkSubmitOperator(task_id='spark_submit_job',
                               conn_id='spark_default',
                               java_class='Main',
                               application='/SimpleSpark.jar',
                               name='airflow-spark-example',
                               dag=dag)

I also configured spark_default in the Airflow UI:

[image: Screenshot from 2018-09-24 12-00-46.png]


However, it produces the following error:

[Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

I think Airflow tries to run the Spark job on its own host. How can I
configure it so that it runs the Spark code on the Spark master?

-- 
Uğur Sopaoğlu


Padova Apache Spark Meetup

2018-09-05 Thread Matteo Durighetto
Hello,
we are creating a new meetup of enthusiastic Apache Spark users in Italy,
in Padova


https://www.meetup.com/Padova-Apache-Spark-Meetup/

Is it possible to add the meetup link to the web page
https://spark.apache.org/community.html ?

Moreover, is it possible to announce future events on this mailing list?


Kind Regards

Matteo Durighetto
e-mail: m.durighe...@miriade.it
supporto kandula :
database : support...@miriade.it
business intelligence : support...@miriade.it
infrastructure : supp...@miriade.it

M I R I A D E - P L A Y  T H E  C H A N G E

Via Castelletto 11, 36016 Thiene VI
Tel. 0445030111 - Fax 0445030100
Website: http://www.miriade.it/

<https://www.facebook.com/MiriadePlayTheChange/>
<https://www.linkedin.com/company/miriade-s-p-a-?trk=company_logo>
<https://plus.google.com/+MiriadeIt/about>

 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The information contained in this e-mail is intended for the person to whom
it was sent. In compliance with current laws and regulations, this e-mail
must not be made public since it may contain information of a strictly
confidential nature. Any person other than the addressee who receives it or
comes into possession of it is not authorised to read, distribute, forward
or copy it. If the reader is not the intended recipient, please notify the
sender immediately and then delete the message. Miriade declines all
responsibility for the incomplete or erroneous transmission of this e-mail
or for any delay in its receipt.


CVE-2018-11770: Apache Spark standalone master, Mesos REST APIs not controlled by authentication

2018-08-13 Thread Sean Owen
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
Spark versions from 1.3.0, running standalone master with REST API enabled,
or running Mesos master with cluster mode enabled

Description:
From version 1.3.0 onward, Spark's standalone master exposes a REST API for
job submission, in addition to the submission mechanism used by
spark-submit. In standalone, the config property
'spark.authenticate.secret' establishes a shared secret for authenticating
requests to submit jobs via spark-submit. However, the REST API does not
use this or any other authentication mechanism, and this is not adequately
documented. In this case, a user would be able to run a driver program
without authenticating, but not launch executors, using the REST API. This
REST API is also used by Mesos, when set up to run in cluster mode (i.e.,
when also running MesosClusterDispatcher), for job submission. Future
versions of Spark will improve documentation on these points, and prohibit
setting 'spark.authenticate.secret' when running the REST APIs, to make
this clear. Future versions will also disable the REST API by default in
the standalone master by changing the default value of
'spark.master.rest.enabled' to 'false'.

Mitigation:
For standalone masters, disable the REST API by setting
'spark.master.rest.enabled' to 'false' if it is unused, and/or ensure that
all network access to the REST API (port 6066 by default) is restricted to
hosts that are trusted to submit jobs. Mesos users can stop the
MesosClusterDispatcher, though that will prevent them from running jobs in
cluster mode. Alternatively, they can ensure access to the
MesosRestSubmissionServer (port 7077 by default) is restricted to trusted
hosts.

Credit:
Imran Rashid, Cloudera
Fengwei Zhang, Alibaba Cloud Security Team

Reference:
https://spark.apache.org/security.html


Data quality measurement for streaming data with apache spark

2018-08-01 Thread Uttam
Hello,

I have a very general question about Apache Spark. I want to know if it is
possible (and where to start, if it is) to implement a data quality
measurement prototype for streaming data using Apache Spark. Let's say I
want to work on timeliness or completeness as data quality metrics; has
similar work already been done using Spark? Are there other frameworks that are
better designed for this use case? 
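As a rough starting point, simple completeness and timeliness metrics can be expressed as ordinary Structured Streaming aggregations; the sketch below is only illustrative (the Kafka topic, the comma-separated value layout, and the 1-minute window are assumptions, and it needs the spark-sql-kafka package on the classpath):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dq-metrics").getOrCreate()
import spark.implicits._

// Assume each Kafka record value is "<event_time>,<amount>".
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS raw", "timestamp AS ingest_time")
  .withColumn("event_time", to_timestamp(split($"raw", ",").getItem(0)))
  .withColumn("amount", split($"raw", ",").getItem(1).cast("double"))

// Completeness: share of rows with a non-null amount per 1-minute window.
// Timeliness: average lag between ingestion time and the declared event time.
val metrics = events
  .withWatermark("ingest_time", "5 minutes")
  .groupBy(window($"ingest_time", "1 minute"))
  .agg(
    (sum(when($"amount".isNotNull, 1).otherwise(0)) / count(lit(1))).as("completeness"),
    avg(unix_timestamp($"ingest_time") - unix_timestamp($"event_time")).as("avg_lag_seconds"))

metrics.writeStream.outputMode("update").format("console").start()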



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache Spark Cluster

2018-07-23 Thread Uğur Sopaoğlu
We are trying to create a cluster which consists of 4 machines. The cluster
will be used by multiple users. How can we configure it so that users can
submit jobs from their personal computers, and is there any free tool you can
suggest to support this procedure?
-- 
Uğur Sopaoğlu


CVE-2018-8024 Apache Spark XSS vulnerability in UI

2018-07-11 Thread Sean Owen
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
Spark versions through 2.1.2
Spark 2.2.0 through 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's
possible for a malicious user to construct a URL pointing to a Spark
cluster's UI's job and stage info pages, and if a user can be tricked into
accessing the URL, can be used to cause script to execute and expose
information from the user's view of the Spark UI. While some browsers like
recent versions of Chrome and Safari are able to block this type of attack,
current versions of Firefox (and possibly others) do not.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer

Credit:
Spencer Gietzen, Rhino Security Labs

References:
https://spark.apache.org/security.html


CVE-2018-1334 Apache Spark local privilege escalation vulnerability

2018-07-11 Thread Sean Owen
Severity: High

Vendor: The Apache Software Foundation

Versions affected:
Spark versions through 2.1.2
Spark 2.2.0 to 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when
using PySpark or SparkR, it's possible for a different local user to
connect to the Spark application and impersonate the user running the Spark
application.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer
Otherwise, affected users should avoid using PySpark and SparkR in
multi-user environments.

Credit:
Nehmé Tohmé, Cloudera, Inc.

References:
https://spark.apache.org/security.html


[ANNOUNCE] Apache Spark 2.2.2

2018-07-10 Thread Tom Graves
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2 
maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade 
to this stable release. The release notes are available at 
http://spark.apache.org/releases/spark-release-2-2-2.html

To download Apache Spark 2.2.2 visit http://spark.apache.org/downloads.html. 
This version of Spark is also available on Maven and PyPI.
We would like to acknowledge all community members for contributing patches to 
this release.



[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3!

Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1
maintenance branch of Spark. We strongly recommend all 2.1.x users to
upgrade to this stable release. The release notes are available at
http://spark.apache.org/releases/spark-release-2-1-3.html

To download Apache Spark 2.1.3 visit http://spark.apache.org/downloads.html.
This version of Spark is also available on Maven and PyPI.

We would like to acknowledge all community members for contributing patches
to this release.

Special thanks to Marcelo Vanzin for making the release, I'm just handling
the last few details this time.


Apache Spark use case: correlate data strings from file

2018-06-20 Thread darkdrake
Hi, I’m new to Spark and I’m trying to understand if it can fit my use case.

I have the following scenario.
I have a file (it can be a log file, .txt, .csv, .xml or .json, I can
produce the data in whatever format I prefer) with some data, e.g.:
*Event “X”, City “Y”, Zone “Z”*

with different events, cities and zones. This data can be represented by
string (like the one I wrote) in a .txt, or by XML , CSV, or JSON, as I
wish. I can also send this data through TCP Socket, if I need it.

What I really want to do is to *correlate each single entry with other
similar entries by declaring rules*.
For example, I want to declare some rules on the data flow: if I receive
event X1 and event X2 in the same city and same zone, I want to do something
(execute a .bat script, write a log file, etc.). The same applies if I receive
the same string multiple times, or to whatever other rule I want to define over
these data strings.
I’m trying to understand if Apache Spark can fit my use case. The only input
data will be these strings from this file. Can I correlate these events and
how? Is there a GUI to do it? 

Any hints and advice will be appreciated.
Best regards,
Simone
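As one rough illustration of evaluating such a rule in batch mode (the file path, header layout, and event names are made up; a streaming variant would read the same data with readStream):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("event-rules").getOrCreate()

// CSV with a header line: event,city,zone
val events = spark.read.option("header", "true").csv("/data/events.csv")

// Rule: events X1 and X2 were both observed in the same city and zone.
val matches = events
  .groupBy("city", "zone")
  .agg(collect_set("event").as("events"))
  .filter(array_contains(col("events"), "X1") && array_contains(col("events"), "X2"))

// Each matching (city, zone) pair could trigger an external action here,
// e.g. writing a log line or invoking a script.
matches.collect().foreach { row =>
  println(s"Rule fired for city=${row.getAs[String]("city")}, zone=${row.getAs[String]("zone")}")
}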




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-11 Thread Marcelo Vanzin
We are happy to announce the availability of Spark 2.3.1!

Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend all 2.3.x users to
upgrade to this stable release.

To download Spark 2.3.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-1.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
If you are using the Kafka direct connect API, it might be committing offsets
back to Kafka itself.

On Thursday, 7 June 2018 at 4:10, licl wrote:

> I met the same issue and I have try to delete the checkpoint dir before the
> job ,
>
> But spark seems can read the correct offset  even though after the
> checkpoint dir is deleted ,
>
> I don't know how spark do this without checkpoint's metadata.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I hit the same issue, and I tried deleting the checkpoint dir before the job,
but Spark still seems to read the correct offset even after the checkpoint
dir is deleted.

I don't know how Spark does this without the checkpoint's metadata.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark Installation error

2018-05-31 Thread Irving Duran
You probably want to recognize "spark-shell" as a command in your
environment.  Maybe try "sudo ln -s /path/to/spark-shell
/usr/bin/spark-shell"  Have you tried "./spark-shell" in the current path
to see if it works?

Thank You,

Irving Duran


On Thu, May 31, 2018 at 9:00 AM Remil Mohanan  wrote:

> Hi there,
>
>I am not able to execute the spark-shell command. Can you please help.
>
> Thanks
>
> Remil
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Apache Spark is not working as expected

2018-05-30 Thread remil
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$ spark-shell
spark-shell: command not found
hadoopuser@sherin-VirtualBox:/usr/lib/spark/bin$  Spark.odt
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t9314/Spark.odt>  



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache spark on windows without shortnames enabled

2018-04-15 Thread ashwini
Hi,

We use Apache Spark 2.2.0 in our stack. Our software, like most other
software, is installed by default under "C:\Program Files\". We have a
restriction that we cannot ask our customers to enable short names on their
machines. From our experience, Spark does not handle absolute paths containing
whitespace well, neither when calling spark-class2.cmd from the command line
nor in the paths inside spark-env.cmd.

We have tried the following:
1. Using double quotes around the path.
2. Escaping the whitespaces.
3. Using relative paths. 

None of them have been successful in bringing up spark. How do you recommend
handling this?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache spark -2.1.0 question in Spark SQL

2018-04-03 Thread anbu
Please help me with the error below and suggest a different approach to this
data manipulation.

Error: Unable to find encoder for type stored in a Dataset. Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._ Support for serializing other types will be
added in future releases.

not enough arguments for method flatMap: (implicit evidence$8:
org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
unspecified value parameter evidence$8.

Code:

def getFlattenDataframe(someDataframe: DataFrame, spark: SparkSession): DataFrame = {

  val mymanualschema = new StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", StringType, true),
    StructField("field3", StringType, true),
    StructField("field4", IntegerType, true),
    StructField("field5", DoubleType, true)))

  // the error is reported on the next line
  val flattenRDD = someDataframe.flatMap { curRow: Row => getRows(curRow) }

  spark.createDataFrame(flattenRDD, mymanualschema)
}

def getRows(curRow: Row): Array[Row] = {
  val somefield = curRow.getAs[String]("field1")
  // ... some manipulation happens here and finally an array of rows is returned
  res
}

Could someone please help me understand what is causing the issue here? I have
tested importing spark.implicits._ and it does not help. How can I fix this
error, or what different approach should I take here?
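One way around the missing Row encoder, sketched below, is to do the flatMap on the underlying RDD and rebuild the DataFrame from the explicit schema; the getRows body here is only a placeholder for the poster's manipulation logic. An alternative is to keep the Dataset flatMap and put an implicit org.apache.spark.sql.catalyst.encoders.RowEncoder(mymanualschema) in scope.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

// Placeholder standing in for the real row-expansion logic.
def getRows(curRow: Row): Array[Row] =
  Array(Row(curRow.getAs[String]("field1"), null, null, null, null))

def getFlattenDataframe(someDataframe: DataFrame, spark: SparkSession): DataFrame = {
  val mymanualschema = StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", StringType, true),
    StructField("field3", StringType, true),
    StructField("field4", IntegerType, true),
    StructField("field5", DoubleType, true)))

  // Dataset.flatMap needs an implicit Encoder[Row]; RDD.flatMap does not,
  // so convert to an RDD, flatten, and rebuild the DataFrame with the schema.
  val flattenRdd = someDataframe.rdd.flatMap(curRow => getRows(curRow))
  spark.createDataFrame(flattenRdd, mymanualschema)
}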



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (keyed by user id) - which stores the state of the user and emits events every minute until an expire event is received
2. The next step is an aggregation (group by count)

I am using outputMode - Update.

I have a few questions:
1. If I don't use a watermark at all - (a) is the state for flatMapGroupsWithState stored forever? (b) is the state for the groupBy count stored forever?
2. Is the watermark applicable for cleaning up groupBy aggregates only?
3. Can we use the watermark to manage state in flatMapGroupsWithState? If so, how? (see the sketch below)
4. Can the watermark be used for other state cleanup - are there any examples of those?
Thanks
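On question 3, the usual hook is GroupStateTimeout.EventTimeTimeout together with a watermark; the sketch below is illustrative only (the Event/SessionState classes, column names, and timeout values are assumptions, and `events` stands in for the parsed Kafka stream). On questions 1 and 2: without a watermark, flatMapGroupsWithState state stays until the function removes it (or a processing-time timeout fires), and streaming aggregation state is kept indefinitely, which is exactly what the watermark is there to bound.

import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(userId: String, eventTime: Timestamp)
case class SessionState(count: Long)
case class SessionUpdate(userId: String, count: Long, expired: Boolean)

val spark = SparkSession.builder().appName("session-timeout").getOrCreate()
import spark.implicits._

val events: Dataset[Event] = ???   // stand-in for the Kafka source parsed into Event

def updateSessions(userId: String, rows: Iterator[Event],
                   state: GroupState[SessionState]): Iterator[SessionUpdate] = {
  if (state.hasTimedOut) {
    // The watermark has passed the timeout timestamp: emit a final record and drop the state.
    val finalCount = state.getOption.map(_.count).getOrElse(0L)
    state.remove()
    Iterator(SessionUpdate(userId, finalCount, expired = true))
  } else {
    val newCount = state.getOption.map(_.count).getOrElse(0L) + rows.size
    state.update(SessionState(newCount))
    // State for this key is cleaned up once the watermark moves past this timestamp.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 30 * 60 * 1000L)
    Iterator(SessionUpdate(userId, newCount, expired = false))
  }
}

val sessions = events
  .withWatermark("eventTime", "10 minutes")   // bounds how long late data and state are kept
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(updateSessions _)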


Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi:
I am using Spark structured streaming 2.2.1 with flatMapGroupsWithState and a
groupBy count operator.
In the StreamExecution logs I see two entries for stateOperators:
"stateOperators" : [ {
    "numRowsTotal" : 1617339,
    "numRowsUpdated" : 9647
  }, {
    "numRowsTotal" : 1326355,
    "numRowsUpdated" : 1398672
  } ],
My questions are:
1. Is there a way to figure out which stats are for flatMapGroupsWithState and which are for the groupBy count? In my case I can guess based on my data, but I want to be definitive about it.
2. For the second set of stats - how can numRowsTotal (1326355) be less than numRowsUpdated (1398672)?
If there is documentation I can use to understand this debug output, please let me know.

Thanks


Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi:
I am working on Spark structured streaming (2.2.1) with Kafka and want 100
executors to stay alive. I set spark.executor.instances to 100. The process
starts running with 100 executors, but after some time only a few remain, which
causes a backlog of events from Kafka.
I thought I saw a setting to keep the executors from being killed. However, I
am not able to find that configuration in the Spark docs. If anyone knows that
setting, please let me know.
Thanks
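If dynamic allocation is what is releasing the idle executors (it is typically enabled by default on EMR), one thing to try, sketched below with illustrative values, is to disable it and pin the executor count (or keep it enabled and raise spark.dynamicAllocation.minExecutors instead):

import org.apache.spark.sql.SparkSession

// Keep a fixed pool of executors instead of letting dynamic allocation
// release the ones that go idle between triggers.
val spark = SparkSession.builder()
  .appName("fixed-executors")
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "100")
  .getOrCreate()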


Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi:
I am working on a realtime application using spark structured streaming (v 
2.2.1). The application reads data from kafka and if there is a failure, I 
would like to ignore the checkpoint.  Is there any configuration to just read 
from last kafka offset after a failure and ignore any offset checkpoints ? 
Also, I believe that the checkpoint also saves state and will continue the
aggregations after recovery. Is there any way to ignore the checkpointed state?
And is there a way to selectively checkpoint only the offsets or only the state?

Thanks


Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint
directory that you provide. And when you start the query again with the
same directory it will just pick up where it left off.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
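For the original question of deliberately ignoring earlier progress, a common pattern (the paths and topic below are placeholders) is to start the query against a fresh checkpoint directory and let the Kafka source's startingOffsets option decide where to begin; note that startingOffsets only applies when there is no checkpoint to resume from:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-restart").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  // Only consulted when no checkpoint exists yet: "latest" skips the backlog.
  .option("startingOffsets", "latest")
  .load()

val query = stream
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  // Point at a new directory (or delete the old one) to discard saved offsets and state.
  .option("checkpointLocation", "/tmp/checkpoints/kafka-restart-v2")
  .start()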

On Thu, Mar 22, 2018 at 8:06 PM, M Singh <mans2si...@yahoo.com.invalid>
wrote:

> Hi:
>
> I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the
> last few days, after running the application for 30-60 minutes get
> exception from Kafka Consumer included below.
>
> The structured streaming application is processing 1 minute worth of data
> from kafka topic. So I've tried increasing request.timeout.ms from 4
> seconds default to 45000 seconds and receive.buffer.bytes to 1mb but still
> get the same exception.
>
> Is there any spark/kafka configuration that can save the offset and retry
> it next time rather than throwing an exception and killing the application.
>
> I've tried googling but have not found substantial
> solution/recommendation.  If anyone has any suggestions or a different
> version etc, please let me know.
>
> Thanks
>
> Here is the exception stack trace.
>
> java.util.concurrent.TimeoutException: Cannot fetch record for offset
> <offset#> in 120000 milliseconds
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$
> apache$spark$sql$kafka010$CachedKafkaConsumer$$
> fetchData(CachedKafkaConsumer.scala:219)
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(
> CachedKafkaConsumer.scala:117)
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(
> CachedKafkaConsumer.scala:106)
> at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(
> UninterruptibleThread.scala:85)
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer.
> runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
> at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(
> CachedKafkaConsumer.scala:106)
> at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.
> getNext(KafkaSourceRDD.scala:157)
> at
>


Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi:
I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR, and for the last
few days, after running the application for 30-60 minutes, I get the Kafka
consumer exception included below.

The structured streaming application processes 1 minute worth of data from the
Kafka topic. I've tried increasing request.timeout.ms from the 4 seconds
default to 45000 seconds and receive.buffer.bytes to 1 MB, but I still get the
same exception.
Is there any Spark/Kafka configuration that can save the offset and retry it
next time, rather than throwing an exception and killing the application?
I've tried googling but have not found a substantial solution/recommendation. If
anyone has any suggestions, a different version to try, etc., please let me know.
Thanks
Here is the exception stack trace.

java.util.concurrent.TimeoutException: Cannot fetch record for offset <offset#> in 120000 milliseconds
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:219)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
  at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
  at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
  at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
  at ...


Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay:
I am using spark-shell because I am still prototyping the steps involved.
Regarding executors - I have 280 executors and the UI only shows a few straggler
tasks on each trigger. The UI does not show too much time spent on GC. I
suspect the delay comes from fetching data from Kafka. The number of stragglers
is generally less than 5 out of 240 but is sometimes higher.

I will try to dig more into it and see if changing partitions etc helps but was 
wondering if anyone else has encountered similar stragglers holding up 
processing of a window trigger.
Thanks
 

On Friday, February 23, 2018 6:07 PM, vijay.bvp <bvpsa...@gmail.com> wrote:
 

 Instead of spark-shell, have you tried running it as a job?

How many executors and cores? Can you share the RDD graph and event timeline
from the UI, and did you find which of the tasks were taking more time and
whether there was any GC?

Please look at the UI if you have not already; it can provide a lot of information.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



   

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell, have you tried running it as a job?

How many executors and cores? Can you share the RDD graph and event timeline
from the UI, and did you find which of the tasks were taking more time and
whether there was any GC?

Please look at the UI if you have not already; it can provide a lot of information.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi:
I am working with spark structured streaming (2.2.1) reading data from Kafka 
(0.11).  

I need to aggregate data ingested every minute, and I am using spark-shell at
the moment. The message ingestion rate is approx 500k/second. During some
trigger intervals (1 minute), especially when the streaming process is started,
all tasks finish in 20 seconds, but during some triggers it takes 90 seconds.

I have tried reducing the number of partitions (to approx 100 from 300) to
reduce the number of Kafka consumers, but that has not helped. I also tried
setting kafkaConsumer.pollTimeoutMs to 30 seconds, but then I see a lot of
java.util.concurrent.TimeoutException: Cannot fetch record for offset.
So I wanted to see if anyone has any thoughts/recommendations.
Thanks




Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
Thanks Richard. I am hoping that the Spark team will, at some point, provide
more detailed documentation.
 

On Sunday, February 11, 2018 2:17 AM, Richard Qiao 
 wrote:
 

 I can't find a good source of documentation, but the source code of
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful for
answering some of them.
For example:
  inputRowsPerSecond = numRecords / inputTimeSec
  processedRowsPerSecond = numRecords / processingTimeSec
This explains the difference between the two rows-per-second figures.

On Feb 10, 2018, at 8:42 PM, M Singh  wrote:
Hi:
I am working with spark 2.2.0 and am looking at the query status console 
output.  

My application reads from kafka - performs flatMapGroupsWithState and then 
aggregates the elements for two group counts.  The output is send to console 
sink.  I see the following output  (with my questions in bold). 

Please me know where I can find detailed description of the query status fields 
for spark structured streaming ?


StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,    
  "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
shuffling/grouping ?
  "durationMs" : {
    "addBatch" : 9765,    // Is 
the time taken to get send output to all console output streams ? 
    "getBatch" : 3,   
// Is this time taken to get the batch from Kafka ?
    "getOffset" : 3,   
// Is this time for getting offset from Kafka ?
    "queryPlanning" : 89, // 
The value of this field changes with different triggers but the query is not 
changing so why does this change ?
    "triggerExecution" : 9898, // Is 
this total time for this trigger ?
    "walCommit" : 35 // Is 
this for checkpointing ?
  },
  "stateOperators" : [ {   // 
What are the two state operators ? I am assuming one is flatMapWthState (first 
one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,    //Is 
this the group by state operator ?  If so, I have two group by so why do I see 
only one ?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : {
  "xyz" : {
    "2" : 9183,
    "1" : 9184,
    "3" : 9184,
    "0" : 9183
  }
    },
    "endOffset" : {
  "xyz" : {
    "2" : 10628,
    "1" : 10629,
    "3" : 10629,
    "0" : 10628
  }
    },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}






   

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
I can't find a good source of documentation, but the source code of
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful for
answering some of them.

For example:
  inputRowsPerSecond = numRecords / inputTimeSec
  processedRowsPerSecond = numRecords / processingTimeSec
This explains the difference between the two rows-per-second figures.
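Plugging the numbers from the quoted status below into those formulas as a rough check: 5780 input rows at 96.33 rows/s implies an input window of about 5780 / 96.33 ≈ 60 s (the one-minute trigger interval), while 5780 / 583.96 ≈ 9.9 s of processing time, which lines up with the reported triggerExecution of 9898 ms.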

> On Feb 10, 2018, at 8:42 PM, M Singh  wrote:
> 
> Hi:
> 
> I am working with spark 2.2.0 and am looking at the query status console 
> output.  
> 
> My application reads from kafka - performs flatMapGroupsWithState and then 
> aggregates the elements for two group counts.  The output is send to console 
> sink.  I see the following output  (with my questions in bold). 
> 
> Please me know where I can find detailed description of the query status 
> fields for spark structured streaming ?
> 
> 
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
> processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
> shuffling/grouping ?
>   "durationMs" : {
> "addBatch" : 9765,// 
> Is the time taken to get send output to all console output streams ? 
> "getBatch" : 3,   
> // Is this time taken to get the batch from Kafka ?
> "getOffset" : 3,  
>  // Is this time for getting offset from Kafka ?
> "queryPlanning" : 89, // 
> The value of this field changes with different triggers but the query is not 
> changing so why does this change ?
> "triggerExecution" : 9898, // Is 
> this total time for this trigger ?
> "walCommit" : 35 // 
> Is this for checkpointing ?
>   },
>   "stateOperators" : [ {   // 
> What are the two state operators ? I am assuming one is flatMapWthState 
> (first one).
> "numRowsTotal" : 8,
> "numRowsUpdated" : 1
>   }, {
> "numRowsTotal" : 6,//Is 
> this the group by state operator ?  If so, I have two group by so why do I 
> see only one ?
> "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
> "description" : "KafkaSource[Subscribe[xyz]]",
> "startOffset" : {
>   "xyz" : {
> "2" : 9183,
> "1" : 9184,
> "3" : 9184,
> "0" : 9183
>   }
> },
> "endOffset" : {
>   "xyz" : {
> "2" : 10628,
> "1" : 10629,
> "3" : 10629,
> "0" : 10628
>   }
> },
> "numInputRows" : 5780,
> "inputRowsPerSecond" : 96.32851690748795,
> "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
> "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
> 
> 



Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state 
in structured streaming.
Thanks
 

On Thursday, February 8, 2018 2:58 PM, M Singh 
 wrote:
 

 Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to 
update the state periodically.
Here is the scenario:
1. I have a udf with a variable with default value (eg: 1)  This value is 
applied to a column (eg: subtract the variable from the column value )2. The 
variable is to be updated periodically asynchronously (eg: reading a file every 
5 minutes) and the new rows will have the new value applied to the column value.
Spark natively supports broadcast variables, but I could not find a way to 
update the broadcasted variables dynamically or rebroadcast them once so that 
the udf internal state can be updated while the structure streaming application 
is running.
I can try to read the variable from the file on each invocation of the udf but 
it will not scale since each invocation open/read/close the file.
Please let me know if there is any documentation/example to support this 
scenario.
Thanks





   

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi:
I am working with spark 2.2.0 and am looking at the query status console 
output.  

My application reads from Kafka, performs flatMapGroupsWithState, and then
aggregates the elements for two group counts. The output is sent to the console
sink. I see the following output (with my questions inline as comments).

Please let me know where I can find a detailed description of the query status
fields for Spark structured streaming.


StreamExecution: Streaming query made progress: {
  "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
  "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
  "name" : null,
  "timestamp" : "2018-02-11T01:18:00.005Z",
  "numInputRows" : 5780,
  "inputRowsPerSecond" : 96.32851690748795,    
  "processedRowsPerSecond" : 583.9563548191554,   // Why is the number of 
processedRowsPerSecond greater than inputRowsPerSecond ? Does this include 
shuffling/grouping ?
  "durationMs" : {
    "addBatch" : 9765,    // Is 
the time taken to get send output to all console output streams ? 
    "getBatch" : 3,   
// Is this time taken to get the batch from Kafka ?
    "getOffset" : 3,   
// Is this time for getting offset from Kafka ?
    "queryPlanning" : 89, // 
The value of this field changes with different triggers but the query is not 
changing so why does this change ?
    "triggerExecution" : 9898, // Is 
this total time for this trigger ?
    "walCommit" : 35 // Is 
this for checkpointing ?
  },
  "stateOperators" : [ {   // 
What are the two state operators ? I am assuming one is flatMapWthState (first 
one).
    "numRowsTotal" : 8,
    "numRowsUpdated" : 1
  }, {
    "numRowsTotal" : 6,    //Is 
this the group by state operator ?  If so, I have two group by so why do I see 
only one ?
    "numRowsUpdated" : 6
  } ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[xyz]]",
    "startOffset" : {
  "xyz" : {
    "2" : 9183,
    "1" : 9184,
    "3" : 9184,
    "0" : 9183
  }
    },
    "endOffset" : {
  "xyz" : {
    "2" : 10628,
    "1" : 10629,
    "3" : 10629,
    "0" : 10628
  }
    },
    "numInputRows" : 5780,
    "inputRowsPerSecond" : 96.32851690748795,
    "processedRowsPerSecond" : 583.9563548191554
  } ],
  "sink" : {
    "description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
  }
}




Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to 
update the state periodically.
Here is the scenario:
1. I have a udf with a variable with a default value (eg: 1). This value is applied to a column (eg: subtract the variable from the column value).
2. The variable is to be updated periodically and asynchronously (eg: by reading a file every 5 minutes), and the new rows will have the new value applied to the column value.
Spark natively supports broadcast variables, but I could not find a way to 
update the broadcasted variables dynamically or rebroadcast them once so that 
the udf internal state can be updated while the structured streaming application
is running.
I could try to read the variable from the file on each invocation of the udf, but
that will not scale since each invocation opens/reads/closes the file.
Please let me know if there is any documentation/example to support this 
scenario.
Thanks
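One workaround that avoids both rebroadcasting and a file read per row, sketched below with an illustrative path and refresh interval, is a per-executor holder that re-reads the file only when its cached copy is older than the refresh interval (this assumes the file is readable at the same path on every executor, e.g. on a shared filesystem):

import org.apache.spark.sql.functions.udf
import scala.io.Source

// Per-executor cache: the file is re-read at most once per refresh interval.
object RefreshableValue {
  private var cached: Double = 1.0
  private var lastLoadMs: Long = 0L
  private val refreshIntervalMs = 5 * 60 * 1000L

  def current(path: String): Double = synchronized {
    val now = System.currentTimeMillis()
    if (now - lastLoadMs > refreshIntervalMs) {
      val src = Source.fromFile(path)
      try cached = src.getLines().next().trim.toDouble finally src.close()
      lastLoadMs = now
    }
    cached
  }
}

// UDF that subtracts the (periodically refreshed) value from a column,
// e.g. df.withColumn("adjusted", adjust(col("amount"))).
val adjust = udf { (value: Double) =>
  value - RefreshableValue.current("/mnt/shared/offset.txt")
}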





Free access to Index Conf for Apache Spark community attendees

2018-02-08 Thread xwu0226
Free access to Index Conf for Apache Spark session attendees. For info go to:
https://www.meetup.com/SF-Big-Analytic

IBM is hosting a developer conference - Essentially the conference is ‘By
Developers, for Developers’ based on Open technologies.
This will be held Feb 20 - 22nd in Moscone West.
http://indexconf.com

We have a phenomenal list of speakers for the Spark community as well as
participation by the Tensorflow, R, and other communities.
https://developer.ibm.com/indexconf/communities/ Register using this Promo
Code to get your free access.
Usage Instructions:
Follow this link:
https://www.ibm.com/events/wwe/indexconf/indexconf18.nsf/Registration.xsp?open
Select “Attendee” as your Registration Type
Enter your Registration Promotion Code*: IND18FULL

*Restrictions:
Promotion Code expires February 12th, 11:59PM Pacific
Government Owned Entities (GOE’s) not eligible

*If you have previously registered, please reach out to Jeff Borek
(jbo...@us.ibm.com) to take advantage of the new discount code.*

Detailed Agenda for Spark Community Day Feb 20th

2:00 - 2:30
What the community does in the coming release Spark 2.3.
Sean Li, Apache Spark committer & PMC member from Databricks

There are many great features added to Apache Spark. This talk is to provide
a preview of the new features and updates in the coming release Spark 2.3.

2:30 - 3:00
Data Warehouse Features in Spark SQL
Ioana Ursu, IBM Lead contributor on SparkSQL

This talk covers advanced Spark SQL features for data warehouse such as
star-schema optimizations and informational constraints support. Star-schema
consists of a fact table referencing a number of dimension tables. Fact and
dimension tables are in a primary key – foreign key relationship. An
informational or statistical constraint can be used by Spark to improve
query performance.

3:00- 3:30
Building an Enterprise/Cloud Analytics Platform with Jupyter Notebooks and
Apache Spark
Frederick Reiss Chief architect of IBM Spark Technology Center

Data Scientists are becoming a necessity of every company in the
data-centric world of today, and with them comes the requirement to make
available a flexible and interactive analytics platform that exposes
Notebook services at web scale. In this session we will describe our
experience and best practices building the platform, in particular how we
built the Enterprise Gateway that enables all the Notebooks to share the
Spark cluster computational resources.

3:45-4:15
The State of Spark MLlib and New Scalability Features in 2.3
Nick Pentreath, Spark committer & PMC member

This talk will give an overview of Spark’s machine learning library, MLlib.
The new 2.3 release of Spark brings some exciting scalability enhancements
to MLlib, which we will explore in depth, including parallel
cross-validation and performance improvements for larger-scale datasets
through adding multi-column support to the most widely-used Spark
transformers.

4:15 - 5:15
Spark and AI
Nick Pentreath & Fred Reiss

This session will be an open discussion of the role of Spark within the AI
landscape, and what the future holds for AI / deep learning on Spark. In
recent years specialized systems (such as TensorFlow, Caffe, PyTorch and
MXNet) have been dominant in the domain of AI and deep learning. While there
are a few deep learning frameworks that are Spark specific, often these
frameworks are separate from Spark and the ease of integration and feature
set exposed varies considerably.

5:15 - 6:00
HSpark: enable Spark SQL query on NoSQL Hbase tables
Bo Meng, IBM Spark contributor; Yan Zhou, IBM Hadoop Architect

HBase is a NoSQL data source which allows flexible data storage and access mechanisms.
While leveraging Spark’s high scalable framework and programming interface,
we added SQL capability to HBase and an easy of use interface for data
scientists and traditional analysts. We will discuss how we implement HSpark
by leveraging Spark SQL parser, mapping different data types, pushing down
the predicates to HBase and improving the query performance



-
Xin Wu | @xwu0226
Spark Technology Center
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Hi Jacek:
Thanks for your response.
I am just trying to understand the fundamentals of watermarking and how it 
behaves in aggregation vs non-aggregation scenarios.
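For the non-aggregation case raised further down the thread, a processing-time cut-off can stand in for a watermark when the goal is just to drop records that lag too far behind; a small sketch (the 10-minute threshold and the eventTime column are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// df stands in for the streaming DataFrame with a timestamp column "eventTime".
// current_timestamp() is evaluated per micro-batch, so this is a processing-time
// lateness cut-off rather than a true event-time watermark.
def dropLateRecords(df: DataFrame): DataFrame =
  df.filter(col("eventTime") >= current_timestamp() - expr("INTERVAL 10 MINUTES"))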



 

On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski  
wrote:
 

 Hi,
What would you expect? The data is simply dropped as that's the purpose of 
watermarking it. That's my understanding at least.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Feb 5, 2018 at 8:11 PM, M Singh  wrote:

Just checking if anyone has more details on how watermark works in cases where 
event time is earlier than processing time stamp. 

On Friday, February 2, 2018 8:47 AM, M Singh  wrote:
 

 Hi Vishnu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.

Vishnu - the Spark documentation
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
does indicate that it can dedup using the watermark. So I believe there are more
use cases for the watermark, and that is what I am trying to find.
I am hoping that TD can clarify or point me to the documentation.
Thanks
 

On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath 
 wrote:
 

 Hi Mans,
The watermark in Spark is used to decide when to clear the state, so if the
event is delayed beyond the point when the state is cleared by Spark, it will be
ignored. I recently wrote a blog post on this:
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this state is applicable for aggregation only. If you have only a map
function and don't want to process late records, you could do a filter based on
the EventTime field, but I guess you will have to compare it with the processing
time since there is no API for the user to access the watermark.
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh  wrote:

Hi:
I am trying to filter out records which are lagging behind (based on event 
time) by a certain amount of time.  
Is the watermark api applicable to this scenario (ie, filtering lagging 
records) or it is only applicable with aggregation ?  I could not get a clear 
understanding from the documentation which only refers to it's usage with 
aggregation.

Thanks
Mans



   

   



   

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi,

What would you expect? The data is simply dropped as that's the purpose of
watermarking it. That's my understanding at least.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Feb 5, 2018 at 8:11 PM, M Singh  wrote:

> Just checking if anyone has more details on how watermark works in cases
> where event time is earlier than processing time stamp.
>
>
> On Friday, February 2, 2018 8:47 AM, M Singh  wrote:
>
>
> Hi Vishu/Jacek:
>
> Thanks for your responses.
>
> Jacek - At the moment, the current time for my use case is processing time.
>
> Vishnu - Spark documentation (https://spark.apache.org/
> docs/latest/structured-streaming-programming-guide.html) does indicate
> that it can dedup using watermark.  So I believe there are more use cases
> for watermark and that is what I am trying to find.
>
> I am hoping that TD can clarify or point me to the documentation.
>
> Thanks
>
>
> On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>
> Hi Mans,
>
> Watermark in Spark is used to decide when to clear the state, so if the
> event is delayed more than when the state is cleared by Spark, then it will
> be ignored.
> I recently wrote a blog post on this : http://vishnuviswanath.com/
> spark_structured_streaming.html#watermark
>
> Yes, this State is applicable for aggregation only. If you are having only
> a map function and don't want to process it, you could do a filter based on
> its EventTime field, but I guess you will have to compare it with the
> processing time since there is no API to access Watermark by the user.
>
> -Vishnu
>
> On Fri, Jan 26, 2018 at 1:14 PM, M Singh 
> wrote:
>
> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark api applicable to this scenario (ie, filtering lagging
> records) or it is only applicable with aggregation ?  I could not get a
> clear understanding from the documentation which only refers to it's usage
> with aggregation.
>
> Thanks
>
> Mans
>
>
>
>
>
>
>


Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how watermark works in cases where 
event time is earlier than processing time stamp. 

On Friday, February 2, 2018 8:47 AM, M Singh  wrote:
 

 Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.

Vishnu - Spark documentation 
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
 does indicate that it can dedup using watermark.  So I believe there are more 
use cases for watermark and that is what I am trying to find.
I am hoping that TD can clarify or point me to the documentation.
Thanks
 

On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath 
 wrote:
 

 Hi Mans,
Watermark in Spark is used to decide when to clear the state, so if the event is 
delayed more than when the state is cleared by Spark, then it will be ignored. I 
recently wrote a blog post on this : 
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this State is applicable for aggregation only. If you are having only a 
map function and don't want to process it, you could do a filter based on its 
EventTime field, but I guess you will have to compare it with the processing 
time since there is no API to access Watermark by the user. 
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh  wrote:

Hi:
I am trying to filter out records which are lagging behind (based on event 
time) by a certain amount of time.  
Is the watermark api applicable to this scenario (ie, filtering lagging 
records) or it is only applicable with aggregation ?  I could not get a clear 
understanding from the documentation which only refers to it's usage with 
aggregation.

Thanks
Mans



   

   

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
Hi TD:
Just wondering if you have any insight for me or need more info.
Thanks 

On Thursday, February 1, 2018 7:43 AM, M Singh 
<mans2si...@yahoo.com.INVALID> wrote:
 

 Hi TD:
Here is the udpated code with explain and full stack trace.
Please let me know what could be the issue and what to look for in the explain 
output.
Updated code:
import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession.builder
      .config("spark.sql.streaming.checkpointLocation", "./checkpointes")
      .appName("StreamingTest").master("local[4]")
    val spark = sparkBuilder.getOrCreate()
    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp().cast("long"))
    dataframe2.explain(true)
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}
Explain output:
== Parsed Logical Plan ==
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- AnalysisBarrier
      +- Project [id#0, visit#1]
         +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Analyzed Logical Plan ==
id: string, visit: string, cts: bigint
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- Project [id#0, visit#1]
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Optimized Logical Plan ==
Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Physical Plan ==
*(1) Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation FileSource[./data/], [id#0, visit#1]
Here is the exception:
18/02/01 07:39:52 ERROR MicroBatchExecution: Query [id = a0e573f0-e93b-48d9-989c-1aaa73539b58, runId = b5c618cb-30c7-4eff-8f09-ea1d064878ae] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:448)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:134)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
  at org.apache.spark.sql.execution.st

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is processing time.

Vishnu - Spark documentation 
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
 does indicate that it can dedup using watermark.  So I believe there are more 
use cases for watermark and that is what I am trying to find.
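For what it's worth, a minimal sketch of the dedup-with-watermark pattern that guide describes (the `events` DataFrame and the `id`/`eventTime` column names are illustrative assumptions):

// Minimal sketch of streaming dedup bounded by a watermark, per the programming guide.
// Assumes a streaming DataFrame `events` with columns `id` and `eventTime` (TimestampType).
import org.apache.spark.sql.DataFrame

def dedupWithWatermark(events: DataFrame): DataFrame =
  events
    .withWatermark("eventTime", "10 minutes")  // lets Spark drop dedup state older than the watermark
    .dropDuplicates("id", "eventTime")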
I am hoping that TD can clarify or point me to the documentation.
Thanks
 

On Wednesday, January 31, 2018 6:37 AM, Vishnu Viswanath 
 wrote:
 

 Hi Mans,
Watermark in Spark is used to decide when to clear the state, so if the event is 
delayed more than when the state is cleared by Spark, then it will be ignored. I 
recently wrote a blog post on this : 
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this State is applicable for aggregation only. If you are having only a 
map function and don't want to process it, you could do a filter based on its 
EventTime field, but I guess you will have to compare it with the processing 
time since there is no API to access Watermark by the user. 
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh  wrote:

Hi:
I am trying to filter out records which are lagging behind (based on event 
time) by a certain amount of time.  
Is the watermark api applicable to this scenario (ie, filtering lagging 
records) or it is only applicable with aggregation ?  I could not get a clear 
understanding from the documentation which only refers to it's usage with 
aggregation.

Thanks
Mans



   

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
Hi TD:
Here is the udpated code with explain and full stack trace.
Please let me know what could be the issue and what to look for in the explain 
output.
Updated code:
import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession.builder
      .config("spark.sql.streaming.checkpointLocation", "./checkpointes")
      .appName("StreamingTest").master("local[4]")
    val spark = sparkBuilder.getOrCreate()
    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp().cast("long"))
    dataframe2.explain(true)
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}
Explain output:
== Parsed Logical Plan ==
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- AnalysisBarrier
      +- Project [id#0, visit#1]
         +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Analyzed Logical Plan ==
id: string, visit: string, cts: bigint
Project [id#0, visit#1, cast(current_timestamp() as bigint) AS cts#6L]
+- Project [id#0, visit#1]
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Optimized Logical Plan ==
Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@3c74aa0d,csv,List(),Some(StructType(StructField(id,StringType,false), StructField(visit,StringType,false))),List(),None,Map(sep ->  , path -> ./data/),None), FileSource[./data/], [id#0, visit#1]

== Physical Plan ==
*(1) Project [id#0, visit#1, 1517499591 AS cts#6L]
+- StreamingRelation FileSource[./data/], [id#0, visit#1]
Here is the exception:
18/02/01 07:39:52 ERROR MicroBatchExecution: Query [id = a0e573f0-e93b-48d9-989c-1aaa73539b58, runId = b5c618cb-30c7-4eff-8f09-ea1d064878ae] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:157)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:448)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:134)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:122)
  at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
  at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
  at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfu

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread Tathagata Das
Could you give the full stack trace of the exception?

Also, can you do `dataframe2.explain(true)` and show us the plan output?



On Wed, Jan 31, 2018 at 3:35 PM, M Singh 
wrote:

> Hi Folks:
>
> I have to add a column to a structured *streaming* dataframe but when I
> do that (using select or withColumn) I get an exception.  I can add a
> column in structured *non-streaming* structured dataframe. I could not
> find any documentation on how to do this in the following doc  [
> https://spark.apache.org/docs/latest/
> *structured-streaming-programming-guide*.html]
>
> I am using spark 2.4.0-SNAPSHOT
>
> Please let me know what I could be missing.
>
> Thanks for your help.
>
> (I am also attaching the source code for the structured streaming,
> structured non-streaming classes and input file with this email)
>
> 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
> to dataType on unresolved object, tree: 'cts
> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(
> unresolved.scala:105)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(
> StructType.scala:435)
> 
>
> Here is the input file (in the ./data directory) - note tokens are
> separated by '\t'
>
> 1 v1
> 2 v1
> 2 v2
> 3 v3
> 3 v1
>
> Here is the code with dataframe (*non-streaming*) which works:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> object StructuredTest {
>   def main(args:Array[String]) : Unit = {
> val sparkBuilder = SparkSession
>   .builder.
>   appName("StreamingTest").master("local[4]")
>
> val spark = sparkBuilder.getOrCreate()
>
> val schema = StructType(
> Array(
>   StructField("id", StringType, false),
>   StructField("visit", StringType, false)
>   ))
> var dataframe = spark.read.option("sep","\t").schema(schema).csv(
> "./data/")
> var dataframe2 = dataframe.select(expr("*"), current_timestamp().as(
> "cts"))
> dataframe2.show(false)
> spark.stop()
>
>   }
> }
>
> Output of the above code is:
>
> +---+-+---+
> |id |visit|cts|
> +---+-+---+
> |1  |v1   |2018-01-31 15:07:00.758|
> |2  |v1   |2018-01-31 15:07:00.758|
> |2  |v2   |2018-01-31 15:07:00.758|
> |3  |v3   |2018-01-31 15:07:00.758|
> |3  |v1   |2018-01-31 15:07:00.758|
> +---+-+---+
>
>
> Here is the code with *structured streaming* which throws the exception:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.joda.time._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.streaming._
> import org.apache.log4j._
>
> object StreamingTest {
>   def main(args:Array[String]) : Unit = {
> val sparkBuilder = SparkSession
>   .builder.
>   config("spark.sql.streaming.checkpointLocation", "./checkpointes").
>   appName("StreamingTest").master("local[4]")
>
> val spark = sparkBuilder.getOrCreate()
>
> val schema = StructType(
> Array(
>   StructField("id", StringType, false),
>   StructField("visit", StringType, false)
>   ))
> var dataframeInput = spark.readStream.option("sep","\t"
> ).schema(schema).csv("./data/")
> var dataframe2 = dataframeInput.select("*")
> dataframe2 = dataframe2.withColumn("cts", current_timestamp())
> val query = dataframe2.writeStream.option("trucate","false").format("
> console").start
> query.awaitTermination()
>   }
> }
>
> Here is the exception:
>
> 18/01/31 15:10:25 ERROR MicroBatchExecution: Query [id =
> 0fe655de-9096-4d69-b6a5-c593400d2eba, runId = 
> 2394a402-dd52-49b4-854e-cb46684bf4d8]
> terminated with error
> *org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
> to dataType on unresolved object, tree: 'cts*
> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(
> unresolved.scala:105)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(
> StructType.scala:435)
> at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(
> StructType.scala:435)
>
> I've also used snippets (shown in bold below) from (
> https://docs.databricks.com/spark/latest/structured-
> streaming/examples.html)
> but still get the same exception:
>
> Here is the code:
>
> import scala.collection.immutable
> import org.apache.spark.sql.functions._
> import org.joda.time._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.streaming._
> import org.apache.log4j._
>
> object StreamingTest {
>   def main(args:Array[String]) : Unit = {
> val sparkBuilder = SparkSession
>   .builder.
>   config("spark.sql.streaming.checkpointLocation", "./checkpointes").
>   appName("StreamingTest").master("local[4]")
>
> val 

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks:
I have to add a column to a structured streaming dataframe but when I do that 
(using select or withColumn) I get an exception.  I can add a column in 
structured non-streaming structured dataframe. I could not find any 
documentation on how to do this in the following doc  
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
I am using spark 2.4.0-SNAPSHOT
Please let me know what I could be missing.

Thanks for your help.
(I am also attaching the source code for the structured streaming, structured 
non-streaming classes and input file with this email)
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
Here is the input file (in the ./data directory) - note tokens are separated by 
'\t'
1	v1
2	v1
2	v2
3	v3
3	v1
Here is the code with dataframe (non-streaming) which works:
import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object StructuredTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession.builder
      .appName("StreamingTest").master("local[4]")
    val spark = sparkBuilder.getOrCreate()
    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)))
    var dataframe = spark.read.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframe.select(expr("*"), current_timestamp().as("cts"))
    dataframe2.show(false)
    spark.stop()
  }
}
Output of the above code is:
+---+-----+-----------------------+
|id |visit|cts                    |
+---+-----+-----------------------+
|1  |v1   |2018-01-31 15:07:00.758|
|2  |v1   |2018-01-31 15:07:00.758|
|2  |v2   |2018-01-31 15:07:00.758|
|3  |v3   |2018-01-31 15:07:00.758|
|3  |v1   |2018-01-31 15:07:00.758|
+---+-----+-----------------------+

Here is the code with structured streaming which throws the exception:
import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession.builder
      .config("spark.sql.streaming.checkpointLocation", "./checkpointes")
      .appName("StreamingTest").master("local[4]")
    val spark = sparkBuilder.getOrCreate()
    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select("*")
    dataframe2 = dataframe2.withColumn("cts", current_timestamp())
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}
Here is the exception:
18/01/31 15:10:25 ERROR MicroBatchExecution: Query [id = 0fe655de-9096-4d69-b6a5-c593400d2eba, runId = 2394a402-dd52-49b4-854e-cb46684bf4d8] terminated with error
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'cts
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
  at org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435)
I've also used snippets (shown in bold below) from 
(https://docs.databricks.com/spark/latest/structured-streaming/examples.html) but 
still get the same exception:
Here is the code:
import scala.collection.immutable
import org.apache.spark.sql.functions._
import org.joda.time._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming._
import org.apache.log4j._

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkBuilder = SparkSession.builder
      .config("spark.sql.streaming.checkpointLocation", "./checkpointes")
      .appName("StreamingTest").master("local[4]")
    val spark = sparkBuilder.getOrCreate()
    val schema = StructType(Array(
      StructField("id", StringType, false),
      StructField("visit", StringType, false)))
    var dataframeInput = spark.readStream.option("sep", "\t").schema(schema).csv("./data/")
    var dataframe2 = dataframeInput.select(
      current_timestamp().cast("timestamp").alias("timestamp"),
      expr("*"))
    val query = dataframe2.writeStream.option("truncate", "false").format("console").start
    query.awaitTermination()
  }
}

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans,

Watermark in Spark is used to decide when to clear the state, so if the
event is delayed more than when the state is cleared by Spark, then it will
be ignored.
I recently wrote a blog post on this :
http://vishnuviswanath.com/spark_structured_streaming.html#watermark

Yes, this State is applicable for aggregation only. If you are having only
a map function and don't want to process it, you could do a filter based on
its EventTime field, but I guess you will have to compare it with the
processing time since there is no API to access Watermark by the user.

-Vishnu
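A minimal sketch of the workaround Vishnu describes, assuming a column named `eventTime` and an illustrative 10 minute threshold (this is a plain filter against processing time, not the watermark API):

// Minimal sketch: drop records whose event time lags the processing time by more
// than a threshold. The column name `eventTime` and the 10 minute lag are assumptions.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

def dropLaggingRecords(df: DataFrame): DataFrame =
  df.filter(expr("eventTime >= current_timestamp() - interval 10 minutes"))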

On Fri, Jan 26, 2018 at 1:14 PM, M Singh 
wrote:

> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark api applicable to this scenario (ie, filtering lagging
> records) or it is only applicable with aggregation ?  I could not get a
> clear understanding from the documentation which only refers to it's usage
> with aggregation.
>
> Thanks
>
> Mans
>


Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Dongjoon Hyun
Hi, Nicolas.

Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901
(Feature parity for ORC with Parquet).
For your questions, the following three are related.

1. spark.sql.orc.impl="native"
By default, `native` ORC implementation (based on the latest ORC 1.4.1)
is added.
The old one is `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
By default, `native` ORC implementation uses Vectorized Reader code
path if possible.
Please note that `Vectorization(Parquet/ORC) in Apache Spark` is only
supported only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
Like Parquet, by default, Hive tables are converted into file-based
data sources to use Vectorization technique.

Bests,
Dongjoon.
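For illustration, a minimal sketch of a session applying the three settings above (the path and table name are placeholders); both read styles from the question should then go through the native, vectorized path for simple types:

// Minimal sketch, assuming Spark 2.3: the three settings above, applied at session build time.
// "/path/to/orc" and "my_orc_table_in_hive" are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NativeOrcExample")
  .config("spark.sql.orc.impl", "native")                  // ORC 1.4.x based implementation
  .config("spark.sql.orc.enableVectorizedReader", "true")  // vectorized reads for simple types
  .config("spark.sql.hive.convertMetastoreOrc", "true")    // convert Hive ORC tables to the file-based source
  .enableHiveSupport()
  .getOrCreate()

val fromFiles = spark.read.format("orc").load("/path/to/orc")    // case 1
val fromHive  = spark.sql("SELECT * FROM my_orc_table_in_hive")  // case 2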



On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
>
> ?
>
>
> Le 10 janv. 2018 à 20:14, Dongjoon Hyun écrivait :
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now, Spark can read ORC files faster
> without
> > feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/
> f44ba910f58083458e1133502e193a
> > 9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized
> orc
> > reader
> > https://github.com/apache/spark/commit/
> eaac60a1e20e29084b7151ffca964c
> > faa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in
> ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/
> src/test/scala/org/
> > apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
>


Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Nicolas Paris
Hi

Thanks for this work.

Will this affect both:
1) spark.read.format("orc").load("...")
2) spark.sql("select ... from my_orc_table_in_hive")

?


Le 10 janv. 2018 à 20:14, Dongjoon Hyun écrivait :
> Hi, All.
> 
> Vectorized ORC Reader is now supported in Apache Spark 2.3.
> 
>     https://issues.apache.org/jira/browse/SPARK-16060
> 
> It has been a long journey. From now, Spark can read ORC files faster without
> feature penalty.
> 
> Thank you for all your support, especially Wenchen Fan.
> 
> It's done by two commits.
> 
>     [SPARK-16060][SQL] Support Vectorized ORC Reader
>     https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a
> 9d6f2bf766
> 
>     [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
> reader
>     https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964c
> faa5ba99d1
> 
> Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC`
> to `Native ORC Vectorized`.
> 
>     https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/
> apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> 
> Thank you.
> 
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
Hi,

I'm curious how would you do the requirement "by a certain amount of time"
without a watermark? How would you know what's current and compute the lag?
Let's forget about watermark for a moment and see if it pops up as an
inevitable feature :)

"I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time."

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Fri, Jan 26, 2018 at 7:14 PM, M Singh 
wrote:

> Hi:
>
> I am trying to filter out records which are lagging behind (based on event
> time) by a certain amount of time.
>
> Is the watermark api applicable to this scenario (ie, filtering lagging
> records) or it is only applicable with aggregation ?  I could not get a
> clear understanding from the documentation which only refers to it's usage
> with aggregation.
>
> Thanks
>
> Mans
>


Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi:
I am trying to filter out records which are lagging behind (based on event 
time) by a certain amount of time.  
Is the watermark api applicable to this scenario (ie, filtering lagging 
records) or it is only applicable with aggregation ?  I could not get a clear 
understanding from the documentation which only refers to it's usage with 
aggregation.

Thanks
Mans
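For reference, the aggregation usage that the documentation does cover looks roughly like this (a minimal sketch; `events`, `eventTime`, and the interval values are illustrative assumptions):

// Minimal sketch of watermarking with a windowed aggregation, the case the guide documents.
// `events`, `eventTime`, and the interval values are illustrative assumptions.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.window

def windowedCounts(events: DataFrame): DataFrame =
  events
    .withWatermark("eventTime", "10 minutes")               // late data beyond 10 minutes is dropped
    .groupBy(window(events.col("eventTime"), "5 minutes"))  // tumbling 5 minute windows
    .count()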

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
Thanks TD.  When is 2.3 scheduled for release?

On Thursday, January 25, 2018 11:32 PM, Tathagata Das  
wrote:
 

 Hello Mans,
The streaming DataSource APIs are still evolving and are not public yet. Hence 
there is no official documentation. In fact, there is a new DataSourceV2 API 
(in Spark 2.3) that we are migrating towards. So at this point of time, it's 
hard to make any concrete suggestion. You can take a look at the classes 
DataSourceV2, DataReader, MicroBatchDataReader in the spark source code, along 
with their implementations.
Hope this helps. 
TD

On Jan 25, 2018 8:36 PM, "M Singh"  wrote:

Hi:
I am trying to create a custom structured streaming source and would like to 
know if there is any example or documentation on the steps involved.
I've looked at the some methods available in the SparkSession but these are 
internal to the sql package:
  private[sql] def internalCreateDataFrame(
      catalystRows: RDD[InternalRow],
      schema: StructType,
      isStreaming: Boolean = false): DataFrame = {
    // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
    // schema differs from the existing schema on any field data type.
    val logicalPlan = LogicalRDD(
      schema.toAttributes,
      catalystRows,
      isStreaming = isStreaming)(self)
    Dataset.ofRows(self, logicalPlan)
  }
Please let me know where I can find the appropriate API or documentation.
Thanks
Mans



   

Re: Apache Spark - Custom structured streaming data source

2018-01-25 Thread Tathagata Das
Hello Mans,

The streaming DataSource APIs are still evolving and are not public yet.
Hence there is no official documentation. In fact, there is a new
DataSourceV2 API (in Spark 2.3) that we are migrating towards. So at this
point of time, it's hard to make any concrete suggestion. You can take a
look at the classes DataSourceV2, DataReader, MicroBatchDataReader in the
spark source code, along with their implementations.

Hope this helps.

TD

On Jan 25, 2018 8:36 PM, "M Singh"  wrote:

Hi:

I am trying to create a custom structured streaming source and would like
to know if there is any example or documentation on the steps involved.

I've looked at the some methods available in the SparkSession but these are
internal to the sql package:

  *private**[sql]* def internalCreateDataFrame(
  catalystRows: RDD[InternalRow],
  schema: StructType,
  isStreaming: Boolean = false): DataFrame = {
// TODO: use MutableProjection when rowRDD is another DataFrame and the
applied
// schema differs from the existing schema on any field data type.
val logicalPlan = LogicalRDD(
  schema.toAttributes,
  catalystRows,
  isStreaming = isStreaming)(self)
Dataset.ofRows(self, logicalPlan)
  }

Please let me know where I can find the appropriate API or documentation.

Thanks

Mans


Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi:
I am trying to create a custom structured streaming source and would like to 
know if there is any example or documentation on the steps involved.
I've looked at the some methods available in the SparkSession but these are 
internal to the sql package:
  private[sql] def internalCreateDataFrame(
      catalystRows: RDD[InternalRow],
      schema: StructType,
      isStreaming: Boolean = false): DataFrame = {
    // TODO: use MutableProjection when rowRDD is another DataFrame and the applied
    // schema differs from the existing schema on any field data type.
    val logicalPlan = LogicalRDD(
      schema.toAttributes,
      catalystRows,
      isStreaming = isStreaming)(self)
    Dataset.ofRows(self, logicalPlan)
  }
Please let me know where I can find the appropriate API or documentation.
Thanks
Mans

Re: good material to learn apache spark

2018-01-18 Thread Marco Mistroni
Jacek Laskowski on this mail list wrote a book which is available
online.
Hth

On Jan 18, 2018 6:16 AM, "Manuel Sopena Ballesteros" <
manuel...@garvan.org.au> wrote:

> Dear Spark community,
>
>
>
> I would like to learn more about apache spark. I have a Horton works HDP
> platform and have ran a few spark jobs in a cluster but now I need to know
> more in depth how spark works.
>
>
>
> My main interest is sys admin and operational point of Spark and it’s
> ecosystem.
>
>
>
> Is there any material?
>
>
>
> Thank you very much
>
>
>
> *Manuel Sopena Ballesteros *| Big data Engineer
> *Garvan Institute of Medical Research *
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> <https://maps.google.com/?q=370+Victoria+Street,+Darlinghurst,+NSW+2010=gmail=g>
> *T:* + 61 (0)2 9355 5760 <+61%202%209355%205760> | *F:* +61 (0)2 9295 8507
> <+61%202%209295%208507> | *E:* manuel...@garvan.org.au
>
>
> NOTICE
> Please consider the environment before printing this email. This message
> and any attachments are intended for the addressee named and may contain
> legally privileged/confidential/copyright information. If you are not the
> intended recipient, you should not read, use, disclose, copy or distribute
> this communication. If you have received this message in error please
> notify us at once by return email and then delete both messages. We accept
> no liability for the distribution of viruses or similar in electronic
> communications. This notice should not be removed.
>


good material to learn apache spark

2018-01-17 Thread Manuel Sopena Ballesteros
Dear Spark community,

I would like to learn more about Apache Spark. I have a Hortonworks HDP 
platform and have run a few Spark jobs in a cluster, but now I need to know more 
in depth how Spark works.

My main interest is the sys admin and operational side of Spark and its ecosystem.

Is there any material?

Thank you very much

Manuel Sopena Ballesteros | Big data Engineer
Garvan Institute of Medical Research
The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: 
manuel...@garvan.org.au<mailto:manuel...@garvan.org.au>

NOTICE
Please consider the environment before printing this email. This message and 
any attachments are intended for the addressee named and may contain legally 
privileged/confidential/copyright information. If you are not the intended 
recipient, you should not read, use, disclose, copy or distribute this 
communication. If you have received this message in error please notify us at 
once by return email and then delete both messages. We accept no liability for 
the distribution of viruses or similar in electronic communications. This 
notice should not be removed.


Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now, Spark can read ORC files faster
without feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a
9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964c
faa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/
src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop-3.x is not part of the release and sign off for 2.2.1.

Maybe we could update the website to avoid any confusion with "later".


From: Josh Rosen <joshro...@databricks.com>
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because 
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for 
Hadoop 3.x) has not been resolved. There are also likely to be transitive 
dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu 
<akshaynaid...@gmail.com<mailto:akshaynaid...@gmail.com>> wrote:
yes , spark download page does mention that 2.2.1 is for 'hadoop-2.7 and 
later', but my confusion is because spark was released on 1st dec and hadoop-3 
stable version released on 13th Dec. And  to my similar question on 
stackoverflow.com<https://stackoverflow.com/questions/47920005/how-is-hadoop-3-0-0-s-compatibility-with-older-versions-of-hive-pig-sqoop-and>
 , Mr. jacek-laskowski<https://stackoverflow.com/users/1305344/jacek-laskowski> 
replied that spark-2.2.1 doesn't support hadoop-3. so I am just looking for 
more clarity on this doubt before moving on to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao 
<sai.sai.s...@gmail.com<mailto:sai.sai.s...@gmail.com>> wrote:
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is 
not clear whether it is supported or not (or has some issues). I think in the 
download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it 
supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya 
<raj...@gmail.com<mailto:raj...@gmail.com>>:
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option 
to select package type. In that, there is an option to select  "Pre-Built for 
Apache Hadoop 2.7 and later". I am assuming it means that it does support 
Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
<akshaynaid...@gmail.com<mailto:akshaynaid...@gmail.com>> wrote:
hello Users,
I need to know whether we can run latest spark on  latest hadoop version i.e., 
spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
thanks.





Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x
because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive
shims for Hadoop 3.x) has not been resolved. There are also likely to be
transitive dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu  wrote:

> yes , spark download page does mention that 2.2.1 is for 'hadoop-2.7 and
> later', but my confusion is because spark was released on 1st dec and
> hadoop-3 stable version released on 13th Dec. And  to my similar question
> on stackoverflow.com
> 
> , Mr. jacek-laskowski
>  replied that
> spark-2.2.1 doesn't support hadoop-3. so I am just looking for more clarity
> on this doubt before moving on to upgrades.
>
> Thanks all for help.
>
> Akshay.
>
> On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao 
> wrote:
>
>> AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
>> is not clear whether it is supported or not (or has some issues). I think
>> in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
>> means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
>>
>> Thanks
>> Jerry
>>
>> 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :
>>
>>> Hi Akshay
>>>
>>> On the Spark Download page when you select Spark 2.2.1 it gives you an
>>> option to select package type. In that, there is an option to select
>>> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
>>> does support Hadoop 3.0.
>>>
>>> http://spark.apache.org/downloads.html
>>>
>>> Thanks,
>>> Raj A.
>>>
>>> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
>>> wrote:
>>>
 hello Users,
 I need to know whether we can run latest spark on  latest hadoop
 version i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on
 13th dec.
 thanks.

>>>
>>>
>>
>


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread akshay naidu
yes , spark download page does mention that 2.2.1 is for 'hadoop-2.7 and
later', but my confusion is because spark was released on 1st dec and
hadoop-3 stable version released on 13th Dec. And  to my similar question
on stackoverflow.com

, Mr. jacek-laskowski
 replied that
spark-2.2.1 doesn't support hadoop-3. so I am just looking for more clarity
on this doubt before moving on to upgrades.

Thanks all for help.
Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao  wrote:

> AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
> is not clear whether it is supported or not (or has some issues). I think
> in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
> means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).
>
> Thanks
> Jerry
>
> 2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :
>
>> Hi Akshay
>>
>> On the Spark Download page when you select Spark 2.2.1 it gives you an
>> option to select package type. In that, there is an option to select
>> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
>> does support Hadoop 3.0.
>>
>> http://spark.apache.org/downloads.html
>>
>> Thanks,
>> Raj A.
>>
>> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
>> wrote:
>>
>>> hello Users,
>>> I need to know whether we can run latest spark on  latest hadoop version
>>> i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
>>> thanks.
>>>
>>
>>
>


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Saisai Shao
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it
is not clear whether it is supported or not (or has some issues). I think
in the download page "Pre-Built for Apache Hadoop 2.7 and later" mostly
means that it supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya :

> Hi Akshay
>
> On the Spark Download page when you select Spark 2.2.1 it gives you an
> option to select package type. In that, there is an option to select
> "Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
> does support Hadoop 3.0.
>
> http://spark.apache.org/downloads.html
>
> Thanks,
> Raj A.
>
> On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
> wrote:
>
>> hello Users,
>> I need to know whether we can run latest spark on  latest hadoop version
>> i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
>> thanks.
>>
>
>


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-07 Thread Raj Adyanthaya
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an
option to select package type. In that, there is an option to select
"Pre-Built for Apache Hadoop 2.7 and later". I am assuming it means that it
does support Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
wrote:

> hello Users,
> I need to know whether we can run latest spark on  latest hadoop version
> i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
> thanks.
>


Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-06 Thread akshay naidu
hello Users,
I need to know whether we can run latest spark on  latest hadoop version
i.e., spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
thanks.


Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
Hi Jacek:

The javadoc mentions that we can only consume data from the data frame in the 
addBatch method.  So, if I would like to save the data to a new sink then I 
believe that I will need to collect the data and then save it.  This is the 
reason I am asking about how to control the size of the data in each invocation 
of the addBatch method.  Let me know if I am interpreting the javadoc 
incorrectly.  Here it is:
  /**
   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if
   * this method is called more than once with the same batchId (which will happen in the case of
   * failures), then `data` should only be added once.
   *
   * Note 1: You cannot apply any operators on `data` except consuming it (e.g., `collect/foreach`).
   * Otherwise, you may get a wrong result.
   *
   * Note 2: The method is supposed to be executed synchronously, i.e. the method should only return
   * after data is consumed by sink successfully.
   */
  def addBatch(batchId: Long, data: DataFrame): Unit


Thanks
Mans  
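To make the contract concrete, here is a minimal sketch of a Sink that consumes each batch without collecting it to the driver. Note that Sink (org.apache.spark.sql.execution.streaming.Sink) is an internal, evolving API, and the `LoggingSink` name and its batchId bookkeeping are illustrative assumptions, not an official example:

// Minimal sketch of the addBatch contract quoted above: consume `data` (foreach rather
// than collect, to avoid pulling a large batch onto the driver) and add each batchId once.
// Sink is an internal API; this class and its bookkeeping are illustrative only.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class LoggingSink extends Sink {
  @volatile private var lastCommittedBatchId: Long = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = synchronized {
    if (batchId > lastCommittedBatchId) {   // a re-delivered batch (after a failure) is skipped
      data.foreach(row => println(row))     // consume the data; replace with a real write
      lastCommittedBatchId = batchId
    }
  }
}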

On Thursday, January 4, 2018 2:19 PM, Jacek Laskowski  
wrote:
 

 Hi,
> If the data is very large then a collect may result in OOM.
That's a general case even in any part of Spark, incl. Spark Structured 
Streaming. Why would you collect in addBatch? It's on the driver side and as 
anything on the driver, it's a single JVM (and usually not fault tolerant)
> Do you have any other suggestion/recommendation ?
What's wrong with the current solution? I don't think you should change how you 
do things currently. You should just avoid collect on large datasets (which you 
have to do anywhere in Spark).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, Jan 4, 2018 at 10:49 PM, M Singh  wrote:

Thanks Tathagata for your answer.
The reason I was asking about controlling data size is that the javadoc 
indicate you can use foreach or collect on the dataframe.  If the data is very 
large then a collect may result in OOM.
From your answer it appears that the only way to control the size (in 2.2) 
would be control the trigger interval. However, in my case, I have to dedup 
the elements in one minute interval, which I am using a trigger interval and 
cannot reduce it.  Do you have any other suggestion/recommendation ?
Also, do you have any timeline for the availability of DataSourceV2/Spark 2.3 ?
Thanks again. 

On Wednesday, January 3, 2018 2:27 PM, Tathagata Das 
 wrote:
 

 1. It is all the result data in that trigger. Note that it takes a DataFrame 
which is a purely logical representation of data and has no association with 
partitions, etc. which are physical representations.
2. If you want to limit the amount of data that is processed in a trigger, then 
you should either control the trigger interval or use the rate limit options on 
sources that support it (e.g. for kafka, you can use the option 
"maxOffsetsPerTrigger", see the guide).
Related note, these APIs are subject to change. In fact in the upcoming release 
2.3, we are adding a DataSource V2 API for batch/microbatch-streaming/ 
continuous-streaming sources and sinks.
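A minimal sketch of the Kafka rate-limit option mentioned above (the broker address, topic name, and the cap value are placeholders):

// Minimal sketch: cap how many Kafka offsets are consumed per trigger.
// Broker address, topic name, and the cap value are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KafkaRateLimit").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")  // rate limit per trigger, as described above
  .load()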
On Wed, Jan 3, 2018 at 11:23 PM, M Singh  wrote:

Hi:
The documentation for Sink.addBatch is as follows:
  /**
   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if
   * this method is called more than once with the same batchId (which will happen in the case of
   * failures), then `data` should only be added once.
   *
   * Note 1: You cannot apply any operators on `data` except consuming it (e.g., `collect/foreach`).
   * Otherwise, you may get a wrong result.
   *
   * Note 2: The method is supposed to be executed synchronously, i.e. the method should only return
   * after data is consumed by sink successfully.
   */
  def addBatch(batchId: Long, data: DataFrame): Unit

A few questions about the data in each DataFrame passed as the argument to addBatch:
1. Is it all the data in a partition for each trigger or is it all the data in that trigger?
2. Is there a way to control the size in each addBatch invocation to make sure that we don't run into OOM exception on the executor while calling collect?
Thanks



   



   

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
Hi,

> If the data is very large then a collect may result in OOM.

That's a general case even in any part of Spark, incl. Spark Structured
Streaming. Why would you collect in addBatch? It's on the driver side and
as anything on the driver, it's a single JVM (and usually not fault
tolerant)

> Do you have any other suggestion/recommendation ?

What's wrong with the current solution? I don't think you should change how
you do things currently. You should just avoid collect on large datasets
(which you have to do anywhere in Spark).

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Thu, Jan 4, 2018 at 10:49 PM, M Singh 
wrote:

> Thanks Tathagata for your answer.
>
> The reason I was asking about controlling data size is that the javadoc
> indicate you can use foreach or collect on the dataframe.  If the data is
> very large then a collect may result in OOM.
>
> From your answer it appears that the only way to control the size (in 2.2)
> would be control the trigger interval. However, in my case, I have to dedup
> the elements in one minute interval, which I am using a trigger interval
> and cannot reduce it.  Do you have any other suggestion/recommendation ?
>
> Also, do you have any timeline for the availability of DataSourceV2/Spark
> 2.3 ?
>
> Thanks again.
>
>
> On Wednesday, January 3, 2018 2:27 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> 1. It is all the result data in that trigger. Note that it takes a
> DataFrame which is a purely logical representation of data and has no
> association with partitions, etc. which are physical representations.
>
> 2. If you want to limit the amount of data that is processed in a trigger,
> then you should either control the trigger interval or use the rate limit
> options on sources that support it (e.g. for kafka, you can use the option
> "maxOffsetsPerTrigger", see the guide
> 
> ).
>
> Related note, these APIs are subject to change. In fact in the upcoming
> release 2.3, we are adding a DataSource V2 API for
> batch/microbatch-streaming/continuous-streaming sources and sinks.
>
> On Wed, Jan 3, 2018 at 11:23 PM, M Singh 
> wrote:
>
> Hi:
>
> The documentation for Sink.addBatch is as follows:
>
>   /**
>* Adds a batch of data to this sink. The data for a given `batchId` is
> deterministic and if
>* this method is called more than once with the same batchId (which
> will happen in the case of
>* failures), then `data` should only be added once.
>*
>* Note 1: You cannot apply any operators on `data` except consuming it
> (e.g., `collect/foreach`).
>* Otherwise, you may get a wrong result.
>*
>* Note 2: The method is supposed to be executed synchronously, i.e.
> the method should only return
>* after data is consumed by sink successfully.
>*/
>   def addBatch(batchId: Long, data: DataFrame): Unit
>
> A few questions about the data is each DataFrame passed as the argument to
> addBatch -
> 1. Is it all the data in a partition for each trigger or is it all the
> data in that trigger ?
> 2. Is there a way to control the size in each addBatch invocation to make
> sure that we don't run into OOM exception on the executor while calling
> collect ?
>
> Thanks
>
>
>
>
>

