[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year, September 20-21. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark. Information on submitting an abstract is at

Spark 2.1.0 and Hive 2.1.1

2017-05-03 Thread Lohith Samaga M
Hi, Good day. My setup:
1. Single node Hadoop 2.7.3 on Ubuntu 16.04.
2. Hive 2.1.1 with metastore in MySQL.
3. Spark 2.1.0 configured using hive-site.xml to use the MySQL metastore.
4. The VERSION table contains SCHEMA_VERSION = 2.1.0 Hive
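
For reference, a minimal Scala sketch of a session wired to such a metastore (the thrift URI is a placeholder; with hive-site.xml in $SPARK_HOME/conf the explicit .config() line is unnecessary):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: a SparkSession that talks to an existing Hive metastore.
    // "thrift://localhost:9083" is a placeholder URI; normally hive-site.xml
    // on the classpath supplies this setting.
    val spark = SparkSession.builder()
      .appName("hive-metastore-check")
      .master("local[*]")
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()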

Re: Spark books

2017-05-03 Thread Pushkar.Gujar
*"I would suggest do not buy any book, just start with databricks community edition"* I dont agree with above , "Learning Spark" book was definitely stepping stone for me. All the basics that one beginner can/will need is covered in very easy to understand format with examples. Great book!

Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming, Jacek also has a really good online Spark book for Spark 2, "Mastering Apache Spark". I found it very helpful when trying to understand Spark 2's encoders. His book is here: https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details On Wed, May 3, 2017 at 8:16 PM, Neelesh

What is the correct JSON parameter format used to to submit Spark2 apps with YARN REST API?

2017-05-03 Thread Kun Liu
Hi folks, I am trying to submit a spark app via YARN REST API, by following this tutorial from Hortonworks: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html . Here is the general flow: GET a new app ID, then POST a new app with the ID and
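
For context, a rough Scala sketch of the submission body's shape (field names follow the Hadoop ResourceManager REST API docs; the AM command, jar path and application id below are placeholders, not a verified working Spark submission):

    // Rough sketch: after requesting an id from /ws/v1/cluster/apps/new-application,
    // a body of roughly this shape is POSTed to http://<rm-host>:8088/ws/v1/cluster/apps.
    // Field names follow the Hadoop RM REST API docs; the AM command and all
    // paths/ids here are placeholders.
    val submitJson = """{
      "application-id": "application_1493800575189_0001",
      "application-name": "spark-rest-example",
      "application-type": "SPARK",
      "am-container-spec": {
        "commands": {
          "command": "{{JAVA_HOME}}/bin/java org.apache.spark.deploy.yarn.ApplicationMaster --class my.Main --jar hdfs:///apps/my-app.jar 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"
        }
      },
      "resource": { "memory": 1024, "vCores": 1 },
      "unmanaged-AM": false,
      "max-app-attempts": 2
    }"""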

Re: Spark books

2017-05-03 Thread Neelesh Salian
The Apache Spark documentation is a good place to begin with, particularly all the programming guides. On Wed, May 3, 2017 at 5:07 PM, ayan guha wrote: > I would suggest do not buy any book, just start with databricks community edition > On Thu, May 4, 2017 at 9:30 AM, Tobi

Re: Spark books

2017-05-03 Thread ayan guha
I would suggest do not buy any book, just start with databricks community edition On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede wrote: > Well that is the nature of technology, ever evolving. There will always be new concepts. If you're trying to get started ASAP and the

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
> if I do dataset.select("nonExistentColumn") then the Analysis Error is thrown at compile time right?

If you do df.as[MyClass].map(_.badFieldName) you will get a compile error. However, if df doesn't have the right columns for MyClass, that error will only be thrown at runtime (whether DF
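
A minimal Scala illustration of the two failure points (MyClass and the column names are made up for the example):

    import org.apache.spark.sql.SparkSession

    case class MyClass(name: String, age: Int)

    val spark = SparkSession.builder().appName("analysis-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 30)).toDF("name", "age")
    df.as[MyClass].map(_.age + 1).show()   // fine: schema matches, field access checked by scalac
    // df.as[MyClass].map(_.badField)      // would not compile at all

    val wrong = Seq(("Ann", 30)).toDF("name", "years")
    // wrong.as[MyClass]                   // compiles, but throws AnalysisException at runtime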

Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
Yes, you will have to recreate the streaming DataFrame along with the static DataFrame, and restart the query. There isn't currently a feasible way to do this without a query restart. But restarting a query WITHOUT restarting the whole application + Spark cluster is reasonably fast. If your
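
A sketch of that restart pattern (Structured Streaming, Scala; paths, schema and options are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQuery
    import org.apache.spark.sql.types._

    // Sketch: rebuild both the static and the streaming DataFrame, then start
    // a fresh query. All paths and the schema are placeholders; assumes an
    // existing SparkSession `spark`.
    val acctSchema = new StructType().add("account", StringType)

    def startQuery(spark: SparkSession): StreamingQuery = {
      val static    = spark.read.parquet("/data/blacklist")               // latest static snapshot
      val streaming = spark.readStream.schema(acctSchema).json("/in/accounts")
      streaming.join(static, Seq("account"))
        .writeStream
        .format("parquet")
        .option("path", "/out/joined")
        .option("checkpointLocation", "/chk/joined")
        .start()
    }

    var query = startQuery(spark)
    // later, when the static data changes:
    query.stop()
    query = startQuery(spark)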

Re: Spark books

2017-05-03 Thread Tobi Bosede
Well that is the nature of technology, ever evolving. There will always be new concepts. If you're trying to get started ASAP and the internet isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of production stacks are still on that version and the knowledge from mastering 1.6 is

Re: Refreshing a persisted RDD

2017-05-03 Thread Lalwani, Jayesh
Thanks, TD, for answering this question on the Spark mailing list. A follow-up: if we are joining a cached dataframe with a streaming dataframe and we recreate the cached dataframe, do we have to recreate the streaming dataframe too? One possible solution that we have is val

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-03 Thread JayeshLalwani
You need to understand how join works to make sense of it. Logically, a join does a Cartesian product of the 2 tables, and then filters the rows that satisfy the contains UDF. So, let's say you have Input Allen Armstrong nishanth hemanth Allen shivu Armstrong nishanth shree shivu DeWALT
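
A compact way to see those semantics (Scala sketch; the contains-style UDF below stands in for the one in the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("join-semantics").master("local[*]").getOrCreate()
    import spark.implicits._

    val names    = Seq("Allen Armstrong", "shivu Armstrong").toDF("sentence")
    val synonyms = Seq("Armstrong", "Allen").toDF("word")

    // Stand-in for the thread's UDF: true when the sentence contains the word.
    val containsWord = udf((sentence: String, word: String) => sentence.contains(word))

    // Logically a Cartesian product filtered by the UDF: every sentence is
    // paired with every word, and only pairs where the UDF is true survive.
    // This is why one sentence can match, and be emitted, several times.
    names.crossJoin(synonyms)
      .filter(containsWord($"sentence", $"word"))
      .show(false)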

Re: Spark-SQL collect function

2017-05-03 Thread JayeshLalwani
In any distributed application, you scale up by splitting execution up on multiple machines. The way Spark does this is by slicing the data into partitions and spreading them on multiple machines. Logically, an RDD is exactly that: data is split up and spread around on multiple machines. When you
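
A tiny Scala illustration of the slicing:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitions-demo").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // The same logical dataset, sliced into 4 partitions that would live on
    // (and be processed by) different machines in a real cluster.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    println(rdd.getNumPartitions)                         // 4

    // Each partition is processed independently; here we count per partition.
    rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
       .collect()
       .foreach(println)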

Spark books

2017-05-03 Thread Zeming Yu
I'm trying to decide whether to buy the books "Learning Spark", "Spark for Machine Learning" etc., or wait for a new edition covering the new concepts like DataFrames and Datasets. Anyone got any suggestions?

Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
If you want to always get the latest data in files, it's best to always recreate the DataFrame. On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani wrote: > We have a Structured Streaming application that gets accounts from Kafka into a streaming data frame. We have

Re: In-order processing using spark streaming

2017-05-03 Thread JayeshLalwani
Option A: If you can get all the messages in a session into the same Spark partition, you can use df.mapPartitions to process the whole partition. This will allow you to control the order in which the messages are processed within the partition. This will work if messages are posted in Kafka in
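
A rough Scala sketch of Option A (field names like sessionId and ts are placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    // Sketch: group a session's messages into one partition, then impose an
    // order inside each partition before processing. Field names are placeholders.
    case class Msg(sessionId: String, ts: Long, payload: String)

    val spark = SparkSession.builder().appName("in-order").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val msgs = sc.parallelize(Seq(
      Msg("s1", 2, "b"), Msg("s1", 1, "a"), Msg("s2", 1, "x")))

    msgs
      .keyBy(_.sessionId)
      .partitionBy(new HashPartitioner(4))   // all of a session's messages land together
      .mapPartitions { it =>
        // Within a partition we can sort by timestamp and process in order.
        it.toSeq.sortBy { case (_, m) => (m.sessionId, m.ts) }
          .map { case (_, m) => s"${m.sessionId}:${m.payload}" }
          .iterator
      }
      .collect()
      .foreach(println)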

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread kant kodali
Got it! So if I do dataset.select("nonExistentColumn") then the Analysis Error is thrown at compile time, right? But what if I have a dataset in memory and I go to a database server and execute a delete-column query? Will the dataset object be in sync with the underlying database table? Thanks!

Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
An analysis exception occurs whenever the Scala/Java/Python program is valid, but the dataframe operations being performed are not. For example, df.select("nonExistentColumn") would throw an analysis exception. On Wed, May 3, 2017 at 1:38 PM, kant kodali wrote: > Hi All, >
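
Concretely (Scala; note the exception surfaces when Spark analyzes the plan at runtime, not from the compiler):

    import org.apache.spark.sql.{AnalysisException, SparkSession}

    val spark = SparkSession.builder().appName("analysis-exception").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 30)).toDF("name", "age")

    // This compiles fine: column names are just strings to scalac.
    // Spark only discovers the problem when it analyzes the query plan.
    try df.select("nonExistentColumn").show()
    catch {
      case e: AnalysisException => println(s"Analysis error: ${e.getMessage}")
    }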

What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread kant kodali
Hi All, I understand the compile-time errors this blog is talking about, but I don't understand what Analysis Errors are. Any examples? Thanks!

Re: [Spark Streaming] - Killing application from within code

2017-05-03 Thread Tathagata Das
There isn't a clean programmatic way to kill the application running in the driver from the executor. You will have to set up an additional RPC mechanism to explicitly send a signal from the executors to the application/driver to quit. On Wed, May 3, 2017 at 8:44 AM, Sidney Feiner
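
One possible stand-in for that signaling mechanism, sketched below: executors drop a marker on storage both sides can reach, and a driver-side thread polls for it (the path is a placeholder; a database flag or a socket would work the same way):

    import java.nio.file.{Files, Paths}

    // Sketch only: executors write a marker to shared storage when the fatal
    // condition occurs; a monitor thread on the driver polls for it and shuts
    // the application down. Assumes an existing SparkSession `spark`.
    val killMarker = Paths.get("/shared/kill-switch")

    // Inside a task, on an executor, when the fatal condition is detected:
    //   Files.createFile(killMarker)

    // On the driver, after starting the streams:
    new Thread(new Runnable {
      override def run(): Unit = {
        while (!Files.exists(killMarker)) Thread.sleep(1000)
        spark.stop()
        sys.exit(1)
      }
    }).start()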

[Spark Streaming] - Killing application from within code

2017-05-03 Thread Sidney Feiner
Hey, I'm using connections to Elasticsearch from within my Spark Streaming application. I'm using Futures to maximize performance when it sends network requests to the ES cluster. Basically, I want my app to crash if any one of the executors fails to connect to ES. The exception gets caught

Refreshing a persisted RDD

2017-05-03 Thread JayeshLalwani
We have a Structured Streaming application that gets accounts from Kafka into a streaming data frame. We have a blacklist of accounts stored in S3 and we want to filter out all the accounts that are blacklisted. So, we are loading the blacklisted accounts into a batch data frame and joining it
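
For the filter itself, a left_anti stream-static join is one compact formulation (sketch; paths, schema and the column name are placeholders, and anti joins on streaming DataFrames may require a newer Spark than 2.1):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    // Sketch of the setup described above; all paths, the schema and the
    // "account" column name are placeholders.
    val spark = SparkSession.builder().appName("blacklist-filter").master("local[*]").getOrCreate()

    val schema    = new StructType().add("account", StringType)
    val accounts  = spark.readStream.schema(schema).json("/in/accounts")  // stand-in for the Kafka source
    val blacklist = spark.read.json("/s3/blacklist")                      // batch DataFrame from S3

    // left_anti keeps only streaming rows with NO match in the blacklist.
    val allowed = accounts.join(blacklist, Seq("account"), "left_anti")

    allowed.writeStream.format("console").start()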

Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran
> On 30 Apr 2017, at 09:19, Zeming Yu wrote:
> Hi,
> We're building a parquet based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct?
> Thanks, Zeming

Where's the data

Redefining Spark native UDFs

2017-05-03 Thread Miguel Figueiredo
Hi, I have pre-defined SQL code that uses the decode UDF, which is not compatible with the existing Spark decode UDF. I tried to re-define the decode UDF without success. Is there a way to do this in Spark? Best regards, Miguel -- Miguel Figueiredo Software Developer http://jaragua.hopto.org
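
As a workaround sketch (the Oracle-style decode semantics below are hypothetical; whether a registered UDF can shadow a built-in name varies by Spark version, so a distinct name is the safe route):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("udf-rename").master("local[*]").getOrCreate()

    // Hypothetical replacement for the incompatible decode(): an Oracle-style
    // decode(expr, search, result, default). Registering it under a different
    // name sidesteps the clash with Spark's built-in decode.
    spark.udf.register("my_decode",
      (expr: String, search: String, result: String, default: String) =>
        if (expr == search) result else default)

    spark.sql("SELECT my_decode('a', 'a', 'matched', 'no match')").show()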

Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread Tathagata Das
You can apply any kind of aggregation on windows. There are some built-in aggregations (e.g. sum and count), as well as an API for user-defined aggregations (Scala/Java) that works with both batch and streaming DFs. See the programming guide if you haven't seen it already - windowing
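
For instance, a built-in windowed count on a streaming DataFrame, along the lines of the programming guide (Scala; the source path, schema and column names are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("windowed-agg").master("local[*]").getOrCreate()
    import spark.implicits._

    // Placeholder source/schema: JSON events with an event-time column "ts".
    val schema = new StructType().add("ts", TimestampType).add("word", StringType)
    val events = spark.readStream.schema(schema).json("/in/events")

    // Built-in aggregation (count) over 10-minute windows sliding every 5.
    val counts = events
      .groupBy(window($"ts", "10 minutes", "5 minutes"), $"word")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()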

Benchmark of XGBoost, Vowpal Wabbit and Spark ML on Criteo 1TB Dataset

2017-05-03 Thread pklemenkov
Hi! We've done a benchmark of popular ML libraries (including Spark ML) on the Criteo 1TB dataset: https://github.com/rambler-digital-solutions/criteo-1tb-benchmark Spark ML was tested on a real production cluster and showed great results at scale. We'd like to see some feedback and tips for

map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread peay
Hello, I would like to get started on Spark Streaming with a simple window. I've got some existing Spark code that takes a dataframe, and outputs a dataframe. This includes various joins and operations that are not supported by structured streaming yet. I am looking to essentially map/apply

spark 1.6 .0 and gridsearchcv

2017-05-03 Thread issues solution
Hi, I wonder if there is a method under PySpark 1.6 to perform GridSearchCV? If yes, can I ask for an example, please? Thanks
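
The spark.ml analogue of GridSearchCV is CrossValidator with a ParamGridBuilder, which is available in 1.6; a Scala sketch (pyspark.ml.tuning mirrors these classes, and `training` is a placeholder DataFrame with label/features columns):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Grid search with spark.ml's CrossValidator: try every combination in
    // the parameter grid with 3-fold cross-validation, keep the best model.
    val lr = new LogisticRegression()

    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

    val cv = new CrossValidator()
      .setEstimator(new Pipeline().setStages(Array(lr)))
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(training)   // `training` is a placeholder DataFrame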