Re: [Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Jungtaek Lim
(Dropping the Kafka user mailing list as this is more likely a Spark issue.) Do you have a full stack trace for a log message? It would help to make clear where the issue lies. On Thu, Feb 18, 2021 at 8:01 PM Rathore, Yashasvini wrote: > Hello, > > Issues: > > * My team and I are trying to ...

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Khalid Mammadov
As the Scala book says, value types are mapped to Java primitive types. So when you use Int, for example, it compiles to int. Int is essentially syntactic sugar that makes Scala code more readable than a plain int would, and Scala adds extra perks through implicits etc. I think the ...
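A tiny illustration of that mapping, assuming Scala 2.x semantics: Int erases to the JVM primitive int, and boxing to java.lang.Integer happens through an implicit conversion in Predef.

```scala
val i: Int = 42                    // Scala's Int compiles down to a JVM primitive int
val boxed: java.lang.Integer = i   // boxed via Predef's implicit int2Integer conversion
println(boxed.intValue == i)       // true: same value, different representation
```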

Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone! I am sharing this article about running Spark / Presto workloads on AWS: Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS, published on the AWS blog. Hope you enjoy it. Feel free to discuss with me here. - Bin Fan

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Lalwani, Jayesh
Presto has slightly lower latency than Spark, but I've found that it gets stuck on some edge cases. If you are on AWS, then the simplest solution is to use Athena. Athena is built on Presto, has a JDBC driver, and is serverless, so you don't have to deal with any headaches. On 2/18/21, 3:32 PM, ...

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Scott Ribe
> On Feb 18, 2021, at 12:52 PM, Jeff Evans wrote: > > It sounds like the tool you're after, then, is a distributed SQL engine like > Presto. But I could be totally misunderstanding what you're trying to do. Presto may well be a longer-term solution as our use grows. For now, a simple data ...

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Scott Ribe
> On Feb 18, 2021, at 1:13 PM, Lalwani, Jayesh wrote: > > Have you tried any of those? Where are you getting stuck? Thanks! The third one in your list I had not found, and it seems to fill in what I was missing (CREATE EXTERNAL TABLE). I'd found the first two, but they only got me creating ...
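For reference, a minimal sketch of the kind of DDL being discussed, assuming the Parquet files already sit at a hypothetical path; the same statement can equally be issued over JDBC once the Thrift server is up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-table-sketch")   // hypothetical app name
  .enableHiveSupport()                // persist the definition in a metastore
  .getOrCreate()

// Registers the existing Parquet files as an unmanaged (external) table;
// the schema is read from the files, and dropping the table leaves them intact.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events
  USING parquet
  LOCATION '/data/warehouse/events'
""")
```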

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Lalwani, Jayesh
There are several step-by-step guides that you can find online by googling: https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html ...
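Once the Thrift server from the first link is running (sbin/start-thriftserver.sh, listening on port 10000 by default), any JDBC client can query it through the standard Hive driver. A minimal sketch, assuming a local server with authentication disabled:

```scala
import java.sql.DriverManager

// Requires the Hive JDBC driver (org.apache.hive:hive-jdbc) on the client classpath.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()
```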

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Jeff Evans
It sounds like the tool you're after, then, is a distributed SQL engine like Presto. But I could be totally misunderstanding what you're trying to do. On Thu, Feb 18, 2021 at 1:48 PM Scott Ribe wrote: > I have a client-side piece that needs access via JDBC. > > > On Feb 18, 2021, at 12:45 PM, ...

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Scott Ribe
I have a client-side piece that needs access via JDBC. > On Feb 18, 2021, at 12:45 PM, Jeff Evans wrote: > > If the data is already in Parquet files, I don't see any reason to involve > JDBC at all. You can read Parquet files directly into a DataFrame. ...

Re: how to serve data over JDBC using simplest setup

2021-02-18 Thread Jeff Evans
If the data is already in Parquet files, I don't see any reason to involve JDBC at all. You can read Parquet files directly into a DataFrame. https://spark.apache.org/docs/latest/sql-data-sources-parquet.html On Thu, Feb 18, 2021 at 1:42 PM Scott Ribe wrote: > I need a little help figuring out ...
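From the linked page, reading the files directly is a one-liner; the path below is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spark infers the schema from the Parquet footers; no JDBC layer involved.
val df = spark.read.parquet("/data/warehouse/events")
df.printSchema()
df.show(5)
```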

how to serve data over JDBC using simplest setup

2021-02-18 Thread Scott Ribe
I need a little help figuring out how some pieces fit together. I have some tables in Parquet files, and I want to access them using SQL over JDBC. I gather that I need to run the Thrift server, but how do I configure it to load my files into datasets and expose views? The context is this: ...
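For the view-exposing half of the question, one hedged sketch (assuming an existing SparkSession named spark; view name and path are hypothetical): a session-scoped view defined directly over the Parquet files. Pasted into a beeline/JDBC session against the Thrift server, the same SQL makes the view queryable for the lifetime of that session.

```scala
// The same statement can be issued through beeline once connected to the Thrift server.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW events
  USING parquet
  OPTIONS (path '/data/warehouse/events')
""")
spark.sql("SELECT count(*) FROM events").show()
```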

Re: understanding spark shuffle file re-use better

2021-02-18 Thread Mandloi87
Increase or decrease the number of data partitions: since a data partition represents the quantum of data to be processed together by a single Spark task, there could be situations (a) where the existing number of data partitions is not sufficient to maximize the usage of available ...
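Concretely, the two usual knobs look like this; the values are illustrative and `df` stands for any existing DataFrame:

```scala
// Partitions produced by shuffles (joins, aggregations); the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Explicitly reshape an existing DataFrame's partitioning.
val widened   = df.repartition(400)   // full shuffle, increases parallelism
val collapsed = df.coalesce(50)       // narrow dependency, avoids a full shuffle
```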

PySpark registerJavaUDAF doesn't accept UDAF Aggregator (Spark 3)

2021-02-18 Thread Grégory Dugernier
Greetings, I've been trying to migrate a piece of code from Scala / Spark 2.x to PySpark 3.0.1. Part of the software includes a User-Defined Aggregate Function (UDAF), which presented a two-fold problem: - The UserDefinedAggregateFunction abstract class is deprecated in Spark >= 3.0 ...
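For context, the Spark 3 replacement on the JVM side is an Aggregator registered through functions.udaf; once registered under a SQL name it is callable from PySpark via spark.sql, side-stepping registerJavaUDAF. A minimal sketch with a hypothetical sum aggregator:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Hypothetical Aggregator: sums Long inputs.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buf: Long, in: Long): Long = buf + in
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(buf: Long): Long = buf
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.udf.register("long_sum", functions.udaf(LongSum))
// PySpark can now call it with spark.sql("SELECT long_sum(value) FROM some_table")
```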

[Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Rathore, Yashasvini
Hello, Issues: * My team and I are trying to consume some Kafka topics based on timestamps using the startingOffsetsByTimestamp option, and the code works fine when we run via a Databricks notebook. * There is a need to set up the whole process in a local system (IntelliJ), but the ...
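For reference, a minimal sketch of the documented option (available since Spark 3.0); the broker, topic, and epoch-millisecond timestamps are placeholders, and outside Databricks the spark-sql-kafka-0-10 artifact must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// startingOffsetsByTimestamp maps topic -> partition -> epoch-millis timestamp;
// each partition starts from the first offset whose timestamp is >= the value given.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsetsByTimestamp",
    """{"topic1": {"0": 1613600000000, "1": 1613600000000}}""")
  .load()
```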

Re: Spark SQL Dataset and BigDecimal

2021-02-18 Thread Ivan Petrov
I'm fine with both. So does it make sense to use java.math.BigDecimal everywhere to avoid the perf penalty for value conversion? scala.math.BigDecimal looks like a wrapper around java.math.BigDecimal though... On Thu, 18 Feb 2021 at 00:33, Takeshi Yamamuro wrote: > Yea, I think that's because it's needed for ...
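For what it's worth, the wrapper relationship is direct, so crossing the boundary is cheap; a small illustration:

```scala
import java.math.{BigDecimal => JBigDecimal}

val s: BigDecimal  = BigDecimal("12.34")  // scala.math.BigDecimal
val j: JBigDecimal = s.bigDecimal         // the wrapped java.math.BigDecimal, no copy
val w: BigDecimal  = BigDecimal(j)        // re-wrapping allocates only the thin wrapper
```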