Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Michael Shtelma
You can also cache the data frame on disk if it does not fit into memory. An alternative would be to write out the data frame as parquet and then read it back; you can check whether in this case the whole pipeline works faster than with the standard cache. Best, Michael On Tue, Nov 20, 2018 at 9:14 AM
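A minimal Scala sketch of the two alternatives mentioned above (disk-only persistence vs. a parquet round trip); the path and the input DataFrame are placeholders, not the original poster's pipeline:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-vs-parquet").getOrCreate()
    val df = spark.range(1000000L).toDF("id")            // stand-in for the real pipeline input

    // Option 1: keep the cached blocks on disk instead of in memory
    val cachedOnDisk = df.persist(StorageLevel.DISK_ONLY)

    // Option 2: materialize the data as parquet and read it back, which also cuts the lineage
    df.write.mode("overwrite").parquet("/tmp/checkpointed_df")   // placeholder path
    val reloaded = spark.read.parquet("/tmp/checkpointed_df")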

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread Michael Shtelma
Hi, you can use this project in order to read Avro using Spark Structured Streaming: https://github.com/AbsaOSS/ABRiS Spark 2.4 also has built-in support for Avro, so you can use the from_avro function in Spark 2.4. Best, Michael On Sat, Nov 3, 2018 at 4:34 AM Divya Narayan wrote: > Hi, > > I
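A hedged sketch of the Spark 2.4 route using the built-in from_avro function; it assumes the external spark-avro module and the Kafka source are on the classpath (e.g. --packages org.apache.spark:spark-avro_2.11:2.4.0), and the broker, topic, and schema are placeholders. Plain from_avro expects raw Avro payloads; handling Confluent schema-registry-framed messages is part of what ABRiS provides.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro.from_avro

    val spark = SparkSession.builder().appName("avro-streaming").getOrCreate()
    import spark.implicits._

    // Avro writer schema as a JSON string; in practice this often comes from a schema registry
    val avroSchema =
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "events")                      // placeholder topic
      .load()
      .select(from_avro($"value", avroSchema).as("event"))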

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Michael Shtelma
If you configure too many Kafka partitions, you can run into memory issues. This will increase the memory requirements for the Spark job a lot. Best, Michael On Wed, Nov 7, 2018 at 8:28 AM JF Chen wrote: > I have a Spark Streaming application which reads data from kafka and save > the the
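A small, hedged sketch of one way to raise downstream parallelism without adding Kafka partitions: repartition the stream right after the source (shown here in Structured Streaming terms; the broker, topic, and partition count are made up for illustration).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-parallelism").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "input-topic")                 // placeholder
      .load()
      // spread the downstream work over more tasks than there are Kafka partitions,
      // at the cost of an extra shuffle
      .repartition(64)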

Re: [Arrow][Dremio]

2018-05-14 Thread Michael Shtelma
Hi Xavier, Dremio looks really interesting and has a nice UI. I think the idea of replacing SSIS or similar tools with Dremio is not so bad, but what about complex scenarios with a lot of code and transformations? Is it possible to use Dremio via an API and define your own transformations and

INSERT INTO TABLE_PARAMS fails during ANALYZE TABLE

2018-04-19 Thread Michael Shtelma
Hi everybody, I wanted to test CBO with enabled histograms. In order to do this, I have enabled the property spark.sql.statistics.histogram.enabled. In this test, Derby was used as the database for the Hive metastore. The problem is that in some cases the values that are inserted into the table TABLE_PARAMS
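For context, a minimal sketch of the setup being described (CBO plus histogram statistics); the database, table, and column names are placeholders. The ANALYZE TABLE ... FOR COLUMNS statement is what persists the computed statistics as table properties in the metastore, which with a Hive/Derby metastore end up in TABLE_PARAMS.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-histograms")
      .enableHiveSupport()
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.statistics.histogram.enabled", "true")
      .getOrCreate()

    // collecting column statistics (with histograms enabled above) writes the
    // statistics values into the metastore
    spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS id, amount")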

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Michael Shtelma
comma-separated list > of multiple directories on different disks. NOTE: In Spark 1.0 and later > this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or > LOCAL_DIRS (YARN) environment variables set by the cluster manager. > > Regards, > Gourav Sengupta > > > &

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-27 Thread Michael Shtelma
uin@gmail.com> wrote: > Can you file a jira if this is a bug? > Thanks! > > On Sat, Mar 24, 2018 at 1:23 AM, Michael Shtelma <mshte...@gmail.com> wrote: >> >> Hi Maropu, >> >> the problem seems to be in FilterEstimation.scala on lines 50 and 52: >> >&

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
> > sorry for the late reply. I guess you may have to set it through the hdfs > core-site.xml file. The property you need to set is "hadoop.tmp.dir" which > defaults to "/tmp/hadoop-${user.name}" > > Regards, > Keith. > > http://keith-chapman.com >

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
rows the exception. > > // maropu > > On Fri, Mar 23, 2018 at 6:20 PM, Michael Shtelma <mshte...@gmail.com> wrote: >> >> Hi all, >> >> I am using Spark 2.3 with activated cost-based optimizer and a couple >> of hive tables, that were analyzed previously

Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
Hi all, I am using Spark 2.3 with the activated cost-based optimizer and a couple of Hive tables that were analyzed previously. I am getting the following exception for different queries: java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:494) at
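A hedged sketch of how such a session is typically configured, plus a way to inspect the statistics the optimizer actually sees (malformed min/max values there would be a plausible source of a NumberFormatException during filter estimation); the table and column names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-check")
      .enableHiveSupport()
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .getOrCreate()

    // table-level statistics (row count, size) show up in the Statistics line
    spark.sql("DESCRIBE EXTENDED my_db.my_table").show(100, truncate = false)

    // column-level statistics (min, max, distinct count, histogram), supported since Spark 2.3
    spark.sql("DESCRIBE EXTENDED my_db.my_table my_column").show(truncate = false)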

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp JVM is using the second -Djava.io.tmpdir parameter and writing everything to the same directory as before. Best, Michael Sincerely, Michael Shtelma On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Ca

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Chapman <keithgchap...@gmail.com> wrote: > Hi Michael, > > You could either set spark.local.dir through spark conf or java.io.tmpdir > system property. > > Regards, > Keith. > > http://keith-chapman.com > > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtel

Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi everybody, I am running a Spark job on YARN, and my problem is that the blockmgr-* folders are being created under /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/* This folder can grow to a significant size and does not really fit into the /tmp file system for one
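A minimal sketch of pointing the scratch space (where the blockmgr-* directories land) away from /tmp via spark.local.dir; the paths are placeholders. Note the caveat quoted elsewhere in this thread: on YARN the directories come from LOCAL_DIRS / yarn.nodemanager.local-dirs set by the node manager, which overrides this setting.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("local-dir-example")
      // comma-separated list of directories on different disks; placeholder paths
      .config("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
      .getOrCreate()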

spark.sql call takes far too long

2018-01-24 Thread Michael Shtelma
Hi all, I have a problem with the performance of the sparkSession.sql call. It takes up to a couple of seconds for me right now. I have a lot of generated temporary tables, which are registered within the session, and also a lot of temporary data frames. Is it possible that the
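One cheap thing to rule out is a large number of leftover temporary views in the session; a small sketch using the catalog API (whether this is the actual cause of the slow spark.sql call is only an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("temp-view-housekeeping").getOrCreate()

    // count the temporary views currently registered in this session
    val tempViews = spark.catalog.listTables().collect().filter(_.isTemporary)
    println(s"temporary views: ${tempViews.length}")

    // drop the ones that are no longer needed
    tempViews.map(_.name).foreach(name => spark.catalog.dropTempView(name))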

Re: Inner join with the table itself

2018-01-16 Thread Michael Shtelma
nside). > > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-st

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com

Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
Hi all, I am trying to compile my UDF with the Janino compiler and then register it in Spark and use it afterwards. Here is the code: String s = " \n" + "public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {\n" + "@Override\n"
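A hedged, self-contained sketch of the overall approach (compile a UDF class from source with Janino, load it, and register it with Spark); the class body is illustrative rather than the original poster's code, and it assumes org.codehaus.janino is on the driver classpath. On a real cluster the executors must also be able to load the generated class when tasks are deserialized, which is one of the usual pitfalls with this technique.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.DataTypes
    import org.codehaus.janino.SimpleCompiler

    val spark = SparkSession.builder().appName("janino-udf").getOrCreate()

    // Java source for the UDF, kept to the raw UDF1 interface as in the original post
    val source =
      """public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {
        |  public Object call(Object value) throws Exception {
        |    return value == null ? null : value.toString().toUpperCase();
        |  }
        |}""".stripMargin

    // compile the source in memory and load the resulting class
    val compiler = new SimpleCompiler()
    compiler.cook(source)
    val udfInstance = compiler.getClassLoader
      .loadClass("MyUDF")
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[UDF1[_, _]]

    // register it under a SQL-callable name with an explicit return type
    spark.udf.register("myUdf", udfInstance, DataTypes.StringType)
    spark.sql("SELECT myUdf('hello') AS upper_value").show()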