Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Michael Shtelma
You can also cache the data frame on disk if it does not fit into memory. An alternative would be to write out the data frame as parquet and then read it back; you can check whether in this case the whole pipeline works faster than with the standard cache. Best, Michael On Tue, Nov 20, 2018 at 9:14 AM
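A minimal Scala sketch of the two alternatives mentioned above (disk-only persistence vs. a parquet round trip); the path and the input DataFrame are placeholders, not the original poster's pipeline:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-vs-parquet").getOrCreate()
    val df = spark.range(1000000L).toDF("id")            // stand-in for the real pipeline input

    // Option 1: keep the cached blocks on disk instead of in memory
    val cachedOnDisk = df.persist(StorageLevel.DISK_ONLY)

    // Option 2: materialize the data as parquet and read it back, which also cuts the lineage
    df.write.mode("overwrite").parquet("/tmp/checkpointed_df")   // placeholder path
    val reloaded = spark.read.parquet("/tmp/checkpointed_df")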

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread Michael Shtelma
Hi, you can use this project in order to read Avro using Spark Structured Streaming: https://github.com/AbsaOSS/ABRiS Spark 2.4 also has built-in support for Avro, so you can use the from_avro function in Spark 2.4. Best, Michael On Sat, Nov 3, 2018 at 4:34 AM Divya Narayan wrote: > Hi, > > I
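A hedged sketch of the Spark 2.4 route using the built-in from_avro function; it assumes the external spark-avro module and the Kafka source are on the classpath (e.g. --packages org.apache.spark:spark-avro_2.11:2.4.0), and the broker, topic, and schema are placeholders. Plain from_avro expects raw Avro payloads; handling Confluent schema-registry-framed messages is part of what ABRiS provides.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro.from_avro

    val spark = SparkSession.builder().appName("avro-streaming").getOrCreate()
    import spark.implicits._

    // Avro writer schema as a JSON string; in practice this often comes from a schema registry
    val avroSchema =
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "events")                      // placeholder topic
      .load()
      .select(from_avro($"value", avroSchema).as("event"))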

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Michael Shtelma
If you configure too many Kafka partitions, you can run into memory issues. This will increase the memory requirements for the Spark job a lot. Best, Michael On Wed, Nov 7, 2018 at 8:28 AM JF Chen wrote: > I have a Spark Streaming application which reads data from kafka and save > the the
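A small, hedged sketch of one way to raise downstream parallelism without adding Kafka partitions: repartition the stream right after the source (shown here in Structured Streaming terms; the broker, topic, and partition count are made up for illustration).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-parallelism").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "input-topic")                 // placeholder
      .load()
      // spread the downstream work over more tasks than there are Kafka partitions,
      // at the cost of an extra shuffle
      .repartition(64)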

Re: [Arrow][Dremio]

2018-05-14 Thread Michael Shtelma
Hi Xavier, Dremio looks really interesting and has a nice UI. I think the idea of replacing SSIS or similar tools with Dremio is not so bad, but what about complex scenarios with a lot of code and transformations? Is it possible to use Dremio via an API and define your own transformations and

INSERT INTO TABLE_PARAMS fails during ANALYZE TABLE

2018-04-19 Thread Michael Shtelma
Hi everybody, I wanted to test CBO with enabled histograms. In order to do this, I have enabled the property spark.sql.statistics.histogram.enabled. In this test, Derby was used as the database for the Hive metastore. The problem is that in some cases the values that are inserted into the table TABLE_PARAMS
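For context, a minimal sketch of the setup being described (CBO plus histogram statistics); the database, table, and column names are placeholders. The ANALYZE TABLE ... FOR COLUMNS statement is what persists the computed statistics as table properties in the metastore, which with a Hive/Derby metastore end up in TABLE_PARAMS.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-histograms")
      .enableHiveSupport()
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.statistics.histogram.enabled", "true")
      .getOrCreate()

    // collecting column statistics (with histograms enabled above) writes the
    // statistics values into the metastore
    spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS id, amount")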

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Michael Shtelma
comma-separated list > of multiple directories on different disks. NOTE: In Spark 1.0 and later > this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or > LOCAL_DIRS (YARN) environment variables set by the cluster manager. > > Regards, > Gourav Sengupta > > > &

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-27 Thread Michael Shtelma
uin@gmail.com> wrote: > Can you file a jira if this is a bug? > Thanks! > > On Sat, Mar 24, 2018 at 1:23 AM, Michael Shtelma <mshte...@gmail.com> wrote: >> >> Hi Maropu, >> >> the problem seems to be in FilterEstimation.scala on lines 50 and 52: >> >&

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
> > sorry for the late reply. I guess you may have to set it through the hdfs > core-site.xml file. The property you need to set is "hadoop.tmp.dir" which > defaults to "/tmp/hadoop-${user.name}" > > Regards, > Keith. > > http://keith-chapman.com >

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
rows the exception. > > // maropu > > On Fri, Mar 23, 2018 at 6:20 PM, Michael Shtelma <mshte...@gmail.com> wrote: >> >> Hi all, >> >> I am using Spark 2.3 with activated cost-based optimizer and a couple >> of hive tables, that were analyzed previously

Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
Hi all, I am using Spark 2.3 with the activated cost-based optimizer and a couple of Hive tables that were analyzed previously. I am getting the following exception for different queries: java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:494) at
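A hedged sketch of how such a session is typically configured, plus a way to inspect the statistics the optimizer actually sees (malformed min/max values there would be a plausible source of a NumberFormatException during filter estimation); the table and column names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("cbo-check")
      .enableHiveSupport()
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .getOrCreate()

    // table-level statistics (row count, size) show up in the Statistics line
    spark.sql("DESCRIBE EXTENDED my_db.my_table").show(100, truncate = false)

    // column-level statistics (min, max, distinct count, histogram), supported since Spark 2.3
    spark.sql("DESCRIBE EXTENDED my_db.my_table my_column").show(truncate = false)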

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp JVM is using the second -Djava.io.tmpdir parameter and writing everything to the same directory as before. Best, Michael Sincerely, Michael Shtelma On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Ca

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Chapman <keithgchap...@gmail.com> wrote: > Hi Michael, > > You could either set spark.local.dir through spark conf or java.io.tmpdir > system property. > > Regards, > Keith. > > http://keith-chapman.com > > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtel

Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi everybody, I am running a Spark job on YARN, and my problem is that the blockmgr-* folders are being created under /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/* This folder can grow to a significant size and does not really fit into the /tmp file system for one
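A minimal sketch of pointing the scratch space (where the blockmgr-* directories land) away from /tmp via spark.local.dir; the paths are placeholders. Note the caveat quoted elsewhere in this thread: on YARN the directories come from LOCAL_DIRS / yarn.nodemanager.local-dirs set by the node manager, which overrides this setting.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("local-dir-example")
      // comma-separated list of directories on different disks; placeholder paths
      .config("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
      .getOrCreate()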

spark.sql call takes far too long

2018-01-24 Thread Michael Shtelma
Hi all, I have a problem with the performance of the sparkSession.sql call. It takes up to a couple of seconds for me right now. I have a lot of generated temporary tables, which are registered within the session, and also a lot of temporary data frames. Is it possible that the
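One cheap thing to rule out is a large number of leftover temporary views in the session; a small sketch using the catalog API (whether this is the actual cause of the slow spark.sql call is only an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("temp-view-housekeeping").getOrCreate()

    // count the temporary views currently registered in this session
    val tempViews = spark.catalog.listTables().collect().filter(_.isTemporary)
    println(s"temporary views: ${tempViews.length}")

    // drop the ones that are no longer needed
    tempViews.map(_.name).foreach(name => spark.catalog.dropTempView(name))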

Re: Inner join with the table itself

2018-01-16 Thread Michael Shtelma
nside). > > > Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-st

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com

Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
Hi all, I am trying to compile my UDF with the Janino compiler and then register it in Spark and use it afterwards. Here is the code: String s = " \n" + "public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {\n" + "@Override\n"
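A hedged, self-contained sketch of the overall approach (compile a UDF class from source with Janino, load it, and register it with Spark); the class body is illustrative rather than the original poster's code, and it assumes org.codehaus.janino is on the driver classpath. On a real cluster the executors must also be able to load the generated class when tasks are deserialized, which is one of the usual pitfalls with this technique.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.DataTypes
    import org.codehaus.janino.SimpleCompiler

    val spark = SparkSession.builder().appName("janino-udf").getOrCreate()

    // Java source for the UDF, kept to the raw UDF1 interface as in the original post
    val source =
      """public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {
        |  public Object call(Object value) throws Exception {
        |    return value == null ? null : value.toString().toUpperCase();
        |  }
        |}""".stripMargin

    // compile the source in memory and load the resulting class
    val compiler = new SimpleCompiler()
    compiler.cook(source)
    val udfInstance = compiler.getClassLoader
      .loadClass("MyUDF")
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[UDF1[_, _]]

    // register it under a SQL-callable name with an explicit return type
    spark.udf.register("myUdf", udfInstance, DataTypes.StringType)
    spark.sql("SELECT myUdf('hello') AS upper_value").show()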