Long term arbitrary stateful processing - best practices

2018-08-29 Thread monohusche
Hi there, We are currently evaluating Spark to provide streaming analytics for unbounded data sets, basically creating aggregations over event-time windows while supporting both early and late/out-of-order messages. For streaming aggregations (=SQL operations with well-defined semantics, e.g.
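For reference, a minimal sketch of an event-time windowed aggregation with a watermark in Structured Streaming; the rate source and column names are illustrative, not from the original mail:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder.appName("WindowedAgg").getOrCreate()
    import spark.implicits._

    // Illustrative stream with an event-time column; a real job would read from Kafka etc.
    val events = spark.readStream.format("rate").load()
      .selectExpr("timestamp AS eventTime", "CAST(value AS STRING) AS word")

    val counts = events
      .withWatermark("eventTime", "10 minutes")           // tolerate late/out-of-order data
      .groupBy(window($"eventTime", "5 minutes"), $"word")
      .count()

    counts.writeStream
      .outputMode("update")                               // emit early/updated window results
      .format("console")
      .start()
      .awaitTermination()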

Is there a plan for official spark-avro/spark-orc read/write library using Data Source V2

2018-08-29 Thread yxchen
The Data Source V2 API is available in Spark 2.3, but currently there's no official library using the Data Source V2 API to read/write Avro or ORC files -- spark-avro and spark-orc are both using Data Source V1. I wonder if there's a plan upstream to implement those readers/writers and make them
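For context, both formats can already be read and written through the V1 paths in Spark 2.3; a sketch (the paths are placeholders, and spark-avro here means the external com.databricks:spark-avro package):

    // Avro via the external spark-avro package (Data Source V1)
    val avroDF = spark.read.format("com.databricks.spark.avro").load("/path/in.avro")
    avroDF.write.format("com.databricks.spark.avro").save("/path/out.avro")

    // ORC support is built in (also V1 in Spark 2.3)
    val orcDF = spark.read.orc("/path/in.orc")
    orcDF.write.orc("/path/out.orc")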

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think it could be a bug in this version, either in Spark or in Kafka? On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of Spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrote: > > I'm using Spark

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of Spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines, anybody has the >

Re: Spark code to write to MySQL and Hive

2018-08-29 Thread Jacek Laskowski
Hi, I haven't checked my answer (too lazy today), but I think I know what might be going on. tl;dr Use cache to preserve the initial set of rows from MySQL. After you append new rows, you will have twice as many rows as you had previously. Correct? Since newDF references the table every time you
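A sketch of the suggested fix, reusing the names from the original code: cache the JDBC rows and materialize them before appending, so newDF is not recomputed against the grown table:

    import org.apache.spark.sql.functions.col

    // Read once from MySQL and pin the rows in memory.
    val mysqlDF = spark.read.jdbc(jdbcUrl, table, properties).cache()
    mysqlDF.count()                                    // force the cache to materialize

    // Shift ids to avoid duplicate primary keys, then append.
    val newDF = mysqlDF.withColumn("id", col("id") + 10)
    newDF.write.mode("append").jdbc(jdbcUrl, table, properties)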

RE: Spark code to write to MySQL and Hive

2018-08-29 Thread ryandam.9
Sorry, the format of my last mail was not good.

    println("Going to talk to mySql")
    // Read table from mySQL.
    val mysqlDF = spark.read.jdbc(jdbcUrl, table, properties)
    println("I am back from mySql")
    mysqlDF.show()
    // Create a new Dataframe with column 'id' increased to avoid Duplicate primary keys

Spark code to write to MySQL and Hive

2018-08-29 Thread ryandam.9
Hi, Can anyone help me understand what is happening with my code? I wrote a Spark application to read from a MySQL table [that already has 4 records], create a new DF by adding 10 to the ID field. Then, I wanted to write the new DF to MySQL as well as to Hive. I am surprised to

Re: Parallelism: behavioural difference in version 1.2 and 2.1!?

2018-08-29 Thread Jeevan K. Srivatsa
Dear Apostolos, Thanks for the response! Our version is built on 2.1; the problem is that the state-of-the-art system I'm trying to compare against is built on version 1.2, so I have to deal with it. If I understand the level of parallelism correctly, --total-executor-cores is set to the number of
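For reference, a sketch of pinning the same parallelism on both installations programmatically; --total-executor-cores corresponds to spark.cores.max on standalone/Mesos, and both settings exist in 1.2 and 2.1 (the values are illustrative for an 8-instance, 4-core-each cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("Benchmark")
      .set("spark.cores.max", "32")             // 8 instances x 4 cores each
      .set("spark.default.parallelism", "32")   // default partition count for shuffles
    val sc = new SparkContext(conf)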

Re: Parallelism: behavioural difference in version 1.2 and 2.1!?

2018-08-29 Thread Apostolos N. Papadopoulos
Dear Jeevan, Spark 1.2 is quite old, and if I were you I would go for a newer version. However, is there a parallelism level (e.g., 20, 30) that works for both installations? regards, Apostolos On 29/08/2018 04:55 PM, jeevan.ks wrote: Hi, I've two systems. One is built on Spark 1.2 and

Parallelism: behavioural difference in version 1.2 and 2.1!?

2018-08-29 Thread jeevan.ks
Hi, I have two systems, one built on Spark 1.2 and the other on 2.1. I am benchmarking both with the same benchmarks (wordcount, grep, sort, etc.) on the same data set from an S3 bucket (sizes range from 50 MB to 10 GB). The Spark cluster I used is r3.xlarge, 8 instances, 4 cores each, and

Re: Which Py4J version goes with Spark 2.3.1?

2018-08-29 Thread Gourav Sengupta
Hi, I think the best option is to use the Py4J that is either installed automatically with "pip install pyspark" or, when we unzip the Spark download from its site, found in the SPARK_HOME/python/lib folder. Regards, Gourav Sengupta On Wed, Aug 29, 2018 at 8:00 AM Aakash Basu wrote: > Hi, > >

Spark udf from external jar without enabling Hive

2018-08-29 Thread Swapnil Chougule
Hi Team, I am creating a UDF from an external jar as follows:

    val spark = SparkSession.builder.appName("UdfUser")
      .master("local")
      .enableHiveSupport()
      .getOrCreate()
    spark.sql("CREATE FUNCTION uppercase AS 'path.package.udf.UpperCase' " + "USING JAR
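Since CREATE FUNCTION ... USING JAR goes through the Hive catalog, one workaround without enableHiveSupport is to load the class from the jar and register it programmatically. A sketch, on the assumption that path.package.udf.UpperCase implements Spark's Java UDF1 interface and the jar was shipped via --jars:

    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.StringType

    // Assumes the jar is on the driver/executor classpath (e.g. passed with --jars)
    // and that path.package.udf.UpperCase implements UDF1[String, String].
    val upper = Class.forName("path.package.udf.UpperCase")
      .newInstance().asInstanceOf[UDF1[String, String]]
    spark.udf.register("uppercase", upper, StringType)

    spark.sql("SELECT uppercase('hello')").show()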

java.lang.OutOfMemoryError: Java heap space - Spark driver.

2018-08-29 Thread Guillermo Ortiz Fernández
I got this error from the Spark driver; it seems that I should increase the driver memory, although it's 5g (and 4 cores) right now. It seems weird to me because I'm not using Kryo or broadcast in this process, but in the log there are references to Kryo and broadcast. How could I figure out the

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10; sometimes I get this exception and Spark dies. I couldn't see any error or problem among the machines. Does anybody know the reason for this error? java.lang.IllegalStateException: This consumer has already been closed. at
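One workaround sometimes suggested for cached-consumer failures in spark-streaming-kafka-0-10 is to disable the executor-side consumer cache; whether it applies to this particular 2.0.1 crash is an assumption:

    import org.apache.spark.SparkConf

    // Disable the cached Kafka consumer on the executors; each task then
    // creates its own consumer instead of reusing a possibly-closed one.
    val conf = new SparkConf()
      .setAppName("KafkaStream")
      .set("spark.streaming.kafka.consumer.cache.enabled", "false")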

Which Py4J version goes with Spark 2.3.1?

2018-08-29 Thread Aakash Basu
Hi, Which Py4J version goes with Spark 2.3.1? I have py4j-0.10.7, but it throws an error because of certain compatibility issues with Spark 2.3.1. Error: [2018-08-29] [06:46:56] [ERROR] - Traceback (most recent call last): File "", line 120, in run File