Spark 1.6: change the number of partitions without repartition and without shuffling

2018-06-13 Thread Spico Florin
Hello! I have a Parquet file of 60MB representing 10 million records. When I read this file with Spark 2.3.0 and the configuration spark.sql.files.maxPartitionBytes=1024*1024*2 (=2MB), I got 29 partitions, as expected. Code: sqlContext.setConf("spark.sql.files.maxPartitionBytes",
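
A minimal PySpark sketch of the setup described above (the path "events.parquet" is hypothetical). Lowering spark.sql.files.maxPartitionBytes makes the file source split the input into more, smaller read partitions, with no repartition() and no shuffle:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-size-demo")
         .config("spark.sql.files.maxPartitionBytes", 2 * 1024 * 1024)  # 2 MB
         .getOrCreate())

df = spark.read.parquet("events.parquet")
# A ~60 MB file should split into roughly 60 / 2 = 30 read partitions;
# the exact count also depends on spark.sql.files.openCostInBytes and the
# file's layout, so a figure like the thread's 29 is plausible.
print(df.rdd.getNumPartitions())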

Re: How to branch a Stream / have multiple Sinks / do multiple Queries on one Stream

2018-06-13 Thread Amiya Mishra
Hi Jürgen, have you found any solution or workaround for writing to multiple sinks from a single source, since we cannot process multiple sinks at a time? I also have a scenario in ETL where we use a clone component with multiple sinks fed by a single input stream DataFrame. Can you keep posting once you
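
For reference, the pattern usually suggested for Structured Streaming in this era (Spark 2.3) is to start one streaming query per sink from the same logical source; each query tracks its own offsets and processes the stream independently. A minimal sketch using the built-in rate test source (the "evens"/"odds" split is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-sink-demo").getOrCreate()

source = (spark.readStream
          .format("rate")                  # built-in test source
          .option("rowsPerSecond", 10)
          .load())

# Two independent queries over the same logical stream, one per sink.
q1 = (source.filter(source.value % 2 == 0)
      .writeStream.format("console").queryName("evens").start())
q2 = (source.filter(source.value % 2 != 0)
      .writeStream.format("console").queryName("odds").start())

spark.streams.awaitAnyTermination()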

Inferring from Event Timeline

2018-06-13 Thread Aakash Basu
Hi guys, what can be inferred by closely watching an event timeline in the Spark UI? I generally monitor the tasks that take more time and also how many run in parallel. What else? E.g., event timeline from the Spark UI: [screenshot]. Thanks, Aakash.

withColumn on nested schema

2018-06-13 Thread Zsolt Tóth
Hi, I'm trying to replace values in a nested column of a JSON-based DataFrame using withColumn(). This syntax works for select, filter, etc., giving only the nested "country" column: df.select('body.payload.country'), but if I do the same with withColumn(), it will create a new column with the name
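
In Spark 2.3, withColumn() treats the given name as a literal top-level column name, so the usual workaround is to rebuild the enclosing struct and replace the top-level column. A sketch assuming the thread's body.payload.country schema, with an uppercasing transform purely as an illustration:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "body",
    F.struct(
        F.struct(
            F.upper(F.col("body.payload.country")).alias("country")
            # any sibling fields under payload must be re-listed here too,
            # or they are dropped from the rebuilt struct
        ).alias("payload")
        # likewise for any sibling fields under body
    ))

(Spark 3.1 later added Column.withField() to do this directly, but that was not available at the time of this thread.)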

Understanding Spark behavior when reading from Kafka into a static DataFrame

2018-06-13 Thread Arbab Khalil
Hi all, I have IoT time series data in Kafka and reading it in static dataframe as: df = spark.read\ .format("kafka")\ .option("zookeeper.connect", "localhost:2181")\ .option("kafka.bootstrap.servers", "localhost:9092")\ .option("subscribe", "test_topic")\ .option("failOnDataLoss", "false")\