Re: How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-07-03 Thread Yuval.Itzchakov
Using a long period between checkpoints may cause a long lineage of graph computations to build up, since Spark uses checkpointing to cut the lineage; this can also delay the streaming job.
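For reference, a minimal sketch of tuning the checkpoint interval on a DStream so checkpoints are written less often without letting the lineage grow unbounded (the batch interval, checkpoint directory, and socket source below are assumptions, not from the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed setup: a 10-second batch interval and an HDFS checkpoint dir.
val conf = new SparkConf().setAppName("checkpoint-interval-demo")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/demo")

val stream = ssc.socketTextStream("localhost", 9999)

// Checkpoint every 5th batch instead of every batch: fewer writes,
// but the lineage Spark must replay on recovery stays bounded.
stream.checkpoint(Seconds(50))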

Analysis Exception after join

2017-07-03 Thread Bernard Jesop
Hello, I don't understand my error message. Basically, all I am doing is:
- dfAgg = df.groupBy("S_ID")
- dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
However, I get this AnalysisException: Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) S_ID#1903L

Re: Analysis Exception after join

2017-07-03 Thread Didac Gil
With a left join, you are joining two tables: in your case, df is the left table and dfAgg is the right table. The second parameter should be the joining condition, right? For instance:
dfRes = df.join(dfAgg, $"userName" === $"name", "left_outer")
having a field in df called userName, and another
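A minimal sketch of how the join from the original question might be assembled so it compiles and the key stays unambiguous (the count aggregate is an assumption — the thread never shows what dfAgg computes):

import org.apache.spark.sql.functions.count

// groupBy alone returns a RelationalGroupedDataset; an aggregate is
// needed to get back a DataFrame (count here is only an example).
val dfAgg = df.groupBy("S_ID").agg(count("*").as("cnt"))

// Passing the key as Seq("S_ID") keeps a single S_ID column in the
// result, which sidesteps the "resolved attribute(s)" ambiguity a
// self-join on duplicated columns can trigger.
val dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")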

[PySpark] - running processes

2017-07-03 Thread Sidney Feiner
In my Spark Streaming application, I need to build a graph from a file, and initializing that graph takes between 5 and 10 seconds. So I tried initializing it once per executor so it would be initialized only once. After running the application, I've noticed that it's initialized much more
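On the JVM side, the usual way to get one initialization per executor is a lazily initialized singleton; a sketch in Scala (the Graph type, loader, and path are hypothetical stand-ins — in PySpark the analogous trick is a module-level global that the mapped function checks before loading):

// Hypothetical graph type and loader, standing in for the real ones.
case class Graph(edges: Map[String, Seq[String]])
def buildGraph(path: String): Graph = Graph(Map.empty) // expensive in reality

object GraphHolder {
  // A lazy val runs at most once per JVM, i.e. once per executor,
  // the first time a task on that executor touches it.
  lazy val graph: Graph = buildGraph("/path/to/graph/file")
}

// Given some rdd: RDD[String], every task on the same executor
// then reuses the same instance instead of rebuilding it:
val sizes = rdd.map(record => (record, GraphHolder.graph.edges.size))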

Re: What's the simplest way to Read Avro records from Kafka to Spark DataSet/DataFrame?

2017-07-03 Thread kant kodali
import org.apache.avro.Schema
import org.apache.spark.sql.SparkSession

val schema = new Schema.Parser().parse(new File("user.avsc"))
val spark = SparkSession.builder().master("local").getOrCreate()

spark
  .read
  .format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
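The excerpt cuts off mid-chain; a completed sketch, finishing the read with .load (the input path and the final show are assumptions, not from the message):

import java.io.File

import org.apache.avro.Schema
import org.apache.spark.sql.SparkSession

val schema = new Schema.Parser().parse(new File("user.avsc"))
val spark = SparkSession.builder().master("local").getOrCreate()

val df = spark
  .read
  .format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/path/to/users.avro") // hypothetical path

df.show()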

[Spark SQL] JDBC connection from UDF

2017-07-03 Thread Patrik Medvedev
Hello guys, I'm using Spark SQL with Hive through Thrift. I need this because I have to create a table from tables matching a mask. Here is an example:
1. Take the tables by mask, like SHOW TABLES IN db 'table__*'
2. Create a query like: CREATE TABLE total_data AS SELECT * FROM table__1 UNION ALL SELECT * FROM table__2
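A sketch of one way to assemble that statement programmatically (the db name, mask, and total_data target come from the message; a SparkSession named spark and everything else are assumptions):

// 1. List the tables matching the mask.
val tables = spark.sql("SHOW TABLES IN db LIKE 'table__*'")
  .select("tableName")
  .collect()
  .map(_.getString(0))

// 2. Stitch the SELECTs together and create the combined table.
val union = tables.map(t => s"SELECT * FROM db.`$t`").mkString(" UNION ALL ")
spark.sql(s"CREATE TABLE db.total_data AS $union")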

spark submit with logs and kerberos

2017-07-03 Thread Juan Rods
Hi! I have a question about logs and have not found the answer on the internet. I have a spark-submit process and I configure custom logging for it using the following params: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=customlog4j.properties" --driver-java-options
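For context, a hedged sketch of how those pieces usually fit together on YARN: shipping the properties file with --files so executors can resolve it by plain name, the driver pointing at its local copy, and (since the subject mentions Kerberos) the standard --principal/--keytab flags. The principal, keytab path, class, and jar names are placeholders:

# user@REALM, /path/to/user.keytab, com.example.MyApp, myapp.jar: placeholders
spark-submit \
  --master yarn \
  --files customlog4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=customlog4j.properties" \
  --driver-java-options "-Dlog4j.configuration=file:customlog4j.properties" \
  --principal user@REALM \
  --keytab /path/to/user.keytab \
  --class com.example.MyApp myapp.jar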

Re: spark-jdbc impala with kerberos using yarn-client

2017-07-03 Thread morfious902002
Did you ever find a solution to this? If so, can you share it? I am running into a similar issue in YARN cluster mode connecting to an Impala table.