Parse RDD[Seq[String]] to DataFrame with types.

2019-07-17 Thread Guillermo Ortiz Fernández
I'm trying to parse an RDD[Seq[String]] to a DataFrame. Although it's a Seq of Strings, the values could have a more specific type such as Int, Boolean, Double, String and so on. For example, a line could be: "hello", "1", "bye", "1.1" "hello1", "11", "bye1", "2.1" ... The first column is always going to be a
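Not from the thread, but a minimal sketch of one way to do this when the column types are known up front (the names, types and sample rows below are hypothetical):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("seqToDF").getOrCreate()

    // Hypothetical input: one Seq[String] per row, columns always in the same order.
    val rdd = spark.sparkContext.parallelize(Seq(
      Seq("hello", "1", "bye", "1.1"),
      Seq("hello1", "11", "bye1", "2.1")))

    // Assumed schema; adjust names and types to the real data.
    val schema = StructType(Seq(
      StructField("c1", StringType),
      StructField("c2", IntegerType),
      StructField("c3", StringType),
      StructField("c4", DoubleType)))

    // Cast each String to the type the schema declares before building Rows.
    val rows = rdd.map(s => Row(s(0), s(1).toInt, s(2), s(3).toDouble))

    val df = spark.createDataFrame(rows, schema)
    df.printSchema()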

Re: Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/05/28 11:11:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 369 On Tue, May 28, 2019 at 12:12, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote:

Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
I'm executing a load process into HBase with Spark (around 150M records). At the end of the process there are a lot of failed tasks. I get this error: 19/05/28 11:02:31 ERROR client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.TableNotFoundException: my_table at
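As a hedged aside: a TableNotFoundException during region lookup often comes down to the table name (namespace included) not matching what actually exists, so a quick check with the HBase Admin API can rule that out. The names below are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory

    // Picks up hbase-site.xml from the classpath the driver/executors use.
    val hbaseConf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val admin = connection.getAdmin

    // Check the exact name the job writes to, with and without a namespace.
    println(admin.tableExists(TableName.valueOf("my_table")))
    println(admin.tableExists(TableName.valueOf("my_namespace", "my_table")))

    connection.close()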

Trying to improve performance of the driver.

2018-09-13 Thread Guillermo Ortiz Fernández
I have a process in Spark Streaming which lasts 2 seconds. When I check where the time is spent I see about 0.8s-1s in processing time, although the total time is 2s. The remaining second is spent in the driver. I reviewed the code executed by the driver and commented out some of this code with
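A sketch (not from the thread) of one way to see where the extra second goes, by timing the driver-only code separately from the distributed job it launches inside foreachRDD; the stream type and the work are placeholders:

    import org.apache.spark.streaming.dstream.DStream

    // Split each batch into "driver-only" code vs the distributed job it launches.
    def timedBatch(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        val t0 = System.nanoTime()
        // driver-only work would go here (building maps, reading config, logging, ...)
        val t1 = System.nanoTime()
        rdd.foreachPartition(_ => ())   // stands in for the real action on the executors
        val t2 = System.nanoTime()
        println(f"driver-only: ${(t1 - t0) / 1e6}%.1f ms, job: ${(t2 - t1) / 1e6}%.1f ms")
      }
    }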

Re: deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
:lib/kafka-clients-0.10.0.1.jar \ --files conf/${1}Conf.json iris-core-0.0.1-SNAPSHOT.jar conf/${1}Conf.json On Wed, Sep 5, 2018 at 11:11, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote: > I want to execute my processes in cluster mode. As I don't know where th

deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
I want to execute my processes in cluster mode. As I don't know where the driver will be executed, I have to make available all the files it needs. I understand that there are two options: copy all the files to all nodes or copy them to HDFS. My question is, if I want to put all the files in HDFS, isn't
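A hedged sketch of the HDFS option: instead of copying files to every node, have the driver read the JSON config straight from HDFS, which resolves the same way wherever the driver lands (the path below is made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // Made-up location; an hdfs:// URI is reachable from any node that runs the driver.
    val confPath = new Path("hdfs:///apps/myapp/conf/devConf.json")

    // Resolve the filesystem from the path's scheme, using core-site.xml on the classpath.
    val fs: FileSystem = confPath.getFileSystem(new Configuration())
    val in = fs.open(confPath)
    val confJson = try Source.fromInputStream(in).mkString finally in.close()
    println(confJson.take(200))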

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think it could be a bug in this version, either in Spark or in Kafka? On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrote: > &g

java.lang.OutOfMemoryError: Java heap space - Spark driver.

2018-08-29 Thread Guillermo Ortiz Fernández
I got this error from the Spark driver; it seems that I should increase the driver memory, although it's 5g (and 4 cores) right now. It seems weird to me because I'm not using Kryo or broadcast in this process, but in the log there are references to Kryo and broadcast. How could I figure out the
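One hedged diagnostic idea (not from the thread): log the driver's heap usage per batch to see whether it grows steadily, and consider adding -XX:+HeapDumpOnOutOfMemoryError through spark.driver.extraJavaOptions so an OOM leaves a dump to inspect:

    // Sketch: log driver heap usage per batch; steady growth points at something
    // accumulating on the driver side rather than a one-off spike.
    val rt = Runtime.getRuntime
    def logHeap(tag: String): Unit = {
      val usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
      val maxMb = rt.maxMemory() / (1024 * 1024)
      println(s"[$tag] driver heap: $usedMb MB used of $maxMb MB max")
    }

    logHeap("after-batch")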

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10. Sometimes I get this exception and Spark dies. I couldn't see any error or problem on the machines; does anybody know the reason for this error? java.lang.IllegalStateException: This consumer has already been closed. at
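For context, a hedged sketch of a workaround sometimes suggested when cached executor-side consumers get closed and reused: disable the Kafka consumer cache. Treat the property name as an assumption to verify against the spark-streaming-kafka-0-10 version in use:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumption: this property disables the executor-side Kafka consumer cache in
    // spark-streaming-kafka-0-10; confirm it exists in the deployed Spark version.
    val conf = new SparkConf()
      .setAppName("kafka-stream")
      .set("spark.streaming.kafka.consumer.cache.enabled", "false")

    val ssc = new StreamingContext(conf, Seconds(2))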

Refresh broadcast variable when it isn't the value.

2018-08-19 Thread Guillermo Ortiz Fernández
Hello, I want to set data in a broadcast (Map) variable in Spark. Sometimes there is new data, so I have to update/refresh the values, but I'm not sure how I could do this. My idea is to use accumulators as a flag when a cache error occurs; at that point I could read the data and reload the
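Broadcast variables are read-only once created; a common pattern (a sketch under that assumption, not the thread's conclusion) is to keep the current Broadcast in a driver-side holder and re-broadcast when a refresh is needed. The loader below is hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Hypothetical loader for the reference Map; replace with the real source.
    def loadReferenceData(): Map[String, String] = Map("k" -> "v")

    object RefData {
      @volatile private var current: Broadcast[Map[String, String]] = _

      def get(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (current == null) refresh(sc)
        current
      }

      // Called from the driver (e.g. inside foreachRDD) when new data must be picked up.
      def refresh(sc: SparkContext): Unit = synchronized {
        val old = current
        current = sc.broadcast(loadReferenceData())
        if (old != null) old.unpersist(blocking = false)
      }
    }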

Reset the offsets, Kafka 0.10 and Spark

2018-06-07 Thread Guillermo Ortiz Fernández
I'm consuming data from Kafka with createDirectStream and storing the offsets in Kafka ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself ) val stream = KafkaUtils.createDirectStream[String, String]( streamingContext, PreferConsistent, Subscribe[String,
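For context (hedged, following the linked "Kafka itself" section of the docs): processed offset ranges are committed back to Kafka through CanCommitOffsets, and an explicit starting position can be forced by passing an offsets map to the consumer strategy. This reuses the excerpt's stream, streamingContext and kafkaParams; the topic name and offset are made up:

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._

    // Commit the processed ranges back to Kafka, as in the linked doc section.
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    // To restart from explicit positions, pass an offsets map to the consumer strategy.
    val fromOffsets = Map(new TopicPartition("my_topic", 0) -> 0L)
    val resetStream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my_topic"), kafkaParams, fromOffsets))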

Measure performance time in some spark transformations.

2018-05-12 Thread Guillermo Ortiz Fernández
I want to measure how long some transformations in Spark take, such as map, joinWithCassandraTable and so on. What is the best approach to do it? def time[R](block: => R): R = { val t0 = System.nanoTime() val result = block val t1 = System.nanoTime()
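The helper is cut off in the archive; a completed sketch follows, plus the usual caveat that transformations such as map are lazy, so the timed block should end in an action for the measurement to mean anything:

    // Completed sketch of the helper from the message; times whatever the block evaluates.
    def time[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block
      val t1 = System.nanoTime()
      println(s"Elapsed: ${(t1 - t0) / 1e6} ms")
      result
    }

    // map/joinWithCassandraTable only build the plan; wrap an action to measure real work:
    // val n = time { rdd.map(parse).count() }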