Parse RDD[Seq[String]] to DataFrame with types.

2019-07-17 Thread Guillermo Ortiz Fernández
I'm trying to parse an RDD[Seq[String]] to a DataFrame. Although it's a Seq of Strings, the values could have a more specific type such as Int, Boolean, Double, String and so on. For example, a line could be: "hello", "1", "bye", "1.1" "hello1", "11", "bye1", "2.1" ... The first column is always going to be a
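Not from the thread, but a minimal sketch of one way to do this when the column types are known up front (the names, types and sample rows below are hypothetical):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("seqToDF").getOrCreate()

    // Hypothetical input: one Seq[String] per row, columns always in the same order.
    val rdd = spark.sparkContext.parallelize(Seq(
      Seq("hello", "1", "bye", "1.1"),
      Seq("hello1", "11", "bye1", "2.1")))

    // Assumed schema; adjust names and types to the real data.
    val schema = StructType(Seq(
      StructField("c1", StringType),
      StructField("c2", IntegerType),
      StructField("c3", StringType),
      StructField("c4", DoubleType)))

    // Cast each String to the type the schema declares before building Rows.
    val rows = rdd.map(s => Row(s(0), s(1).toInt, s(2), s(3).toDouble))

    val df = spark.createDataFrame(rows, schema)
    df.printSchema()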

Re: Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/05/28 11:11:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 369 On Tue, May 28, 2019 at 12:12, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote:

Putting record in HBase with Spark - error get regions.

2019-05-28 Thread Guillermo Ortiz Fernández
I'm executing a load process into HBase with Spark (around 150M records). At the end of the process there are a lot of failed tasks. I get this error: 19/05/28 11:02:31 ERROR client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.TableNotFoundException: my_table at
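As a hedged aside: a TableNotFoundException during region lookup often comes down to the table name (namespace included) not matching what actually exists, so a quick check with the HBase Admin API can rule that out. The names below are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory

    // Picks up hbase-site.xml from the classpath the driver/executors use.
    val hbaseConf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val admin = connection.getAdmin

    // Check the exact name the job writes to, with and without a namespace.
    println(admin.tableExists(TableName.valueOf("my_table")))
    println(admin.tableExists(TableName.valueOf("my_namespace", "my_table")))

    connection.close()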

Trying to improve performance of the driver.

2018-09-13 Thread Guillermo Ortiz Fernández
I have a process in Spark Streaming which lasts 2 seconds. When I check where the time is spent I see about 0.8s-1s in processing time, although the total time is 2s. The remaining second is spent in the driver. I reviewed the code executed by the driver and commented out some of this code with
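A sketch (not from the thread) of one way to see where the extra second goes, by timing the driver-only code separately from the distributed job it launches inside foreachRDD; the stream type and the work are placeholders:

    import org.apache.spark.streaming.dstream.DStream

    // Split each batch into "driver-only" code vs the distributed job it launches.
    def timedBatch(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        val t0 = System.nanoTime()
        // driver-only work would go here (building maps, reading config, logging, ...)
        val t1 = System.nanoTime()
        rdd.foreachPartition(_ => ())   // stands in for the real action on the executors
        val t2 = System.nanoTime()
        println(f"driver-only: ${(t1 - t0) / 1e6}%.1f ms, job: ${(t2 - t1) / 1e6}%.1f ms")
      }
    }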

Re: deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
:lib/kafka-clients-0.10.0.1.jar \ --files conf/${1}Conf.json iris-core-0.0.1-SNAPSHOT.jar conf/${1}Conf.json On Wed, Sep 5, 2018 at 11:11, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote: > I want to execute my processes in cluster mode. As I don't know where th

deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
I want to execute my processes in cluster mode. As I don't know where the driver will be executed, I have to make available all the files it needs. I understand that there are two options: copy all the files to all nodes or copy them to HDFS. My question is, if I want to put all the files in HDFS, isn't
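A hedged sketch of the HDFS option: instead of copying files to every node, have the driver read the JSON config straight from HDFS, which resolves the same way wherever the driver lands (the path below is made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // Made-up location; an hdfs:// URI is reachable from any node that runs the driver.
    val confPath = new Path("hdfs:///apps/myapp/conf/devConf.json")

    // Resolve the filesystem from the path's scheme, using core-site.xml on the classpath.
    val fs: FileSystem = confPath.getFileSystem(new Configuration())
    val in = fs.open(confPath)
    val confJson = try Source.fromInputStream(in).mkString finally in.close()
    println(confJson.take(200))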

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think it could be a bug in this version, either in Spark or in Kafka? On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrote: > &g

java.lang.OutOfMemoryError: Java heap space - Spark driver.

2018-08-29 Thread Guillermo Ortiz Fernández
I got this error from the Spark driver; it seems that I should increase the driver memory, although it's 5g (and 4 cores) right now. It seems weird to me because I'm not using Kryo or broadcast in this process, but in the log there are references to Kryo and broadcast. How could I figure out the
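One hedged diagnostic idea (not from the thread): log the driver's heap usage per batch to see whether it grows steadily, and consider adding -XX:+HeapDumpOnOutOfMemoryError through spark.driver.extraJavaOptions so an OOM leaves a dump to inspect:

    // Sketch: log driver heap usage per batch; steady growth points at something
    // accumulating on the driver side rather than a one-off spike.
    val rt = Runtime.getRuntime
    def logHeap(tag: String): Unit = {
      val usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
      val maxMb = rt.maxMemory() / (1024 * 1024)
      println(s"[$tag] driver heap: $usedMb MB used of $maxMb MB max")
    }

    logHeap("after-batch")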

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10. Sometimes I get this exception and Spark dies. I couldn't see any error or problem on the machines; does anybody know the reason for this error? java.lang.IllegalStateException: This consumer has already been closed. at
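For context, a hedged sketch of a workaround sometimes suggested when cached executor-side consumers get closed and reused: disable the Kafka consumer cache. Treat the property name as an assumption to verify against the spark-streaming-kafka-0-10 version in use:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumption: this property disables the executor-side Kafka consumer cache in
    // spark-streaming-kafka-0-10; confirm it exists in the deployed Spark version.
    val conf = new SparkConf()
      .setAppName("kafka-stream")
      .set("spark.streaming.kafka.consumer.cache.enabled", "false")

    val ssc = new StreamingContext(conf, Seconds(2))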

Refresh broadcast variable when it isn't the value.

2018-08-19 Thread Guillermo Ortiz Fernández
Hello, I want to set data in a broadcast (Map) variable in Spark. Sometimes there is new data, so I have to update/refresh the values, but I'm not sure how I could do this. My idea is to use accumulators as a flag when a cache error occurs; at that point I could read the data and reload the
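Broadcast variables are read-only once created; a common pattern (a sketch under that assumption, not the thread's conclusion) is to keep the current Broadcast in a driver-side holder and re-broadcast when a refresh is needed. The loader below is hypothetical:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Hypothetical loader for the reference Map; replace with the real source.
    def loadReferenceData(): Map[String, String] = Map("k" -> "v")

    object RefData {
      @volatile private var current: Broadcast[Map[String, String]] = _

      def get(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (current == null) refresh(sc)
        current
      }

      // Called from the driver (e.g. inside foreachRDD) when new data must be picked up.
      def refresh(sc: SparkContext): Unit = synchronized {
        val old = current
        current = sc.broadcast(loadReferenceData())
        if (old != null) old.unpersist(blocking = false)
      }
    }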

Reset the offsets, Kafka 0.10 and Spark

2018-06-07 Thread Guillermo Ortiz Fernández
I'm consuming data from Kafka with createDirectStream and storing the offsets in Kafka ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself ) val stream = KafkaUtils.createDirectStream[String, String]( streamingContext, PreferConsistent, Subscribe[String,
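For context (hedged, following the linked "Kafka itself" section of the docs): processed offset ranges are committed back to Kafka through CanCommitOffsets, and an explicit starting position can be forced by passing an offsets map to the consumer strategy. This reuses the excerpt's stream, streamingContext and kafkaParams; the topic name and offset are made up:

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._

    // Commit the processed ranges back to Kafka, as in the linked doc section.
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    // To restart from explicit positions, pass an offsets map to the consumer strategy.
    val fromOffsets = Map(new TopicPartition("my_topic", 0) -> 0L)
    val resetStream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my_topic"), kafkaParams, fromOffsets))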

Measure performance time in some spark transformations.

2018-05-12 Thread Guillermo Ortiz Fernández
I want to measure how long some transformations in Spark take, such as map, joinWithCassandraTable and so on. What is the best approach to do it? def time[R](block: => R): R = { val t0 = System.nanoTime() val result = block val t1 = System.nanoTime()
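The helper is cut off in the archive; a completed sketch follows, plus the usual caveat that transformations such as map are lazy, so the timed block should end in an action for the measurement to mean anything:

    // Completed sketch of the helper from the message; times whatever the block evaluates.
    def time[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block
      val t1 = System.nanoTime()
      println(s"Elapsed: ${(t1 - t0) / 1e6} ms")
      result
    }

    // map/joinWithCassandraTable only build the plan; wrap an action to measure real work:
    // val n = time { rdd.map(parse).count() }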