Re: Problem with Kafka group.id

2020-03-24 Thread Spico Florin
Hello! Maybe you can find more information on the same issue reported here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSourceProvider.html : validateGeneralOptions makes sure that group.id has not been specified and reports an IllegalArgumentException otherwise.
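A minimal sketch of the constraint described above, assuming a Spark 2.x Structured Streaming job, an existing SparkSession named spark, and made-up broker/topic names: the Kafka source manages its own consumer group, so passing kafka.group.id as an option is rejected by KafkaSourceProvider.

    // Hypothetical broker and topic names, for illustration only.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      // .option("kafka.group.id", "my-group") // on Spark 2.x this option makes
      //                                       // validateGeneralOptions throw an
      //                                       // IllegalArgumentException
      .load()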

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Spico Florin
Hi! What file system are you using: EMRFS or HDFS? Also, what memory are you using for the reducer? On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote: > I ran into the same issue processing 20TB of data, with 200k tasks on both > the map and reduce sides. Reducing to 100k tasks each resolved
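A sketch of the mitigation mentioned in the quoted reply, with illustrative numbers only: fewer map and reduce tasks mean fewer MapStatus entries for the driver to serialize in MapOutputTracker, at the cost of larger partitions. An existing DataFrame named input and a SparkSession named spark are assumed.

    // Illustrative values; tune to the data volume and executor memory.
    spark.conf.set("spark.sql.shuffle.partitions", "100000") // reduce-side tasks
    val fewerMapTasks = input.repartition(100000)            // map-side tasks
    // Raising driver memory (e.g. --driver-memory on spark-submit) is the other
    // common lever for this particular OutOfMemoryError.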

[Spark Streaming] Apply multiple ML pipelines(Models) to the same stream

2019-10-31 Thread Spico Florin
Hello! I have a use case where I have to apply multiple already trained models (e.g. M1, M2, ..., Mn) on the same Spark stream (fetched from Kafka). The models were trained using the isolation forest algorithm from here: https://github.com/titicaca/spark-iforest I have found something similar
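A minimal sketch of one way to do this, assuming the trained models can be loaded as Spark ML PipelineModels and applied to a streaming DataFrame; parseFeatures is a hypothetical helper that turns the raw Kafka value into the feature column the models expect, and paths, broker, and topic are made up.

    import org.apache.spark.ml.PipelineModel

    // Assumed model paths; load each pre-trained model once on the driver.
    val models = Seq("/models/m1", "/models/m2").map(PipelineModel.load)

    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // assumed broker
      .option("subscribe", "measurements")               // assumed topic
      .load()
    val features = parseFeatures(kafkaStream)            // hypothetical parsing step

    // Apply every model to the same streaming DataFrame; each transform becomes
    // an independent streaming query.
    val queries = models.zipWithIndex.map { case (model, i) =>
      model.transform(features).writeStream
        .format("console")
        .queryName(s"model-$i")
        .start()
    }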

Re: Kafka Integration libraries put in the fat jar

2019-07-31 Thread Spico Florin
dCount On Tue, Jul 30, 2019 at 4:38 PM Spico Florin wrote: > Hello! > > I would like to use Spark Structured Streaming integrated with Kafka > in the way described here: > > https://spark.apache.org/docs/latest/structured-streamin

Kafka Integration libraries put in the fat jar

2019-07-30 Thread Spico Florin
Hello! I would like to use Spark Structured Streaming integrated with Kafka in the way described here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html but I got the following issue: Caused by: org.apache.spark.sql.AnalysisException: Failed to find data
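For what it's worth, this AnalysisException typically appears when the META-INF/services entry that registers the "kafka" data source is lost while building the fat jar. A hedged build.sbt sketch for sbt-assembly, with the Spark version as an assumption (the Maven equivalent is the shade plugin's ServicesResourceTransformer):

    // Bundle the Kafka source in the fat jar (do not mark it "provided") and
    // concatenate service files so DataSourceRegister can still discover "kafka".
    libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.3"
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
      case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
      case _                                         => MergeStrategy.first
    }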

Dataframe Publish to RabbitMQ

2019-06-21 Thread Spico Florin
Hello! Can you please share some code/thoughts on how to publish data from a dataframe to RabbitMQ? Thanks. Regards, Florin
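A minimal sketch of one common pattern, assuming the RabbitMQ Java client is on the classpath, that df is the DataFrame to publish, and using made-up host and queue names: open a connection per partition with foreachPartition and publish each row, here serialized as JSON.

    import com.rabbitmq.client.ConnectionFactory

    df.toJSON.rdd.foreachPartition { rows =>
      val factory = new ConnectionFactory()
      factory.setHost("rabbitmq-host")                              // assumed host
      val connection = factory.newConnection()
      val channel = connection.createChannel()
      channel.queueDeclare("spark-out", true, false, false, null)   // assumed queue
      rows.foreach { json =>
        channel.basicPublish("", "spark-out", null, json.getBytes("UTF-8"))
      }
      channel.close()
      connection.close()
    }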

Re: Run/install tensorframes on zeppelin pyspark

2018-08-10 Thread Spico Florin
forward for your answers. Regards, Florin On Thu, Aug 9, 2018 at 3:52 AM, Jeff Zhang wrote: > > Make sure you use the correct python which has tensorframes installed. Use > PYSPARK_PYTHON > to configure the python > > > > Spico Florin wrote on 2018-08

Run/install tensorframes on zeppelin pyspark

2018-08-08 Thread Spico Florin
Hi! I would like to use tensorframes in my pyspark notebook. I have performed the following: 1. In the Spark interpreter, added a new repository http://dl.bintray.com/spark-packages/maven 2. In the Spark interpreter, added the dependency databricks:tensorframes:0.2.9-s_2.11 3. pip install

Dataframe to automatically create Impala table when writing to Impala

2018-06-22 Thread Spico Florin
Hello! I would like to know whether there is any feature in Spark DataFrame that, when writing data to an Impala table, also creates that table when it was not previously created in Impala. For example, the code: myDatafarme.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, "books",
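As far as I know, Spark's JDBC writer does issue a CREATE TABLE itself when the target table does not exist (and drops and recreates it with SaveMode.Overwrite), provided the target database accepts the generated DDL. A hedged sketch, with the Impala JDBC URL and driver class name as assumptions:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver") // assumed driver class
    myDatafarme.write
      .mode(SaveMode.Overwrite)
      .jdbc("jdbc:impala://impala-host:21050/default", "books", props) // assumed URL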

Spark 1.6 change the number of partitions without repartition and without shuffling

2018-06-13 Thread Spico Florin
Hello! I have a parquet file of 60 MB representing 10 million records. When I read this file using Spark 2.3.0 with the configuration spark.sql.files.maxPartitionBytes=1024*1024*2 (=2 MB), I got 29 partitions as expected. Code: sqlContext.setConf("spark.sql.files.maxPartitionBytes",
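A small self-contained sketch of the setup described above, with the parquet path as an assumption: set the split size to 2 MB and count the resulting input partitions.

    sqlContext.setConf("spark.sql.files.maxPartitionBytes", (2 * 1024 * 1024).toString)
    val df = sqlContext.read.parquet("/data/my-60mb-file.parquet") // assumed path
    println(df.rdd.getNumPartitions) // roughly 30 partitions for 60 MB at 2 MB per split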

Re: testing frameworks

2018-06-04 Thread Spico Florin
> applications - > http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/ > > On Wed, May 30, 2018 at 4:14 AM Spico Florin > wrote: > >> Hello! >> I'm also looking for unit testing spark Java application. I've seen the >> great work done in spa

Re: testing frameworks

2018-05-30 Thread Spico Florin
Hello! I'm also looking into unit testing a Spark Java application. I've seen the great work done in spark-testing-base, but it seemed to me that I could not use it for Spark Java applications. Are only Spark Scala applications supported? Thanks. Regards, Florin On Wed, May 23, 2018 at 8:07 AM,

Re: Does saveAsHadoopFile depend on master?

2016-06-22 Thread Spico Florin
Hi! I had a similar issue when the user that submitted the job to the Spark cluster didn't have permission to write into HDFS. If you have the HDFS GUI then you can check which users exist and what permissions they have. You can also check in the HDFS browser: (

Re: Using Log4j for logging messages inside lambda functions

2015-05-26 Thread Spico Florin
. On Mon, May 25, 2015 at 11:05 PM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to use the logging mechanism provided by log4j, but I'm getting: Exception in thread "main" org.apache.spark.SparkException: Task not serializable - Caused

Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Spico Florin
Hello! I would like to use the logging mechanism provided by log4j, but I'm getting: Exception in thread "main" org.apache.spark.SparkException: Task not serializable - Caused by: java.io.NotSerializableException: org.apache.log4j.Logger The code (and the problem) that I'm using resembles
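A common workaround for this NotSerializableException, shown as a minimal sketch (rdd is an assumed RDD): keep the Logger in an object behind a @transient lazy val, so each executor creates its own instance instead of the closure trying to ship the driver's Logger.

    import org.apache.log4j.Logger

    object LambdaLog extends Serializable {
      // Recreated lazily on each JVM; never serialized with the closure.
      @transient lazy val log: Logger = Logger.getLogger(getClass.getName)
    }

    rdd.map { x =>
      LambdaLog.log.info(s"processing $x")
      x
    }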

Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Spico Florin
Hello! I know that HadoopRDD partitions are built based on the number of splits in HDFS. I'm wondering whether these partitions preserve the initial order of the data in the file. As an example, if I have an HDFS file (myTextFile) that has these splits: split 0 - line 1, ..., line k; split 1 - line k+1, ...,
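A small check sketch, with the file path as an assumption: zipWithIndex assigns indices first by partition order and then by element order within each partition, so for a line-based HadoopRDD the index corresponds to the original line position.

    sc.textFile("/data/myTextFile")   // assumed path
      .zipWithIndex()
      .take(5)
      .foreach { case (line, idx) => println(s"$idx: $line") }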

Spark RDD sortByKey triggering a new job

2015-04-24 Thread Spico Florin
I have tested the sortByKey method with the following code and I have observed that it triggers a new job when it is called. I could not find this documented either in the API or in the code. Is this an intended behavior? For example, the RDD zipWithIndex method's API specifies that it will trigger a new job. But
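A quick way to observe the behavior described above, with made-up data: the extra job comes from sortByKey sampling the keys to build its RangePartitioner, while the sort itself only runs when a later action is executed.

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    val sorted = pairs.sortByKey()   // a sampling job already shows up in the UI
    sorted.count()                   // the actual sort runs here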

Order of execution of tasks inside of a stage and computing the number of stages

2015-04-20 Thread Spico Florin
Hello! I'm a newbie in Spark and I would like to understand some basic mechanisms of how it works behind the scenes. I have attached the lineage of my RDD and I have the following questions: 1. Why do I have 8 stages instead of 5? From the book Learning Spark (Chapter 8 - http://bit.ly/1E0Hah7), I

Re: Save org.apache.spark.mllib.linalg.Matrix to a file

2015-04-16 Thread Spico Florin
wrote: You can turn the Matrix to an Array with .toArray and then: 1- Write it using Scala/Java IO to the local disk of the driver 2- parallelize it and use .saveAsTextFile 2015-04-15 14:16 GMT+02:00 Spico Florin spicoflo...@gmail.com: Hello! The result of correlation in Spark MLLib
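A minimal sketch of option 2 from the reply above, with the output path as an assumption: turn the local Matrix into one line of comma-separated values per row and save it through an RDD.

    // correlMatrix is the org.apache.spark.mllib.linalg.Matrix from Statistics.corr.
    val rows = (0 until correlMatrix.numRows).map { i =>
      (0 until correlMatrix.numCols).map(j => correlMatrix(i, j)).mkString(",")
    }
    sc.parallelize(rows, numSlices = 1).saveAsTextFile("/tmp/correlation-matrix") // assumed path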

Save org.apache.spark.mllib.linalg.Matrix to a file

2015-04-15 Thread Spico Florin
Hello! The result of a correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix. (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations) val data: RDD[Vector] = ... val correlMatrix: Matrix = Statistics.corr(data, "pearson") I would like to save the

Matrix Transpose

2015-04-02 Thread Spico Florin
Hello! I have a CSV file with the following content: C1;C2;C3 11;22;33 12;23;34 13;24;35 What is the best approach to use Spark (API, MLlib) for achieving its transpose? C1 11 12 13 C2 22 23 24 C3 33 34 35 I look forward to your solutions and suggestions (some Scala code will be
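One possible approach, as a hedged sketch with assumed file paths: key every cell by its column index together with its row index, group by column, and rebuild each output line in row order. For data this small, the whole matrix could also simply be collected and transposed on the driver.

    val cells = sc.textFile("/data/matrix.csv")             // assumed path
      .map(_.split(";"))
      .zipWithIndex()
    val transposed = cells
      .flatMap { case (row, rowIdx) =>
        row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
      }
      .groupByKey()
      .sortByKey()
      .map { case (_, colCells) => colCells.toSeq.sortBy(_._1).map(_._2).mkString(" ") }
    transposed.saveAsTextFile("/data/matrix-transposed")    // assumed path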

Re: Optimal solution for getting the header from CSV with Spark

2015-03-25 Thread Spico Florin
://polyglotprogramming.com On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I would like to know what the optimal solution is for getting the header from a CSV file with Spark. My approach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter

Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Spico Florin
Hello! I would like to know what the optimal solution is for getting the header from a CSV file with Spark. My approach was: def getHeader(data: RDD[String]): String = { data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() } Thanks.
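A simpler alternative to the zipWithIndex approach above, assuming the header is always the first line of the file: RDD.first() only scans the first partition, so it avoids a full pass over the data.

    import org.apache.spark.rdd.RDD

    def getHeader(data: RDD[String]): String = data.first()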

Re: Number of Executors per worker process

2015-03-02 Thread Spico Florin
) and driver JVMs ps aux | grep spark.executor => will show the actual worker JVMs 2015-02-25 14:23 GMT+01:00 Spico Florin spicoflo...@gmail.com: Hello! I've read the documentation about the Spark architecture, and I have the following questions: 1: how many executors can be on a single worker

Number of Executors per worker process

2015-02-25 Thread Spico Florin
Hello! I've read the documentation about the Spark architecture, and I have the following questions: 1: How many executors can be on a single worker process (JVM)? 2: Should I think of an executor like a Java thread executor whose pool size is equal to the number of given cores (set up by the

MLlib usage on Spark Streaming

2015-02-16 Thread Spico Florin
Hello! I'm a newbie to Spark and I have the following case study: 1. A client sends every 100 ms the following data: {uniqueId, timestamp, measure1, measure2} 2. Every 30 seconds I would like to correlate the data collected in the window with some predefined double-vector pattern for each given

Parsing CSV files in Spark

2015-02-06 Thread Spico Florin
Hi! I'm new to Spark. I have a case study where the data is stored in CSV files. These files have headers with more than 1000 columns. I would like to know the best practices for parsing them, in particular regarding the following points: 1. Getting and parsing all the files from a folder 2.
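A hedged sketch using the spark-csv package that appears elsewhere in this list, assuming a Spark version with the DataFrameReader API (1.4+) and made-up folder path and delimiter: the header option handles the 1000-plus column header line, and a whole folder can be loaded in one call.

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ";")          // assumed delimiter
      .load("/data/csv-folder/*.csv")    // assumed folder
    df.printSchema()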

Errors in the workers machines

2015-02-05 Thread Spico Florin
Hello! I received the following errors in the workerLog.log files: ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660] -> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with [akka.tcp://sparkExecutor@stream4:47929]] [

Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Spico Florin
Hello! I'm using spark-csv_2.10 with Java from the Maven repository: <groupId>com.databricks</groupId> <artifactId>spark-csv_2.10</artifactId> <version>0.1.1</version> I would like to use Spark SQL to filter out my data. I'm using the following code: JavaSchemaRDD cars = new