Hello!
Maybe you can find more information on the same issue reported here:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSourceProvider.html
validateGeneralOptions makes sure that group.id has not been specified and
reports an IllegalArgumentException if it has.
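For reference, a minimal sketch of reading from Kafka with Structured
Streaming while letting Spark manage the consumer group itself (the broker
address "host:9092" and topic "events" are placeholders):

import org.apache.spark.sql.SparkSession

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

    // Do not set "kafka.group.id": on Spark 2.x the Kafka source rejects it
    // with an IllegalArgumentException, as described above.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // placeholder broker
      .option("subscribe", "events")                  // placeholder topic
      .load()

    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}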
Hi!
What file system are you using: EMRFS or HDFS?
Also, how much memory are you using for the reducer?
On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote:
> I ran into the same issue processing 20TB of data, with 200k tasks on both
> the map and reduce sides. Reducing to 100k tasks each resolved
Hello!
I have a use case where I have to apply multiple already trained models
(e.g. M1, M2, ..., Mn) on the same Spark stream (fetched from Kafka).
The models were trained using the isolation forest algorithm from here:
https://github.com/titicaca/spark-iforest
I have found something similar
dCount
On Tue, Jul 30, 2019 at 4:38 PM Spico Florin wrote:
> Hello!
>
> I would like to use Spark Structured Streaming integrated with Kafka
> the way it is described here:
>
> https://spark.apache.org/docs/latest/structured-streamin
Hello!
I would like to use Spark Structured Streaming integrated with Kafka
the way it is described here:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
but I got the following issue:
Caused by: org.apache.spark.sql.AnalysisException: Failed to find data
Hello!
Can you please share some code/thoughts on how to publish data from a
dataframe to RabbitMQ?
Thanks.
Regards,
Florin
forward to your answers.
Regards,
Florin
On Thu, Aug 9, 2018 at 3:52 AM, Jeff Zhang wrote:
>
> Make sure you use the correct Python which has tensorframes installed. Use
> PYSPARK_PYTHON
> to configure which Python is used
>
>
>
> Spico Florin wrote on August 2018
Hi!
I would like to use tensorframes in my pyspark notebook.
I have performed the following:
1. In the Spark interpreter, added a new repository:
http://dl.bintray.com/spark-packages/maven
2. In the Spark interpreter, added the
dependency databricks:tensorframes:0.2.9-s_2.11
3. pip install
Hello!
I would like to know if there is any feature of the Spark DataFrame writer
that, when writing data to an Impala table, also creates that table if it
was not previously created in Impala.
For example, the code:
myDataframe.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, "books",
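For what it's worth, a minimal sketch of such a write; it assumes an
Impala-accessible JDBC endpoint, and the URL, driver class and sample data
below are placeholders. As far as I know the JDBC writer creates the target
table from the DataFrame schema when it does not already exist:

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object JdbcWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("jdbc-write-sketch").getOrCreate()
    import spark.implicits._

    val myDataframe = Seq((1, "Spark"), (2, "Impala")).toDF("id", "title")

    val jdbcURL = "jdbc:impala://impala-host:21050/default"          // placeholder URL
    val props = new Properties()
    props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver") // placeholder driver class

    // With SaveMode.Overwrite the JDBC writer drops/creates the target table
    // from the DataFrame schema when it does not already exist; whether Impala
    // accepts the generated DDL depends on the driver and dialect.
    myDataframe.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, "books", props)
  }
}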
Hello!
I have a parquet file of 60 MB representing 10 million records.
When I read this file using Spark 2.3.0 and with the
configuration spark.sql.files.maxPartitionBytes=1024*1024*2 (=2MB) I got 29
partitions as expected.
Code:
sqlContext.setConf("spark.sql.files.maxPartitionBytes",
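A minimal sketch of the setup described above, using the 2 MB value from the
text (the parquet path is a placeholder):

import org.apache.spark.sql.SparkSession

object MaxPartitionBytesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("max-partition-bytes-sketch").getOrCreate()

    // 2 MB per partition, the value used in the text above.
    spark.conf.set("spark.sql.files.maxPartitionBytes", (2 * 1024 * 1024).toString)

    val df = spark.read.parquet("/path/to/file.parquet") // placeholder path
    println(s"partitions = ${df.rdd.getNumPartitions}")
  }
}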
> applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin
> wrote:
>
>> Hello!
>> I'm also looking for unit testing spark Java application. I've seen the
>> great work done in spa
Hello!
I'm also looking into unit testing Spark Java applications. I've seen the
great work done in spark-testing-base, but it seemed to me that I could not
use it for Spark Java applications.
Are only Spark Scala applications supported?
Thanks.
Regards,
Florin
On Wed, May 23, 2018 at 8:07 AM,
Hi!
I had a similar issue when the user that submitted the job to the Spark
cluster didn't have permission to write into HDFS. If you have the HDFS
GUI then you can check which users there are and what permissions they
have. You can also check in the HDFS file browser.
On Mon, May 25, 2015 at 11:05 PM, Spico Florin spicoflo...@gmail.com
wrote:
Hello!
I would like to use the logging mechanism provided by log4j, but
I'm getting the
Exception in thread main org.apache.spark.SparkException: Task not
serializable - Caused
Hello!
I would like to use the logging mechanism provided by log4j, but I'm
getting the
Exception in thread main org.apache.spark.SparkException: Task not
serializable - Caused by: java.io.NotSerializableException:
org.apache.log4j.Logger
The code (and the problem) that I'm using resembles
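A minimal sketch of one common workaround: keep the Logger as a @transient
lazy val so it is re-created where the code runs instead of being serialized
with the closure (all names below are illustrative):

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object LoggingSketch extends Serializable {
  // @transient lazy: the Logger is not shipped with the closure and is
  // re-initialized lazily wherever the code runs.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logging-sketch"))

    sc.parallelize(1 to 10).foreach { x =>
      log.warn(s"processing $x") // runs on the executors without serializing a Logger
    }

    sc.stop()
  }
}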
Hello!
I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering if these partitions preserve the initial order of
the data in the file.
As an example, if I have an HDFS file (myTextFile) that has these splits:
split 0 - line 1, ..., line k
split 1 - line k+1, ...,
I have tested the sortByKey method with the following code and I have
observed that it triggers a new job when it is called. I could not find this
documented either in the API or in the code. Is this an intended behavior?
For example, the RDD zipWithIndex method API specifies that it will trigger a
new job. But
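A small sketch that makes the behavior visible; as far as I can tell the
extra job comes from RangePartitioner sampling the keys to compute partition
bounds when sortByKey is called:

import org.apache.spark.{SparkConf, SparkContext}

object SortByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sortbykey-sketch").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(3 -> "c", 1 -> "a", 2 -> "b"))

    // Watch the Spark UI: this line alone already runs a (sampling) job,
    // even though no action has been called yet.
    val sorted = pairs.sortByKey()

    // The sort itself happens when an action is finally invoked.
    println(sorted.collect().mkString(", "))

    sc.stop()
  }
}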
Hello!
I'm a newbie in Spark and I would like to understand some basic mechanisms
of how it works behind the scenes.
I have attached the lineage of my RDD and I have the following questions:
1. Why do I have 8 stages instead of 5? From the book Learning Spark
(Chapter 8 - http://bit.ly/1E0Hah7), I
wrote:
You can turn the Matrix into an Array with .toArray and then:
1. Write it using Scala/Java IO to the local disk of the driver
2. Parallelize it and use .saveAsTextFile
2015-04-15 14:16 GMT+02:00 Spico Florin spicoflo...@gmail.com:
Hello!
The result of correlation in Spark MLlib
Hello!
The result of correlation in Spark MLlib is of type
org.apache.spark.mllib.linalg.Matrix. (see
http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations)
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the
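A minimal sketch combining the two options from the reply above; the input
vectors and the output path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

object CorrelationSaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("correlation-save-sketch"))

    val data: RDD[Vector] = sc.parallelize(Seq(      // placeholder input vectors
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(2.0, 4.0, 5.0),
      Vectors.dense(3.0, 6.0, 8.0)))

    val correlMatrix: Matrix = Statistics.corr(data, "pearson")

    // Option 2 from the reply: turn the matrix into rows and use saveAsTextFile.
    val rows = correlMatrix.toArray
      .grouped(correlMatrix.numRows)   // toArray is column-major, so each group is a column
      .toSeq
      .transpose                       // back to rows (symmetric here, but kept for clarity)
      .map(_.mkString(","))

    sc.parallelize(rows).saveAsTextFile("/tmp/correlation-matrix") // placeholder output path

    sc.stop()
  }
}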
Hello!
I have a CSV file that has the following content:
C1;C2;C3
11;22;33
12;23;34
13;24;35
What is the best approach, using Spark (API, MLlib), for achieving its
transpose?
C1 11 12 13
C2 22 23 24
C3 33 34 35
I look forward to your solutions and suggestions (some Scala code will be
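One possible approach, sketched with plain RDD operations; the file path is a
placeholder and every row is assumed to have the same number of columns:

import org.apache.spark.{SparkConf, SparkContext}

object TransposeCsvSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transpose-csv-sketch"))

    val lines = sc.textFile("/path/to/data.csv") // placeholder path

    val transposed = lines
      .zipWithIndex()                            // remember the original row order
      .flatMap { case (line, rowIdx) =>
        line.split(";").zipWithIndex.map { case (cell, colIdx) =>
          (colIdx, (rowIdx, cell))               // key every cell by its column index
        }
      }
      .groupByKey()
      .sortByKey()
      .map { case (_, cells) =>
        cells.toSeq.sortBy(_._1).map(_._2).mkString(" ")
      }

    // For the sample input this prints:
    // C1 11 12 13
    // C2 22 23 24
    // C3 33 34 35
    transposed.collect().foreach(println)

    sc.stop()
  }
}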
://polyglotprogramming.com
On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin spicoflo...@gmail.com
wrote:
Hello!
I would like to know what the optimal solution is for getting the header
from a CSV file with Spark. My approach was:
def getHeader(data: RDD[String]): String = {
data.zipWithIndex().filter
Hello!
I would like to know what the optimal solution is for getting the header
from a CSV file with Spark. My approach was:
def getHeader(data: RDD[String]): String = {
data.zipWithIndex().filter(_._2 == 0).map(x => x._1).take(1).mkString() }
Thanks.
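For comparison, a shorter alternative sketch based on RDD.first(), which only
fetches the first element (the file path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CsvHeaderSketch {
  // first() only fetches the first element, so for a single CSV file this
  // returns the header line without scanning the whole RDD.
  def getHeader(data: RDD[String]): String = data.first()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-header-sketch"))

    val data = sc.textFile("/path/to/data.csv") // placeholder path
    val header = getHeader(data)

    // The remaining lines can then be filtered against the header if needed.
    val body = data.filter(_ != header)

    println(header)
    sc.stop()
  }
}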
) and
driver JVMs
ps aux | grep spark.executor = will show the actual worker JVMs
2015-02-25 14:23 GMT+01:00 Spico Florin spicoflo...@gmail.com:
Hello!
I've read the documentation about the spark architecture, I have the
following questions:
1: how many executors can be on a single worker
Hello!
I've read the documentation about the Spark architecture and I have the
following questions:
1: How many executors can be on a single worker process (JVM)?
2: Should I think of an executor like a Java thread pool executor, where the
pool size is equal to the number of the given cores (set up by the
Hello!
I'm a newbie to Spark and I have the following case study:
1. A client sends the following data every 100 ms:
{uniqueId, timestamp, measure1, measure2}
2. Every 30 seconds I would like to correlate the data collected in the
window with some predefined double vector pattern for each given
Hi!
I'm new to Spark. I have a case study where the data is stored in CSV
files. These files have headers with more than 1000 columns. I would like
to know what the best practices are for parsing them, in particular the
following points:
1. Getting and parsing all the files from a folder
2.
Hello!
I received the following errors in the workerLog.log files:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
- [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with
[akka.tcp://sparkExecutor@stream4:47929]] [
Hello!
I'm using spark-csv 2.10 with Java from the maven repository
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>0.1.1</version>
I would like to use Spark-SQL to filter out my data. I'm using the
following code:
JavaSchemaRDD cars = new