Hi Sunitha,
In the mapper function, you cannot update outer variables such as
`personLst.add(person)`.
The closure is serialized and executed on the executors, so the driver-side
list never sees those changes; that's why you got an empty list.
You can use `rdd.collect()` to get a local list of `Person` objects first,
then safely iterate over the local list and do any further processing, such
as calling your common interface function for each person.
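For illustration, a minimal Scala sketch of that approach (Person, the column names and processPerson are placeholders, not your actual code):
```
import org.apache.spark.sql.SparkSession

case class Person(id: Long, name: String)

// Stand-in for the common interface method that is called per person
def processPerson(id: Long): Unit = println(s"processing person $id")

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")

// Map each row to a Person, bring the result back to the driver with collect(),
// then iterate locally and call the other method
val people: List[Person] = df.as[Person].collect().toList
people.foreach(p => processPerson(p.id))
```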
Hi Jorn,
In my case I have to call a common interface function,
passing the values of each RDD record. I have tried iterating, but I was
not able to trigger the common function from the call() method, as commented in the
code snippet in my earlier mail.
I request you to please share your views.
Regards
Sunitha
This is correct behavior. If you need to call another method, simply append
another map, flatMap or whatever you need.
Depending on your use case you may also use reduce or reduceByKey.
However, you should never (!) use a global variable as in your snippet. This cannot
work, because you work in a distributed setting: the variable is copied to each
executor, and changes made there never reach the driver.
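For illustration, a minimal sketch of chaining another map instead of mutating an outer variable (Person and lookupScore are placeholder names):
```
import org.apache.spark.sql.SparkSession

case class Person(id: Long, name: String)
// Stand-in for the "other method" that must be called per record
def lookupScore(id: Long): Double = id * 0.1

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val scores = Seq((1L, "a"), (2L, "b")).toDF("id", "name")
  .as[Person]
  .map(p => (p.id, lookupScore(p.id)))  // chain another map; no outer variable needed
  .toDF("id", "score")

scores.show()
```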
Hi Deepak,
I am able to map a row to the Person class; the issue is that I want to call another
method.
I tried converting to a list, but it does not work without using collect().
Regards
Sunitha
On Tuesday, December 19, 2017, Deepak Sharma wrote:
> I am not sure about java but in scala
I am looking for the same answer too... will wait for responses from other people.
Sent from my iPhone
> On Dec 18, 2017, at 10:56 PM, Gaurav1809 wrote:
>
> Hi All,
>
> Will Big Data tools & technologies work with Blockchain in the future? Any possible
> use cases that anyone is
Hi All,
Will Big Data tools & technologies work with Blockchain in the future? If there are
any possible use cases that anyone is likely to face, please share.
Thanks
Gaurav
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
I am not sure about Java, but in Scala it would be something like
df.rdd.map { x => MyClass(x.getString(0), ...) }
HTH
--Deepak
On Dec 19, 2017 09:25, "Sunitha Chennareddy"
wrote:
Hi All,
I am new to Spark. I want to convert a DataFrame to a List without
using
Hi All,
I am new to Spark. I want to convert a DataFrame to a List without
using collect().
The main requirement is that I need to iterate through the rows of the DataFrame and
call another function, passing a column value of each row (person.getId()).
Here is the snippet I have tried. Kindly help me to resolve this.
Hi Marco,
If you add the assembler as the first stage of the pipeline, like:
```
val pipeline = new Pipeline()
.setStages(Array(assembler, labelIndexer, featureIndexer, dt,
labelConverter))
```
Which error did you get?
I think it can work fine if the `assembler` is added to the pipeline.
Thanks.
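For reference, here is a minimal, self-contained sketch of that ordering with the assembler as the first stage (the toy data and column names are placeholders for illustration):
```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Toy data: two raw numeric columns plus a string label
val data = spark.createDataFrame(Seq(
  (0.0, 1.1, "yes"),
  (2.0, 1.0, "no"),
  (4.0, 10.0, "yes"),
  (5.0, 0.5, "no")
)).toDF("f1", "f2", "label")

// The assembler runs first and produces the "features" column the later stages expect
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4)
val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

val model = pipeline.fit(data)
model.transform(data).select("predictedLabel", "label").show()
```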
On
This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes.
Is anything wrong with the code below, or are there any known issues with Spark not cleaning up /tmp files?
window = Window.\
partitionBy('***', 'date_str').\
Hi All,
I've used CountVectorizerModel in Spark ML and got the tf-idf of the words.
The output column of the df looks like:
Hi,
You can use https://twitter.github.io/algebird/ which provides an
implementation of interesting Monoids and ways to combine them into tuples
(or products) of Monoids. Of course, you are not bound to use the algebird
library, but it might be helpful to bootstrap.
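For illustration, a rough sketch of the tuple-of-monoids idea with algebird on an RDD (assuming the algebird dependency is available; the Event record and its fields are made up):
```
import com.twitter.algebird.Monoid
import org.apache.spark.sql.SparkSession

case class Event(key: String, value: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val events = spark.sparkContext.parallelize(
  Seq(Event("a", 1.0), Event("b", 2.5), Event("a", 0.5)))

// Each record becomes a tuple of partial aggregates: (count, sum, per-key counts).
// algebird's implicit tuple and map Monoids combine all three in a single pass.
val (count, total, perKey) =
  events
    .map(e => (1L, e.value, Map(e.key -> 1L)))
    .reduce((a, b) => Monoid.plus(a, b))

println(s"count=$count total=$total perKey=$perKey")
```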
On Mon, Dec 18, 2017 at 7:18
It seems interesting; however, scalding seems to require being used outside of
Spark?
On Mon, Dec 18, 2017 at 17:15, Anastasios Zouzias wrote:
> Hi Julien,
>
> I am not sure if my answer applies on the streaming part of your question.
> However, in batch processing, if you
Gourav, Yes, sorry. Apparently I failed to mention I'm having these problems
with Spark consuming from a Kinesis stream. Been putting in late nights to
figure this out and it's affecting my brain. :^)
-jeremy
--
Jeremy Kelley | Technical Director, Data
jkel...@carbonblack.com | Carbon
Hello,
I'm working with the ML package for regression purposes and I get good
results on my data.
I'm now trying to get multiple metrics at once; right now, I'm doing
what is suggested by the examples here:
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html
Basically
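For context, a minimal sketch of evaluating several regression metrics on the same predictions (the column names and the predictions DataFrame are assumptions; note each metric is a separate evaluate() call, not a single pass):
```
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame

// `predictions` is assumed to be the output of model.transform(testData),
// containing "label" and "prediction" columns.
def allMetrics(predictions: DataFrame): Map[String, Double] = {
  val evaluator = new RegressionEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")

  Seq("rmse", "mse", "mae", "r2")
    .map(metric => metric -> evaluator.setMetricName(metric).evaluate(predictions))
    .toMap
}
```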
Hi Julien,
I am not sure if my answer applies to the streaming part of your question.
However, in batch processing, if you want to perform multiple aggregations
over an RDD in a single pass, a common approach is to use multiple
aggregators (a.k.a. tuple monoids); see below an example from
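(The referenced example was cut off in the archive; below is a hedged sketch of what a single-pass, multi-aggregation over an RDD can look like, using a tuple of accumulators with plain Spark.)
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val values = spark.sparkContext.parallelize(Seq(3.0, 1.0, 4.0, 1.5))

// One pass over the data computes count, sum, min and max together.
// The zero value plus the two combine functions play the role of a tuple monoid.
val zero = (0L, 0.0, Double.MaxValue, Double.MinValue)
val (count, sum, min, max) = values.aggregate(zero)(
  (acc, v) => (acc._1 + 1, acc._2 + v, math.min(acc._3, v), math.max(acc._4, v)),
  (a, b)   => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
)

println(s"count=$count sum=$sum min=$min max=$max")
```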
Thanks, Timur.
The problem is that if I run `foreachPartition`, then I cannot `start` the
streaming query. Or perhaps I am missing something.
From: Timur Shenkao
Sent: 18 December 2017 16:11:06
To: Liana Napalkova
Cc: Silvio Fiorito;
You can't create a network connection to Kafka on the driver and then
serialize it to send it to the executor. That's likely why you're getting
serialization errors.
Kafka producers are thread-safe and designed for use as a singleton.
Use a lazy singleton instance of the producer on the executor, so that each
executor JVM creates its own connection.
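A minimal sketch of that pattern (the broker address and topic are placeholders, and `df` stands for the already-transformed DataFrame):
```
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.Row

// One producer per executor JVM: a Scala object is never serialized and its
// lazy val is initialized on first use on whichever JVM touches it.
object ProducerHolder {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

// Send from each partition without shipping a producer from the driver
df.foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    ProducerHolder.producer.send(
      new ProducerRecord[String, String]("output-topic", row.toString))  // placeholder topic
  }
}
```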
If there is no other way, then I will follow this recommendation.
From: Silvio Fiorito
Sent: 18 December 2017 16:20:03
To: Liana Napalkova; user@spark.apache.org
Subject: Re: How to properly execute `foreachPartition` in Spark 2.2
Couldn’t you readStream from Kafka, do your transformations, map your rows from
the transformed input into what you need to send to Kafka, then
writeStream to Kafka?
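For illustration, a minimal sketch of that flow with the Spark 2.2 structured streaming Kafka source and sink (broker, topics, checkpoint path and the transformation itself are placeholders):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Read the input topic as a stream
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "input-topic")                    // placeholder
  .load()

// Stand-in transformation; the Kafka sink expects a string or binary "value" column
val transformed = input
  .selectExpr("CAST(value AS STRING) AS value")
  .select(upper($"value").as("value"))

// Write the transformed rows back out to Kafka
val query = transformed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder
  .option("topic", "output-topic")                             // placeholder
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")  // placeholder
  .start()

query.awaitTermination()
```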
From: Liana Napalkova
Date: Monday, December 18, 2017 at 10:07 AM
To: Silvio Fiorito
Spark Dataset / DataFrame has foreachPartition() as well. Its
implementation is much more efficient than the RDD one.
There are tons of code snippets, say
I first need to read from a Kafka queue into a DataFrame. Then I should perform
some transformations on the data. Finally, for each row in the DataFrame I
should conditionally apply a KafkaProducer in order to send some data to Kafka.
So, I am both consuming and producing the data from/to
Hi Timur,
Thanks for your interest in SANSA.
The intermediate results are mostly stored in RDDs (sometimes in Parquet
files).
The use of GraphFrames is being planned, as they are not officially
integrated with Spark yet.
Please feel free to contact us in case of any questions.
Regards
Hajira
Why don’t you just use the Kafka sink for Spark 2.2?
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html#creating-a-kafka-sink-for-streaming-queries
From: Liana Napalkova
Date: Monday, December 18, 2017 at 9:45 AM
To:
Hi,
I wonder how to properly execute `foreachPartition` in Spark 2.2. Below I
explain the problem in detail. I appreciate any help.
In Spark 1.6 I was doing something similar to this:
DstreamFromKafka.foreachRDD(session => {
session.foreachPartition { partitionOfRecords =>
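(The snippet above is cut off in the archive; a typical completion of that 1.6-style pattern, with a producer created per partition on the executor, looks roughly like the sketch below. Broker, topic and record handling are placeholders.)
```
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

DstreamFromKafka.foreachRDD(session => {
  session.foreachPartition { partitionOfRecords =>
    // Create the producer on the executor, once per partition
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    partitionOfRecords.foreach { record =>
      producer.send(new ProducerRecord[String, String]("output-topic", record.toString))
    }
    producer.close()
  }
})
```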
Hi Don,
It’s not so much map() vs flatMap(). You can return a collection and have
Spark flatten the result.
My point was more to change from Seq[BigDataStructure] to
Seq[SmallDataStructure].
If the use case is really storing image data, I would try to use
Seq[Vector] and store the values as a
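(For illustration, a small sketch of returning a collection from flatMap so that Spark flattens one big record into many small ones; BigRecord and SmallRecord are made-up names.)
```
import org.apache.spark.sql.SparkSession

// Made-up shapes: one large record holding many values, flattened into many small ones
case class BigRecord(id: Long, pixels: Seq[Double])
case class SmallRecord(id: Long, pixel: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val big = Seq(BigRecord(1L, Seq(0.1, 0.2)), BigRecord(2L, Seq(0.3))).toDS()

// flatMap returns a collection per input record; Spark flattens the result
val small = big.flatMap(b => b.pixels.map(p => SmallRecord(b.id, p)))

small.show()
```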
Hi Esa,
I'm using it like this:
https://gist.github.com/tromika/1cda392242fdd66befe7970d80380216
cheers,
2017-12-16 11:04 GMT+01:00 Esa Heikkinen :
> Hi
>
> Does anyone have any hints or example (code) how to get combination:
> Windows10 + pyspark + ipython notebook +
The tables created are owned by the user livy:
drwxrwxrwx+ - livy hdfs 0 2017-12-18 09:38
/apps/hive/warehouse/dev.db/tbl
1. Df.write.saveAsTable()
2. spark.sql("CREATE TABLE tbl (key INT, value STRING)")
whereas the table created from the hive shell is owned by 'hive'/proxyuser:
Hello
Thank you for the very interesting work!
The questions are:
1) Where do you store final or intermediate results? Parquet,
JanusGraph, Cassandra?
2) Is there integration with Spark GraphFrames?
Sincerely yours, Timur
On Mon, Dec 18, 2017 at 9:21 AM, Hajira Jabeen
I've been looking at several solutions, but I can't find an efficient way
to compute many window functions (optimized computation or efficient
parallelism).
Am I the only one interested in this?
Regards,
Julien
Le ven. 15 déc. 2017 à 21:34, Julien CHAMP
Dear all,
The Smart Data Analytics group [1] is happy to announce SANSA 0.3 - the
third release of the Scalable Semantic Analytics Stack. SANSA employs
distributed computing via Apache Spark and Flink in order to allow scalable
machine learning, inference and querying capabilities for large