Re: Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Weichen Xu
Hi Sunitha, In the mapper function you cannot update outer variables such as `personLst.add(person)`; that does not work, which is why you got an empty list. You can use `rdd.collect()` to get a local list of `Person` objects first, then you can safely iterate over the local list and do any
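A minimal sketch of the two patterns described above, assuming a simple `Person` case class (the class and field names are illustrative, not taken from the original snippet): mutating a driver-side list inside a transformation has no effect on a cluster, while `collect()` brings the data back to the driver where a local list can be iterated safely.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

case class Person(id: Long, name: String)

object CollectExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(Person(1, "a"), Person(2, "b")).toDF()

    // Broken pattern (in spirit, what the original code did): the ListBuffer
    // lives on the driver, and each task mutates its own deserialized copy of
    // the closure, so on a cluster the driver-side list stays empty.
    val personLst = ListBuffer[Person]()
    df.as[Person].rdd.map { p => personLst += p; p }.count()
    println(s"driver-side list size: ${personLst.size}")

    // Working pattern: collect() returns the rows to the driver as a local
    // collection that can be iterated and passed to other functions.
    val localPeople: Array[Person] = df.as[Person].collect()
    localPeople.foreach(p => println(p.id))

    spark.stop()
  }
}
```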

Re: Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Sunitha Chennareddy
Hi Jorn, In my case I have to call a common interface function, passing it the values of each RDD. I have tried iterating, but I was not able to trigger the common function from the call method, as commented in the snippet code in my earlier mail. Request you to please share your views. Regards Sunitha

Re: Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Jörn Franke
This is correct behavior. If you need to call another method, simply append another map, flatMap or whatever you need. Depending on your use case you may also use reduce and reduceByKey. However, you should never (!) use a global variable as in your snippet. This cannot work because you work in
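A minimal sketch of what "append another map" means here, with a hypothetical `commonFunction` standing in for the interface call from the original snippet: the result of the call becomes a new RDD instead of being pushed into a shared variable.

```scala
import org.apache.spark.sql.SparkSession

object ChainedMaps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("chained-maps").master("local[*]").getOrCreate()

    // Stand-in for the "common interface function" mentioned in the thread.
    val commonFunction: Long => String = id => s"processed-$id"

    val ids = spark.sparkContext.parallelize(Seq(1L, 2L, 3L))

    // Instead of mutating a global variable inside a transformation, chain
    // another map that calls the function and returns its result.
    val results = ids.map(commonFunction)

    results.collect().foreach(println)
    spark.stop()
  }
}
```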

Re: Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Sunitha Chennareddy
Hi Deepak, I am able to map a row to the Person class; the issue is that I want to call another method. I tried converting to a list, and it is not working without using collect. Regards Sunitha On Tuesday, December 19, 2017, Deepak Sharma wrote: > I am not sure about java but in scala

Re: What does Blockchain technology mean for Big Data? And how will Hadoop/Spark play a role in it?

2017-12-18 Thread KhajaAsmath Mohammed
I am looking for the same answer too... I will wait for responses from other people. Sent from my iPhone > On Dec 18, 2017, at 10:56 PM, Gaurav1809 wrote: > > Hi All, > > Will Big Data tools & technology work with Blockchain in the future? Any possible > use cases that anyone is

What does Blockchain technology mean for Big Data? And how will Hadoop/Spark play a role in it?

2017-12-18 Thread Gaurav1809
Hi All, Will Big Data tools & technology work with Blockchain in the future? If there are any possible use cases that anyone is likely to face, please share. Thanks Gaurav -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Deepak Sharma
I am not sure about Java but in Scala it would be something like df.rdd.map { x => MyClass(x.getString(0), ...) } HTH --Deepak On Dec 19, 2017 09:25, "Sunitha Chennareddy" wrote: Hi All, I am new to Spark, I want to convert a DataFrame to a List without using
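A minimal, self-contained version of the pattern Deepak describes, mapping each Row of a DataFrame into a case class; the `MyClass` fields and column names are illustrative assumptions, not the actual schema from the thread.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative target class; the real one would mirror the DataFrame's schema.
case class MyClass(name: String, age: Int)

object RowToCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-to-case-class").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // Map each Row to the case class by position, as in the snippet above.
    val mapped = df.rdd.map { x => MyClass(x.getString(0), x.getInt(1)) }

    // Further processing (for example, calling another function per element)
    // can be chained as additional transformations instead of collecting.
    mapped.collect().foreach(println)
    spark.stop()
  }
}
```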

Help Required on Spark - Convert DataFrame to List without using collect

2017-12-18 Thread Sunitha Chennareddy
Hi All, I am new to Spark, and I want to convert a DataFrame to a List without using collect(). The main requirement is that I need to iterate through the rows of the DataFrame and call another function, passing it a column value of each row (person.getId()). Here is the snippet I have tried; kindly help me to resolve

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-18 Thread Weichen Xu
Hi Marco, If you add the assembler at the start of the pipeline, like: ``` val pipeline = new Pipeline() .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter)) ``` which error do you get? I think it should work fine if the `assembler` is added to the pipeline. Thanks. On
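A minimal sketch of such a pipeline with the assembler as the first stage, following the standard Spark ML decision-tree example; the toy data, column names, and `maxCategories` value are assumptions for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dt-pipeline").master("local[*]").getOrCreate()

// Assumption: raw numeric columns "f1", "f2" and a string "label" column.
val data = spark.createDataFrame(Seq(
  (1.0, 0.5, "yes"),
  (0.0, 1.5, "no"),
  (1.0, 1.0, "yes"),
  (0.0, 0.5, "no")
)).toDF("f1", "f2", "label")

// 1. Assemble the raw columns into a single vector column first.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("assembledFeatures")

// 2. Index the string labels (pre-fit so the converter below can reuse its
//    labels) and index categorical features inside the assembled vector.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

val featureIndexer = new VectorIndexer()
  .setInputCol("assembledFeatures")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)

// 3. The tree itself, plus a converter mapping predictions back to string labels.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Because the assembler is the first stage, every later stage sees the
// assembled features column, which is the point made in the reply above.
val pipeline = new Pipeline()
  .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

val model = pipeline.fit(data)
model.transform(data).select("label", "predictedLabel").show()
```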

/tmp fills up to 100GB when using a window function

2017-12-18 Thread Mihai Iacob
This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes. Is anything wrong with the code below? Or are there any known issues with Spark not cleaning up /tmp files? window = Window.\ partitionBy('***', 'date_str').\

Mapping words to vector sparkml CountVectorizerModel

2017-12-18 Thread Sandeep Nemuri
Hi All, I've used CountVectorizerModel in Spark ML and got the tf-idf of the words. The output column of the DataFrame looks like:
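A minimal sketch of mapping vector indices back to words via the fitted model's `vocabulary` array; the toy data and column names are assumptions.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cv-vocab").master("local[*]").getOrCreate()
import spark.implicits._

// Toy input: each row is a bag of tokens.
val df = Seq(
  Array("spark", "kafka", "spark"),
  Array("hadoop", "spark")
).toDF("words")

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)

// vocabulary(i) is the term stored at index i of every output vector, so the
// indices of the sparse vectors can be translated back into words.
val vocab: Array[String] = cvModel.vocabulary

cvModel.transform(df).select("features").collect().foreach { row =>
  val vec = row.getAs[Vector]("features")
  vec.foreachActive { (idx, value) => print(s"${vocab(idx)} -> $value  ") }
  println()
}
```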

Re: Several Aggregations on a window function

2017-12-18 Thread Anastasios Zouzias
Hi, You can use https://twitter.github.io/algebird/, which provides implementations of interesting Monoids and ways to combine them into tuples (or products) of Monoids. Of course, you are not bound to use the algebird library, but it might help you bootstrap. On Mon, Dec 18, 2017 at 7:18
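A minimal sketch of the product-of-monoids idea, assuming algebird is on the classpath: the two components of the accumulator (a count and a sum) are each combined with their own monoid, so one pass produces both results. Algebird can also derive such tuple monoids implicitly; the explicit componentwise version below is just for clarity.

```scala
import com.twitter.algebird.Monoid

object ProductOfMonoids {
  // Combine two accumulators componentwise: each component uses its own monoid.
  def plus(a: (Long, Double), b: (Long, Double)): (Long, Double) =
    (Monoid.plus(a._1, b._1), Monoid.plus(a._2, b._2))

  def main(args: Array[String]): Unit = {
    val data = Seq(1.5, 2.5, 3.0)
    // A single pass computes (count, sum) together.
    val (count, sum) = data.map(x => (1L, x)).reduce(plus)
    println(s"count=$count sum=$sum")
  }
}
```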

Re: Several Aggregations on a window function

2017-12-18 Thread Julien CHAMP
It seems interesting; however, scalding seems to require being used outside of Spark? On Mon, Dec 18, 2017 at 17:15, Anastasios Zouzias wrote: > Hi Julien, > > I am not sure if my answer applies to the streaming part of your question. > However, in batch processing, if you

Re: kinesis throughput problems

2017-12-18 Thread Jeremy Kelley
Gourav, Yes, sorry. Apparently I failed to mention that I'm having these problems with Spark consuming from a Kinesis stream. I've been putting in late nights to figure this out and it's affecting my brain. :^) -jeremy -- Jeremy Kelley | Technical Director, Data jkel...@carbonblack.com | Carbon

Getting multiple regression metrics at once

2017-12-18 Thread OBones
Hello, I'm working with the ML package for regression purposes and I get good results on my data. I'm now trying to get multiple metrics at once; right now I'm doing what is suggested by the examples here: https://spark.apache.org/docs/2.1.0/ml-classification-regression.html Basically
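As a hedged sketch of one way to get several metrics from a single object (the "prediction"/"label" column names are the usual ones from the linked examples): the RDD-based `RegressionMetrics` class exposes RMSE, MSE, MAE, R2 and explained variance together, instead of re-running an evaluator once per metric.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.DataFrame

// predictions: a DataFrame with "prediction" and "label" columns, e.g. the
// output of model.transform(testData) from the linked examples.
def allMetrics(predictions: DataFrame): Unit = {
  val predictionAndLabel = predictions
    .select("prediction", "label")
    .rdd
    .map(row => (row.getDouble(0), row.getDouble(1)))

  val metrics = new RegressionMetrics(predictionAndLabel)
  println(s"RMSE               = ${metrics.rootMeanSquaredError}")
  println(s"MSE                = ${metrics.meanSquaredError}")
  println(s"MAE                = ${metrics.meanAbsoluteError}")
  println(s"R2                 = ${metrics.r2}")
  println(s"Explained variance = ${metrics.explainedVariance}")
}
```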

Re: Several Aggregations on a window function

2017-12-18 Thread Anastasios Zouzias
Hi Julien, I am not sure if my answer applies to the streaming part of your question. However, in batch processing, if you want to perform multiple aggregations over an RDD in a single pass, a common approach is to use multiple aggregators (a.k.a. tuple monoids); see below an example from
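The referenced example is truncated in the archive; the following is a minimal sketch of the same idea in plain Spark, accumulating a tuple of (sum, count, max) in a single pass over an RDD, where each tuple component behaves like its own monoid.

```scala
import org.apache.spark.sql.SparkSession

object SinglePassAggregations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("single-pass-agg").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0))

    // The accumulator is (sum, count, max); the zero element and both combine
    // functions treat each component independently, i.e. a product of monoids.
    val (sum, count, max) = rdd.aggregate((0.0, 0L, Double.NegativeInfinity))(
      (acc, x) => (acc._1 + x, acc._2 + 1L, math.max(acc._3, x)),
      (a, b) => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))
    )

    println(s"sum=$sum count=$count max=$max")
    spark.stop()
  }
}
```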

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
Thanks, Timur. The problem is that if I run `foreachPartition`, then I cannot `start` the streaming query. Or perhaps I am missing something. From: Timur Shenkao Sent: 18 December 2017 16:11:06 To: Liana Napalkova Cc: Silvio Fiorito;

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Cody Koeninger
You can't create a network connection to Kafka on the driver and then serialize it to send it to the executor. That's likely why you're getting serialization errors. Kafka producers are thread-safe and designed for use as a singleton. Use a lazy singleton instance of the producer on the executor,
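A minimal sketch of the lazy-singleton pattern Cody describes, assuming a String key/value producer and a hypothetical broker address: the producer is constructed lazily inside an object, so it is created once per executor JVM on first use rather than being serialized from the driver.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaSink {
  // lazy val: created once per JVM (i.e. per executor) on first access.
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // assumption: broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

// Usage inside foreachPartition: the object is initialized on the executor,
// so no producer instance ever crosses the driver/executor boundary.
// rdd.foreachPartition { rows =>
//   rows.foreach(r => KafkaSink.producer.send(new ProducerRecord("topic", r.toString)))
// }
```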

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
If there is no other way, then I will follow this recommendation. From: Silvio Fiorito Sent: 18 December 2017 16:20:03 To: Liana Napalkova; user@spark.apache.org Subject: Re: How to properly execute `foreachPartition` in Spark 2.2

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Silvio Fiorito
Couldn’t you readStream from Kafka, do your transformations, map the rows of the transformed input into what you need to send to Kafka, and then writeStream to Kafka? From: Liana Napalkova Date: Monday, December 18, 2017 at 10:07 AM To: Silvio Fiorito
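A minimal sketch of the read-transform-write flow Silvio suggests, using the Structured Streaming Kafka source and sink; the broker address, topic names, transformation, and checkpoint path are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("kafka-to-kafka").master("local[*]").getOrCreate()
import spark.implicits._

// Read from Kafka as a streaming DataFrame.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "input-topic")               // assumption
  .load()

// Transform: here just upper-case the value; replace with the real logic.
val transformed = input
  .selectExpr("CAST(value AS STRING) AS value")
  .select(upper($"value").as("value"))

// Write the transformed rows back to Kafka.
val query = transformed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")           // assumption
  .option("topic", "output-topic")                            // assumption
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") // assumption
  .start()

query.awaitTermination()
```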

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Timur Shenkao
Spark Dataset / DataFrame has foreachPartition() as well. Its implementation is much more efficient than the RDD one. There are tons of code snippets, say
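A minimal sketch of calling foreachPartition directly on a DataFrame, with no detour through .rdd; passing a named function value keeps the call unambiguous across Spark/Scala versions.

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("ds-foreach-partition").master("local[*]").getOrCreate()
val df = spark.range(0, 10).toDF("id")

// Per-partition resources (connections, producers, ...) can be opened once in
// this function and reused for every row of the partition.
val handlePartition: Iterator[Row] => Unit = rows =>
  rows.foreach(row => println(row.getLong(0)))

df.foreachPartition(handlePartition)
```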

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
I need to first read from a Kafka queue into a DataFrame. Then I should perform some transformations on the data. Finally, for each row in the DataFrame I should conditionally apply a KafkaProducer in order to send some data to Kafka. So, I am both consuming and producing the data from/to

Re: SANSA 0.3 (Scalable Semantic Analytics Stack) Released

2017-12-18 Thread Hajira Jabeen
Hi Timur, Thanks for your interest in SANSA. The intermediate results are mostly stored in RDDs (sometimes in Parquet files). The use of GraphFrames is planned, as they are not yet officially integrated with Spark. Please feel free to contact us in case of any questions. Regards Hajira

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Silvio Fiorito
Why don’t you just use the Kafka sink for Spark 2.2? https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html#creating-a-kafka-sink-for-streaming-queries From: Liana Napalkova Date: Monday, December 18, 2017 at 9:45 AM To:

How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
Hi, I wonder how to properly execute `foreachPartition` in Spark 2.2. Below I explain the problem in detail. I appreciate any help. In Spark 1.6 I was doing something similar to this: DstreamFromKafka.foreachRDD(session => { session.foreachPartition { partitionOfRecords =>

Re: flatMap() returning large class

2017-12-18 Thread Richard Garris
Hi Don, It’s not so much map() vs flatMap(). You can return a collection and have Spark flatten the result. My point was more to change from Seq[BigDataStructure] to Seq[SmallDataStructure]. If the use case is really storing image data, I would try to use Seq[Vector] and store the values as a
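A minimal sketch of "return a collection and have Spark flatten the result", with hypothetical placeholder case classes standing in for the large and small structures discussed here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical placeholders for the structures discussed in the thread.
case class BigRecord(id: Long, pixels: Array[Double])
case class SmallRecord(id: Long, value: Double)

object FlatMapSmall {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flatmap-small").master("local[*]").getOrCreate()

    val big = spark.sparkContext.parallelize(Seq(
      BigRecord(1L, Array(0.1, 0.2)),
      BigRecord(2L, Array(0.3))
    ))

    // flatMap returns a collection of small records per input element, and
    // Spark flattens the nested collections into a single RDD of SmallRecord.
    val small = big.flatMap { r => r.pixels.map(p => SmallRecord(r.id, p)) }

    small.collect().foreach(println)
    spark.stop()
  }
}
```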

Re: Windows10 + pyspark + ipython + csv file loading with timestamps

2017-12-18 Thread Szuromi Tamás
Hi Esa, I'm using it like this: https://gist.github.com/tromika/1cda392242fdd66befe7970d80380216 cheers, 2017-12-16 11:04 GMT+01:00 Esa Heikkinen : > Hi > > Does anyone have any hints or example code for how to get the combination: > Windows10 + pyspark + ipython notebook +

Spark - Livy - Hive Table User

2017-12-18 Thread Sudha KS
The tables created are owned by the user livy: drwxrwxrwx+ - livy hdfs 0 2017-12-18 09:38 /apps/hive/warehouse/dev.db/tbl 1. Df.write.saveAsTable() 2. spark.sql("CREATE TABLE tbl (key INT, value STRING)") whereas a table created from the Hive shell is owned by 'hive'/the proxy user:

Re: SANSA 0.3 (Scalable Semantic Analytics Stack) Released

2017-12-18 Thread Timur Shenkao
Hello, Thank you for the very interesting work! The questions are: 1) Where do you store final or intermediate results? Parquet, JanusGraph, Cassandra? 2) Is there integration with Spark GraphFrames? Sincerely yours, Timur On Mon, Dec 18, 2017 at 9:21 AM, Hajira Jabeen

Re: Several Aggregations on a window function

2017-12-18 Thread Julien CHAMP
I've been looking at several solutions, but I can't find an efficient way to compute many window functions (optimized computation or efficient parallelism). Am I the only one interested in this? Regards, Julien On Fri, Dec 15, 2017 at 21:34, Julien CHAMP

SANSA 0.3 (Scalable Semantic Analytics Stack) Released

2017-12-18 Thread Hajira Jabeen
Dear all, The Smart Data Analytics group [1] is happy to announce SANSA 0.3 - the third release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large