Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
Apologies, I was not aware that Spark 2.0 has Kafka consumer caching/pooling now. What I had checked was the latest Kafka consumer client, and I believe it is still of beta quality. https://kafka.apache.org/documentation.html#newconsumerconfigs > Since 0.9.0.0 we have been working on a replacement for

Re: Apply ML to grouped dataframe

2016-08-23 Thread Wen Pei Yu
Thank you Ayan. For example, I have the dataframe below. Consider column "group" as the key to split this dataframe into three parts; I then want to run k-means on each part, to get each group's k-means result. +---+-++ | userID|group|features|
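
A minimal sketch of the filter-per-group approach discussed in this thread, assuming Spark 2.0's spark.ml API and the column names from the example schema ("group", "features"); df is assumed to be the dataframe above:

import org.apache.spark.ml.clustering.KMeans

// Collect the distinct group keys (assumed to be few), then fit one model per group.
val groups = df.select("group").distinct().collect().map(_.get(0))
val modelsPerGroup = groups.map { g =>
  val part = df.filter(df("group") === g)                        // one group's rows
  val kmeans = new KMeans().setK(3).setFeaturesCol("features")   // k = 3 is illustrative
  (g, kmeans.fit(part))
}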

Re: Spark 2.0 with Kafka 0.10 exception

2016-08-23 Thread Cody Koeninger
You can set that poll timeout higher with spark.streaming.kafka.consumer.poll.ms but half a second is fairly generous. I'd try to take a look at what's going on with your network or kafka broker during that time. On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote: > Hello,
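
For reference, a sketch of where that setting goes when building the streaming context (the 10-second value is only illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-0-10-test")
  // raise the executor-side Kafka poll timeout (milliseconds; 10000 is illustrative)
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")
val ssc = new StreamingContext(conf, Seconds(5))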

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Cody Koeninger
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses kafka consumer instances on the executors? On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote: > Hi, > > I have released the first version of a new Kafka integration with Spark > that we use in the

Re: How Spark HA works

2016-08-23 Thread Mohit Jaggi
What did you mean by "link"? An HTTP URL to the Spark monitoring UI? AFAIK, it is not directly supported. I typically go to both masters and check which one is active :-) Did you check whether the failover actually happened in other ways (I don't know what the znode should say)? You can try

Re: Using spark to distribute jobs to standalone servers

2016-08-23 Thread Mohit Jaggi
It is a bit hacky but possible. A lot depends on what kind of queries etc you want to run. You could write a data source that reads your data and keeps it partitioned the way you want, then use mapPartitions() to execute your code… Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com
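
A rough sketch of the mapPartitions() idea described above; processLocally is a hypothetical function standing in for the existing per-server code, and dataRdd for the custom data source:

// Run the existing single-server logic once per partition, on the executor
// that holds that partition's data.
val results = dataRdd.mapPartitions { iter =>
  val localOutput = processLocally(iter.toSeq)   // hypothetical user function returning a Seq
  localOutput.iterator
}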

Re: Spark with Parquet

2016-08-23 Thread Mohit Jaggi
something like this should work: val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the guessed schema is not accurate df.write.parquet("myfile.parquet") Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Apr 27, 2014, at 11:41 PM, Sai

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
@RK yeah, I am thinking perhaps it is a better question for the @dev group. But from the files that I pointed out, given the code and the comments in those files, I would be more inclined to think that it is actually storing byte code. On Tue, Aug 23, 2016 4:37 PM, RK Aduri

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
Can you come up with your complete analysis? A snapshot of what you think the code is doing. Maybe that would help us understand what exactly you were trying to convey. > On Aug 23, 2016, at 4:21 PM, kant kodali wrote: > > >

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
apache/spark (github.com) On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote: @RK you may want to look more deeply if you are curious. the code starts from here apache/spark (github.com) and it goes here where it is

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
@RK you may want to look more deeply if you are curious. The code starts from here: apache/spark (github.com), and it goes here, where it is trying to save the Python code object (which is byte code): apache/spark (github.com) On Tue,

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is also somewhat of a marketing term, because every system will perform some kind of batching. For example, every time you use TCP, the OS and network stack may buffer multiple messages together and send them at once;

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying. On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal wrote: > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ > and it mentioned that spark streaming actually mini-batch not actual > streaming. > > I have not used streaming

DataFrame Data Manipulation - Based on a timestamp column Not Working

2016-08-23 Thread Subhajit Purkayastha
Using Spark 2.0 & Scala 2.11.8, I have a DataFrame with a timestamp column:
root
 |-- ORG_ID: integer (nullable = true)
 |-- HEADER_ID: integer (nullable = true)
 |-- ORDER_NUMBER: integer (nullable = true)
 |-- LINE_ID: integer (nullable = true)
 |-- LINE_NUMBER: integer (nullable = true)

How do we process/scale variable size batches in Apache Spark Streaming

2016-08-23 Thread Rachana Srivastava
I am running a Spark streaming process where I am getting a batch of data every n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we are getting lots of small files when the batch size is very small. Is there any way I can change the partitioner logic
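
One common workaround (a sketch, not a custom partitioner) is to size the partition count per batch inside foreachRDD; the records-per-partition target and output path below are illustrative:

stream.foreachRDD { rdd =>
  val targetRecordsPerPartition = 10000L                           // illustrative tuning value
  val numPartitions = math.max(1, (rdd.count() / targetRecordsPerPartition).toInt)
  rdd.coalesce(numPartitions)                                      // fewer, larger files for small batches
    .saveAsTextFile(s"/tmp/out/batch-${System.currentTimeMillis()}")  // placeholder path
}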

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
I just had a glance. AFAIK, that is nothing to do with RDDs. It's a pickler used to serialize and deserialize the Python code. > On Aug 23, 2016, at 2:23 PM, kant kodali wrote: > > @Sean > > well this makes sense but I wonder what the following source code is doing? > > >

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
@Sean well this makes sense, but I wonder what the following source code is doing? apache/spark (github.com) This code looks like it is trying to store some byte code somewhere (whether it's memory or disk), but why even go down this path of creating code objects, so it

Re: Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
Thanks Nick, Sean and everyone. That did it. BTW, I registered the UDF for later use in a program. Anyway, this is the much-simplified code: import scala.util.Random // // UDF to create a random string of length characters // def randomString(chars: String, length: Int): String = (0 until
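
For completeness, a hedged sketch of what such a registration might look like in Spark 2.0; the function body is one possible completion of the truncated definition above, not necessarily the original, and spark is the SparkSession:

import scala.util.Random

// Build a random string by picking `length` characters out of `chars`.
def randomString(chars: String, length: Int): String =
  (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString

// Register for use from SQL, e.g. spark.sql("SELECT randomString('abcdefghij', 10)")
spark.udf.register("randomString", randomString _)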

spark-jdbc impala with kerberos using yarn-client

2016-08-23 Thread twisterius
I am trying to use the spark-jdbc package to access an Impala table via a Spark data frame. From my understanding (https://issues.apache.org/jira/browse/SPARK-12312), when loading DataFrames from a JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to

Spark 2.0 with Kafka 0.10 exception

2016-08-23 Thread Srikanth
Hello, I'm getting the below exception when testing Spark 2.0 with Kafka 0.10. 16/08/23 16:31:01 INFO AppInfoParser: Kafka version : 0.10.0.0 > 16/08/23 16:31:01 INFO AppInfoParser: Kafka commitId : b8642491e78c5a13 > 16/08/23 16:31:01 INFO CachedKafkaConsumer: Initial fetch for >

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread Sean Owen
We're probably mixing up some semantics here. An RDD is indeed, really, just some bookkeeping that records how a certain result is computed. It is not the data itself. However we often talk about "persisting an RDD" which means "persisting the result of computing the RDD" in which case that
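
A small illustration of that distinction (a sketch; the path and storage level are placeholders, and sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// The RDD itself is just the recipe: read the file, map each line to its length.
val lineLengths = sc.textFile("hdfs:///some/path").map(_.length)

// "Persisting the RDD" really means persisting the result of computing it.
lineLengths.persist(StorageLevel.MEMORY_AND_DISK)
lineLengths.count()   // the first action materializes the data and caches it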

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
Also, if I were to believe that it stores data, then why do RDDs need to be recomputed in the case of node failure, since the data has already been saved to disk (according to you) after applying the transformation? It could simply bring back those data blocks, right? There is really no need to

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
@srikanth are you sure? The whole point of RDDs is to store transformations but not the data, as the Spark paper points out, though I lack the practical experience to confirm this. When I looked at the Spark source code (specifically the checkpoint code) a while ago, it was clearly storing some

Re: Breaking down text String into Array elements

2016-08-23 Thread RK Aduri
That’s because of this: scala> val text = Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr")) text: Array[(Int, String)] = Array((1,hNjLJEgjxn), (2,lgryHkVlCN),

RE: Are RDD's ever persisted to disk?

2016-08-23 Thread srikanth.jella
An RDD contains data, not JVM byte code, i.e. data which has been read from the source and had transformations applied. This is the ideal case to persist RDDs. As Nirav mentioned, this data will be serialized before persisting to disk. Thanks, Sreekanth Jella From: kant kodali

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
Storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of Java or Scala). Am I correct? On Tue, Aug 23, 2016 12:55 PM, Nirav nira...@gmail.com wrote: You can store either in serialized form (byte array) or just save it in a string format like tsv or csv. There are

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps to separate objects did not help. However, I was able to work around the problem by reimplementing it with RDD joins. On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote: > This might be caused by a few large Map objects that Spark is trying to >

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread Nirav
You can store either in serialized form (byte array) or just save it in a string format like tsv or csv. There are different RDD save APIs for that. Sent from my iPhone > On Aug 23, 2016, at 12:26 PM, kant kodali wrote: > > > ok now that I understand RDD can be stored to
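
For reference, a sketch of the two flavours mentioned (paths are placeholders; the csv line assumes an RDD of pairs):

rdd.saveAsObjectFile("hdfs:///out/as-objects")                          // serialized (byte) form
rdd.map(r => s"${r._1},${r._2}").saveAsTextFile("hdfs:///out/as-csv")   // string format such as csv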

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread RK Aduri
On another note, if you have a streaming app, you checkpoint the RDDs so that they can be accessed in case of a failure. And yes, RDDs are persisted to DISK. You can access Spark's UI and see it listed under the Storage tab. If RDDs are persisted in memory, you avoid any disk I/O, so that any

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
> How about something like > > scala> val text = (1 to 10).map(i => (i.toString, > random_string(chars.mkString(""), 10))).toArray > > text: Array[(String, String)] = Array((1,FBECDoOoAC), (2,wvAyZsMZnt), > (3,KgnwObOFEG), (4,tAZPRodrgP), (5,uSgrqyZGuc), (6,ztrTmbkOhO), > (7,qUbQsKtZWq),

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
OK, now that I understand an RDD can be stored to disk, my last question on this topic would be this: storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of Java or Scala). Am I correct? On Tue, Aug 23, 2016 12:19 PM, RK Aduri rkad...@collectivei.com wrote: On an other

Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
Hi, I have released the first version of a new Kafka integration with Spark that we use in the company I work for: open sourced and named Maelstrom. It is unique compared to other solutions out there, as it reuses the Kafka consumer connection to achieve sub-millisecond latency. This library

RE: Are RDD's ever persisted to disk?

2016-08-23 Thread srikanth.jella
RAM or virtual memory is finite, so data size needs to be considered before persisting. Please see the documentation below on when to choose the persistence level. http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose Thanks, Sreekanth Jella From: kant kodali

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
So when do we ever need to persist an RDD on disk, given that we don't need to worry about RAM (memory), as virtual memory will just push pages to disk when memory becomes scarce? On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote: Hi Kant Kodali, Based on the input parameter to

RE: Are RDD's ever persisted to disk?

2016-08-23 Thread srikanth.jella
Hi Kant Kodali, Based on the input parameter to the persist() method, it will either be cached in memory or persisted to disk. In case of failures, Spark will reconstruct the RDD on a different executor based on the DAG. That is how failures are handled. Spark Core does not replicate the RDDs, as

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Cody Koeninger
See https://github.com/koeninger/kafka-exactly-once On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed" wrote: > Hi Experts, > > I am looking for some information on how to acheive zero data loss while > working with kafka and Spark. I have searched online and blogs have >

Re: Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
Hi gents, Well, I was trying to see whether I can create an array of elements, go from RDD to DF, register it as a TempTable and store it as a Hive table. import scala.util.Random // // UDF to create a random string of charlength characters // def random_string(chars: String, charlength: Int): String =

Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
I am new to Spark and I keep hearing that RDDs can be persisted to memory or disk after each checkpoint. I wonder why RDDs are persisted in memory. In case of node failure, how would you access memory to reconstruct the RDD? Persisting to disk makes sense because it's like persisting to a Network

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
what is "text"? i.e. what is the "val text = ..." definition? If text is a String itself then indeed sc.parallelize(Array(text)) is doing the correct thing in this case. On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh wrote: > I am sure someone know this :) > > Created

Breaking down text String into Array elements

2016-08-23 Thread Mich Talebzadeh
I am sure someone knows this :) I created a dynamic text string which has this format: scala> println(text) (1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr") Now if I do scala> val

Re: Combining multiple models in Spark-ML 2.0

2016-08-23 Thread janardhan shetty
Any methods to achieve this? On Aug 22, 2016 3:40 PM, "janardhan shetty" wrote: > Hi, > > Are there any pointers, links on stacking multiple models in spark > dataframes? What strategies can be employed if we need to combine greater > than 2 models? >

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Mich Talebzadeh
Russell is correct here. Micro-batch means it does processing within a window. In general, there are three things here: the batch window. This is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set
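
A sketch of how those intervals appear in code (values are only illustrative; sc is an existing SparkContext and stream stands for some DStream):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))           // batch interval: a micro-batch every 2 seconds
val windowed = stream.window(Seconds(30), Seconds(10))   // window length 30s, sliding interval 10s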

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Russell Spitzer
Spark Streaming does not process one event at a time, which is in general, I think, what people call "streaming." It instead processes groups of events. Each group is a "micro-batch" that gets processed at the same time. Streaming theoretically always has better latency because the event is processed

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread pandees waran
It's based on "micro batching" model. Sent from my iPhone > On Aug 23, 2016, at 8:41 AM, Aseem Bansal wrote: > > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and > it mentioned that spark streaming actually mini-batch not actual streaming. >

Re: UDTRegistration (in Java)

2016-08-23 Thread raghukiran
Well, it is perplexing - as I am able to simply call UDTRegistration from Java. And maybe it is not working properly? I was able to put in a class/String through the register function. And when I call exists(..) it returns true. So, it appears to work, but has issues :-) Regards, Raghu -- View

Re: Zero Data Loss in Spark with Kafka

2016-08-23 Thread Sudhir Babu Pothineni
Saving offsets to ZooKeeper is the old approach; checkpointing internally saves the offsets to HDFS (the checkpoint location). More details here: http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Tue, Aug 23, 2016 at 10:30 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com>
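
A sketch of the checkpoint-based recovery pattern the linked page describes (the directory, batch interval, and sparkConf are placeholders):

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/my-app")   // offsets and metadata are written here
  // ... build the Kafka stream and the processing graph here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-app", createContext _)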

Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and it mentioned that Spark Streaming is actually mini-batch, not actual streaming. I have not used streaming and I am not sure what the difference between the two terms is, hence I could not make a judgement myself.

Zero Data Loss in Spark with Kafka

2016-08-23 Thread KhajaAsmath Mohammed
Hi Experts, I am looking for some information on how to achieve zero data loss while working with Kafka and Spark. I have searched online and blogs have different answers. Please let me know if anyone has an idea on this. Blog 1:

Things to do learn Cassandra in Apache Spark Environment

2016-08-23 Thread Gokula Krishnan D
Hello All - Hope you are doing well. I have a general question. I am working on Hadoop using Apache Spark. At this moment, we are not using Cassandra, but I would like to know what the scope of learning and using it in the Hadoop environment is. It would be great if you could provide the use

Re: Can one create a dynamic Array and convert it to DF

2016-08-23 Thread Mich Talebzadeh
OK, I sorted out the basic problem. I can create a text string dynamically with 2 columns and numbered rows scala> println(text) (1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")

Retrying: Using spark to distribute jobs to standalone servers

2016-08-23 Thread Larry White
(apologies if this appears twice. I sent it 24 hours ago and it hasn't hit the list yet) Hi, I have a bit of an unusual use-case and would greatly appreciate some feedback from experienced Sparklers as to whether it is a good fit for spark. I have a network of compute/data servers configured as

question about Broadcast value NullPointerException

2016-08-23 Thread Chong Zhang
Hello, I'm using Spark Streaming to process Kafka messages, and want to use a properties file as the input and broadcast the properties: val props = new Properties() props.load(new FileInputStream(args(0))) val sc = initSparkContext() val propsBC = sc.broadcast(props) println(s"propFileBC 1: " +
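
A sketch of how such a broadcast is typically read back on the executors (stream stands for some DStream and the property key is illustrative); whether this addresses the NullPointerException depends on where it is thrown:

val tagged = stream.map { record =>
  // Access the broadcast value on the executor via .value, not the driver-side props object.
  val threshold = propsBC.value.getProperty("threshold", "0")   // illustrative key
  (record, threshold)
}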

RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-23 Thread Cinquegrana, Piero
The output from score() is very small, just a float. The input, however, could be as big as several hundred MBs. I would like to broadcast the dataset to all executors. Thanks, Piero From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Monday, August 22, 2016 10:48 PM To: Cinquegrana,

Can one create a dynamic Array and convert it to DF

2016-08-23 Thread Mich Talebzadeh
Hi, I can easily do this in shell but wanted to see what I can do in Spark. I am trying to create a simple table (10 rows, 2 columns) for now and then register it as a tempTable and store it in Hive, if that is feasible. The first column col1 is a monotonically incrementing integer and the second column a
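
A hedged sketch of one way to build such a two-column DataFrame directly in Spark 2.0, skipping the intermediate text string (the random-string helper and names are illustrative; spark is the SparkSession):

import scala.util.Random
import spark.implicits._

def randString(n: Int): String = Random.alphanumeric.take(n).mkString   // illustrative helper

// col1: incrementing integer, col2: random 10-character string
val df = (1 to 10).map(i => (i, randString(10))).toDF("col1", "col2")
df.createOrReplaceTempView("tmp")
// spark.sql("SELECT * FROM tmp").show()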

Re:Do we still need to use Kryo serializer in Spark 1.6.2 ?

2016-08-23 Thread prosp4300
The way to use the Kryo serializer is similar to Scala, like below; the only difference is the lack of the convenient method "conf.registerKryoClasses", but it should be easy to make one yourself. conf = SparkConf() conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

Re:Log rollover in spark streaming jobs

2016-08-23 Thread prosp4300
Spark on YARN by default supports customized log4j configuration; a RollingFileAppender could be used to avoid disk overflow, as documented below. If you need a reference to the proper location to put log files in YARN so that YARN can properly display and aggregate them, use
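
A sketch of a rolling-appender log4j.properties along those lines (sizes and counts are illustrative; the container log directory property is the one the Spark-on-YARN docs refer to):

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=${spark.yarn.app.container.log.dir}/spark.log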

Log rollover in spark streaming jobs

2016-08-23 Thread Pradeep
Hi All, I am running Java Spark Streaming jobs in yarn-client mode. Is there a way I can manage log rollover on the edge node? I have a 10-second batch and the log file volume is huge. Thanks, Pradeep

Re: Spark Streaming application failing with Token issue

2016-08-23 Thread Jacek Laskowski
Hi Steve, Could you share your opinion on whether the token gets renewed or not? Is the token going to expire after 7 days anyway? Why the change in the recent version for token renewal? See https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa Regards, Jacek

Re: Spark Streaming application failing with Token issue

2016-08-23 Thread Steve Loughran
On 21 Aug 2016, at 20:43, Mich Talebzadeh wrote: Hi Kamesh, The message you are getting after 7 days: PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)

Re: Apply ML to grouped dataframe

2016-08-23 Thread ayan guha
I would suggest you construct a toy problem and post it for a solution. At this moment it's a little unclear what your intentions are. Generally speaking, group by on a data frame creates another data frame, not multiple ones. On 23 Aug 2016 16:35, "Wen Pei Yu" wrote: > Hi

Re: Spark 2.0 - Join statement compile error

2016-08-23 Thread Mich Talebzadeh
What is --> s below before the text of sql? var sales_order_sql_stmt = s"""SELECT ORDER_NUMBER , INVENTORY_ITEM_ID, ORGANIZATION_ID, from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'-MM-dd'), '-MM-dd') AS schedule_date FROM sales_order_demand WHERE
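
For context: in Scala, the s prefix marks a string interpolator, so $variable (or ${expr}) references inside the string are substituted at runtime. A tiny illustration (names are made up):

val table = "sales_order_demand"
val stmt = s"""SELECT COUNT(*) FROM $table"""   // $table is replaced with the variable's value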

Re: Apply ML to grouped dataframe

2016-08-23 Thread Wen Pei Yu
Hi Nirmal, Filter works fine if I want to handle one of the grouped dataframes. But I have multiple grouped dataframes, and I wish I could apply the ML algorithm to all of them in one job, not in for loops. Wenpei. From: Nirmal Fernando To: Wen Pei Yu/China/IBM@IBMCN Cc: User

Re: Spark 2.0 - Join statement compile error

2016-08-23 Thread Deepak Sharma
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma wrote: > val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner") Ignore the last statement. It should look something