Apologies, I was not aware that Spark 2.0 has Kafka Consumer
caching/pooling now.
What I have checked is the latest Kafka Consumer, and I believe it is still
in beta quality.
https://kafka.apache.org/documentation.html#newconsumerconfigs
> Since 0.9.0.0 we have been working on a replacement for
Thank you Ayan.
For example, I have the dataframe below. Consider column "group" as the key
to split this dataframe into three parts; I then want to run k-means on each
part, to get each group's k-means result.
+-------+-----+--------+
| userID|group|features|
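As a rough sketch of one way to do this (assuming the dataframe above is df, that "features" is already an ML Vector column, that "group" has only a few distinct values, and that k=3 is just an illustrative choice):

import org.apache.spark.ml.clustering.KMeans

// Collect the distinct group keys to the driver, then fit one KMeans model per group.
val groups = df.select("group").distinct().collect().map(_.get(0))

val modelsByGroup = groups.map { g =>
  val part = df.filter(df("group") === g)                              // one group's rows
  val model = new KMeans().setK(3).setFeaturesCol("features").fit(part)
  g -> model                                                           // (group key, fitted model)
}.toMap

This still loops over the groups on the driver, so it only illustrates the filter-per-group idea, not a single-pass solution.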
You can set that poll timeout higher with
spark.streaming.kafka.consumer.poll.ms
but half a second is fairly generous. I'd try to take a look at
what's going on with your network or kafka broker during that time.
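For reference, a hypothetical way of raising it when building the conf (the 10000 ms value is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-010-stream")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")   // poll timeout in milliseconds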
On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote:
> Hello,
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses
kafka consumer instances on the executors?
On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote:
> Hi,
>
> I have released the first version of a new Kafka integration with Spark
> that we use in the
What did you mean by "link"? An HTTP URL to the Spark monitoring UI? AFAIK, it
is not directly supported. I typically go to both masters and check which one
is active :-)
Did you check whether the failover actually happened in other ways (I don't know
what the znode should say)? You can try
It is a bit hacky but possible. A lot depends on what kind of queries etc you
want to run. You could write a data source that reads your data and keeps it
partitioned the way you want, then use mapPartitions() to execute your code…
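A minimal, self-contained sketch of that mapPartitions() idea (the data, the partition count, and the runLocalQuery function are all made-up placeholders for "your code"):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object MapPartitionsSketch {
  // Placeholder for the per-partition work: here it just counts the rows it sees.
  def runLocalQuery(rows: Seq[(Int, String)]): Int = rows.size

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
    val keyed = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))

    // Keep the data partitioned the way you want, then run your code once per partition.
    val partitioned = keyed.partitionBy(new HashPartitioner(2))
    val results = partitioned.mapPartitions(iter => Iterator(runLocalQuery(iter.toSeq)))

    results.collect().foreach(println)
    sc.stop()
  }
}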
Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com
something like this should work…
val df = sparkSession.read.csv("myfile.csv") // you may have to provide a schema if the guessed schema is not accurate
df.write.parquet("myfile.parquet")
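If the guessed schema is not accurate, a sketch of supplying an explicit one (the column names and types here are made up, assuming the same sparkSession as above):

import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)))

val df2 = sparkSession.read.schema(schema).csv("myfile.csv")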
Mohit Jaggi
Founder,
Data Orchard LLC
www.dataorchardllc.com
> On Apr 27, 2014, at 11:41 PM, Sai
@RK yeah, I am thinking perhaps it is a better question for the @dev group, but
from the files that I pointed out, and the code and comments in those files, I
would be more inclined to think that it is actually storing byte code.
On Tue, Aug 23, 2016 4:37 PM, RK Aduri
Can you come up with your complete analysis? A snapshot of what you think the
code is doing. Maybe that would help us understand what exactly you are
trying to convey.
> On Aug 23, 2016, at 4:21 PM, kant kodali wrote:
>
>
>
apache/spark spark - Mirror of Apache Spark github.com
On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote:
@RK you may want to look more deeply if you are curious. the code starts from
here
apache/spark spark - Mirror of Apache Spark github.com
and it goes here, where it is trying to save the Python code object (which is
byte code)
apache/spark spark - Mirror of Apache Spark github.com
On Tue,
I think people explained this pretty well, but in practice, this distinction is
also somewhat of a marketing term, because every system will perform some kind
of batching. For example, every time you use TCP, the OS and network stack may
buffer multiple messages together and send them at once;
Thanks everyone for clarifying.
On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal wrote:
> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
> and it mentioned that spark streaming actually mini-batch not actual
> streaming.
>
> I have not used streaming
Using spark 2.0 & scala 2.11.8, I have a DataFrame with a timestamp column
root
|-- ORG_ID: integer (nullable = true)
|-- HEADER_ID: integer (nullable = true)
|-- ORDER_NUMBER: integer (nullable = true)
|-- LINE_ID: integer (nullable = true)
|-- LINE_NUMBER: integer (nullable = true)
I am running a Spark streaming process where I get a batch of data every n
seconds. I am using repartition to scale the application. Since the repartition
size is fixed, we get lots of small files when the batch size is very small.
Is there any way I can change the partitioner logic
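One hedged workaround, rather than a built-in feature: pick the number of partitions per batch from the batch size instead of a fixed value (dstream stands for the existing input stream, and the threshold and output path are placeholders):

val targetRecordsPerPartition = 100000L

dstream.foreachRDD { rdd =>
  val count = rdd.count()                                             // size of this batch
  val numPartitions = math.max(1, (count / targetRecordsPerPartition).toInt)
  rdd.repartition(numPartitions)
     .saveAsTextFile(s"/output/batch-${System.currentTimeMillis()}")  // fewer, larger files
}

Note this runs an extra count() job per batch, so it trades a little latency for fewer small files.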
I just had a glance. AFAIK, that has nothing to do with RDDs. It's a pickler used
to serialize and deserialize the Python code.
> On Aug 23, 2016, at 2:23 PM, kant kodali wrote:
>
> @Sean
>
> well this makes sense but I wonder what the following source code is doing?
>
>
>
@Sean
Well, this makes sense, but I wonder what the following source code is doing?
apache/spark spark - Mirror of Apache Spark github.com
This code looks like it is trying to store some byte code somewhere (whether in
memory or on disk), but why even go down this path of creating code objects so it
Thanks Nick, Sean and everyone. That did it.
BTW, I registered the UDF for later use in a program.
Anyway, this is the much simplified code:
import scala.util.Random
//
// UDF to create a random string of length characters
//
def randomString(chars: String, length: Int): String =
  (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
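For registering the UDF for later use, a sketch using spark.udf.register (assuming the Spark 2.0 shell's spark session; the alphabet string is arbitrary):

spark.udf.register("randomString", randomString _)
spark.sql("SELECT randomString('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 10) AS s").show(false)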
I am trying to use the spark-jdbc package to access an Impala table via a
Spark data frame. From my understanding
(https://issues.apache.org/jira/browse/SPARK-12312), when loading DataFrames
from a JDBC datasource with Kerberos authentication, remote executors
(yarn-client/cluster etc. modes) fail to
Hello,
I'm getting the below exception when testing Spark 2.0 with Kafka 0.10.
16/08/23 16:31:01 INFO AppInfoParser: Kafka version : 0.10.0.0
> 16/08/23 16:31:01 INFO AppInfoParser: Kafka commitId : b8642491e78c5a13
> 16/08/23 16:31:01 INFO CachedKafkaConsumer: Initial fetch for
>
We're probably mixing up some semantics here. An RDD is indeed,
really, just some bookkeeping that records how a certain result is
computed. It is not the data itself.
However we often talk about "persisting an RDD" which means
"persisting the result of computing the RDD" in which case that
Also, if I were to believe that it stores data, then why do RDDs need to be
recomputed in the case of node failure? Since the data has already been saved to
disk (according to you) after applying the transformation, it could simply
bring back those data blocks; there is really no need to
@srkanth are you sure? The whole point of RDDs is to store transformations but
not the data, as the Spark paper points out, though I do lack the practical
experience to confirm it. When I looked at the Spark source code
(specifically the checkpoint code) a while ago, it was clearly storing some
That’s because of this:
scala> val text =
Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr"))
text: Array[(Int, String)] = Array((1,hNjLJEgjxn), (2,lgryHkVlCN),
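With text defined that way (an Array[(Int, String)] rather than one big String), a sketch of getting to a DataFrame and temp view in the 2.0 shell:

import spark.implicits._

val df = sc.parallelize(text).toDF("id", "random_string")
df.createOrReplaceTempView("tmp")
spark.sql("SELECT * FROM tmp WHERE id <= 5").show(false)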
An RDD contains data, not JVM byte code, i.e. data which has been read from the
source and had transformations applied. This is the ideal case to persist RDDs. As
Nirav mentioned, this data will be serialized before being persisted to disk.
Thanks,
Sreekanth Jella
From: kant kodali
Storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of
Java or Scala). Am I correct?
On Tue, Aug 23, 2016 12:55 PM, Nirav nira...@gmail.com wrote:
You can store either in serialized form (byte array) or just save it in a
string format like tsv or csv. There are
Splitting up the Maps to separate objects did not help.
However, I was able to work around the problem by reimplementing it with
RDD joins.
On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote:
> This might be caused by a few large Map objects that Spark is trying to
>
You can store either in serialized form (byte array) or just save it in a
string format like tsv or csv. There are different RDD save APIs for that.
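A small sketch of the two flavours (the paths are placeholders):

val rdd = sc.parallelize(Seq((1, "a"), (2, "b")))

rdd.saveAsObjectFile("/tmp/rdd-serialized")                          // serialized (byte) form
rdd.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("/tmp/rdd-tsv")  // tsv-style text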
Sent from my iPhone
> On Aug 23, 2016, at 12:26 PM, kant kodali wrote:
>
>
> ok now that I understand RDD can be stored to
On another note, if you have a streaming app, you checkpoint the RDDs so that
they can be accessed in case of a failure. And yes, RDDs are persisted to DISK.
You can access Spark's UI and see it listed under the Storage tab.
If RDDs are persisted in memory, you avoid any disk I/O, so that any
> How about something like
>
> scala> val text = (1 to 10).map(i => (i.toString,
> random_string(chars.mkString(""), 10))).toArray
>
> text: Array[(String, String)] = Array((1,FBECDoOoAC), (2,wvAyZsMZnt),
> (3,KgnwObOFEG), (4,tAZPRodrgP), (5,uSgrqyZGuc), (6,ztrTmbkOhO),
> (7,qUbQsKtZWq),
OK, now that I understand an RDD can be stored to disk, my last question on this
topic would be this:
storing an RDD to disk is nothing but storing JVM byte code to disk (in the case of
Java or Scala). Am I correct?
On Tue, Aug 23, 2016 12:19 PM, RK Aduri rkad...@collectivei.com wrote:
On an other
Hi,
I have released the first version of a new Kafka integration with Spark
that we use in the company I work for: open sourced and named Maelstrom.
It is unique compared to other solutions out there, as it reuses the
Kafka Consumer connection to achieve sub-millisecond latency.
This library
RAM or virtual memory is finite, so data size needs to be considered before
persisting. Please see the documentation below on when to choose the persistence level.
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Thanks,
Sreekanth Jella
From: kant kodali
So when do we ever need to persist an RDD on disk, given that we don't need to
worry about RAM (memory), since virtual memory will just push pages to the disk when
memory becomes scarce?
On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote:
Hi Kant Kodali,
Based on the input parameter to
Hi Kant Kodali,
Based on the input parameter to the persist() method, it will either be cached in
memory or persisted to disk. In case of failures, Spark will reconstruct the RDD
on a different executor based on the DAG. That is how failures are handled.
Spark Core does not replicate the RDDs, as
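For illustration, a rough example of that persist() parameter (the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/data/input")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // alternatives: MEMORY_ONLY, DISK_ONLY, MEMORY_ONLY_SER, ...
rdd.count()                                 // the first action materializes and caches the RDD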
See
https://github.com/koeninger/kafka-exactly-once
On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed"
wrote:
> Hi Experts,
>
> I am looking for some information on how to acheive zero data loss while
> working with kafka and Spark. I have searched online and blogs have
>
Hi gents,
Well I was trying to see whether I can create an array of elements. From
RDD to DF, register as TempTable and store it as a Hive table
import scala.util.Random
//
// UDF to create a random string of charlength characters
//
def random_string(chars: String, charlength: Int) : String =
I am new to Spark and I keep hearing that RDDs can be persisted to memory or
disk after each checkpoint. I wonder why RDDs are persisted in memory. In case
of node failure, how would you access memory to reconstruct the RDD? Persisting
to disk makes sense because it's like persisting to a network
what is "text"? i.e. what is the "val text = ..." definition?
If text is a String itself then indeed sc.parallelize(Array(text)) is doing
the correct thing in this case.
On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh
wrote:
> I am sure someone know this :)
>
> Created
I am sure someone knows this :)
Created a dynamic text string which has this format:
scala> println(text)
(1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr")
now if I do
scala> val
Any methods to achieve this?
On Aug 22, 2016 3:40 PM, "janardhan shetty" wrote:
> Hi,
>
> Are there any pointers, links on stacking multiple models in spark
> dataframes ?. WHat strategies can be employed if we need to combine greater
> than 2 models ?
>
Russell is correct here.
Micro-batch means it does processing within a window. In general there are
three things here.
batch window
This is the basic interval at which the system will receive the data in
batches. This is the interval set when creating a StreamingContext. For
example, if you set
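As a rough illustration of setting that interval when building the context (the 10-second value is only an example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("batch-window-example")
val ssc  = new StreamingContext(conf, Seconds(10))   // batch window of 10 seconds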
Spark Streaming does not process one event at a time, which is, in general, what I
think people call "streaming." It instead processes groups of events.
Each group is a "micro-batch" that gets processed at the same time.
Streaming theoretically always has better latency because the event is
processed
It's based on a "micro-batching" model.
Sent from my iPhone
> On Aug 23, 2016, at 8:41 AM, Aseem Bansal wrote:
>
> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and
> it mentioned that spark streaming actually mini-batch not actual streaming.
>
Well, it is perplexing - as I am able to simply call UDTRegistration from
Java. And maybe it is not working properly? I was able to put in a
class/String through the register function. And when I call exists(..) it
returns true. So, it appears to work, but has issues :-)
Regards,
Raghu
Saving offsets to ZooKeeper is the old approach; checkpointing internally
saves the offsets to HDFS (or wherever the checkpoint location is).
more details here:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
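A hedged sketch of that checkpoint-based recovery (the HDFS path is a placeholder, and the Kafka stream setup is elided):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("kafka-checkpoint"), Seconds(5))
  ssc.checkpoint("hdfs:///user/spark/checkpoints/my-app")   // offsets and state are stored here
  // ... create the Kafka direct stream and output operations here ...
  ssc
}

// On restart, the context (including stored offsets) is recovered from the checkpoint if present.
val ssc = StreamingContext.getOrCreate("hdfs:///user/spark/checkpoints/my-app", createContext _)
ssc.start()
ssc.awaitTermination()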
On Tue, Aug 23, 2016 at 10:30 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com>
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
and it mentioned that Spark streaming is actually mini-batch, not actual
streaming.
I have not used streaming and I am not sure what the difference between the two
terms is, so I could not make a judgement myself.
Hi Experts,
I am looking for some information on how to achieve zero data loss while
working with Kafka and Spark. I have searched online and blogs have
different answers. Please let me know if anyone has an idea on this.
Blog 1:
Hello All -
Hope, you are doing good.
I have a general question. I am working on Hadoop using Apache Spark.
At this moment, we are not using Cassandra but I would like to know what's
the scope of learning and using it in the Hadoop environment.
It would be great if you could provide the use
OK, I sorted out the basic problem.
I can create a text string dynamically with 2 columns and enumerate the rows:
scala> println(text)
(1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")
(apologies if this appears twice. I sent it 24 hours ago and it hasn't hit
the list yet)
Hi,
I have a bit of an unusual use-case and would greatly appreciate some
feedback from experienced Sparklers as to whether it is a good fit for
spark.
I have a network of compute/data servers configured as
Hello,
I'm using Spark streaming to process kafka message, and wants to use a prop
file as the input and broadcast the properties:
import java.util.Properties
import java.io.FileInputStream

val props = new Properties()
props.load(new FileInputStream(args(0)))
val sc = initSparkContext()
val propsBC = sc.broadcast(props)
println(s"propFileBC 1: " + propsBC.value)
The output from score() is very small, just a float. The input, however, could
be as big as several hundred MBs. I would like to broadcast the dataset to all
executors.
Thanks,
Piero
From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana,
Hi,
I can easily do this in shell but wanted to see what I can do in Spark.
I am trying to create a simple table (10 rows, 2 columns) for now and then
register it as tempTable and store in Hive, if it is feasible.
The first column, col1, is a monotonically incrementing integer, and the second
column a
The way to use the Kryo serializer is similar to Scala, like below; the only
difference is the lack of the convenience method "conf.registerKryoClasses", but it
should be easy to make one by yourself.
conf=SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
Spark on YARN supports a customized log4j configuration by default;
RollingFileAppender can be used to avoid disk overflow, as documented below.
If you need a reference to the proper location to put log files in YARN so
that YARN can properly display and aggregate them, use
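A hedged log4j.properties sketch along those lines (log4j 1.x style, which is what Spark bundles; the sizes, pattern and file name are only examples):

log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=${spark.yarn.app.container.log.dir}/spark.log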
Hi All,
I am running Java Spark streaming jobs in yarn-client mode. Is there a way I
can manage log rollover on the edge node? I have a 10-second batch and the log file
volume is huge.
Thanks,
Pradeep
Hi Steve,
Could you share your opinion on whether the token gets renewed or not?
Is the token going to expire after 7 days anyway? Why is the change in
the recent version for token renewal? See
https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa
Pozdrawiam,
Jacek
On 21 Aug 2016, at 20:43, Mich Talebzadeh
> wrote:
Hi Kamesh,
The message you are getting after 7 days:
PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)
I would suggest you construct a toy problem and post it for a solution. At
this moment it's a little unclear what your intentions are.
Generally speaking, group by on a data frame creates another data frame,
not multiple ones.
On 23 Aug 2016 16:35, "Wen Pei Yu" wrote:
> Hi
What is the s below, before the text of the SQL?
var sales_order_sql_stmt = s"""SELECT ORDER_NUMBER, INVENTORY_ITEM_ID, ORGANIZATION_ID,
from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'yyyy-MM-dd'), 'yyyy-MM-dd') AS schedule_date
FROM sales_order_demand
WHERE
Hi Nirmal,
Filter works fine if I want to handle one of the grouped dataframes. But I have
multiple grouped dataframes, and I wish I could apply the ML algorithm to all of them
in one job, not in for loops.
Wenpei.
From: Nirmal Fernando
To: Wen Pei Yu/China/IBM@IBMCN
Cc: User
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma
wrote:
> val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner")
Ignore the last statement.
It should look something