I am new to Spark and I keep hearing that RDDs can be persisted to memory or
disk after each checkpoint. I wonder why RDDs are persisted in memory? In case
of node failure how would you access memory to reconstruct the RDD? Persisting
to disk makes sense because it is like persisting to a network filesystem.
So when do we ever need to persist an RDD on disk, given that we don't need to
worry about RAM (memory), since virtual memory will just push pages to the disk
when memory becomes scarce?
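For context, a minimal spark-shell sketch of the two choices being asked about (made-up data, purely illustrative):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)
val squares = nums.map(x => x.toLong * x)

// Keep deserialized objects in RAM only; partitions that don't fit are
// simply recomputed from the lineage when needed.
squares.persist(StorageLevel.MEMORY_ONLY)

// Alternative: spill partitions that don't fit in RAM to local disk
// instead of recomputing them.
// squares.persist(StorageLevel.MEMORY_AND_DISK)

squares.count()  // the first action materializes and caches the RDD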
On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote:
Hi Kant Kodali,
Based on the input paramet
en to choose the persistency level.
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Thanks,
Sreekanth Jella
From: kant kodali
Sent: Tuesday, August 23, 2016 2:42 PM
To: srikanth.je...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Are RDD's ever persisted to disk?
different RDD save apis for that.
Sent from my iPhone
On Aug 23, 2016, at 12:26 PM, kant kodali < kanth...@gmail.com > wrote:
OK, now that I understand an RDD can be stored to disk, my last question on this
topic would be this.
Storing an RDD to disk is nothing but storing JVM byte code to disk (in c
As mentioned, this data will be serialized before persisting to disk.
Thanks,
Sreekanth Jella
From: kant kodali
Sent: Tuesday, August 23, 2016 3:59 PM
To: Nirav
Cc: RK Aduri ; srikanth.je...@gmail.com ; user@spark.apache.org
Subject: Re: Are RDD's ever persisted to disk?
Storing RDD to disk
reconstruct an RDD
from its lineage in that case. So this sounds very contradictory to me after
reading the Spark paper.
On Tue, Aug 23, 2016 1:28 PM, kant kodali kanth...@gmail.com wrote:
@srkanth are you sure? the whole point of RDD's is to store transformations but
not the data as the spark
le to store a serialized
representation in memory because it may be more compact.
This is not the same as saving/writing an RDD to persistent storage as
text or JSON or whatever.
On Tue, Aug 23, 2016 at 9:28 PM, kant kodali wrote:
> @srkanth are you sure? the whole point of RDD's
On Tue, Aug 23, 2016 2:39 PM, RK Aduri rkad...@collectivei.com wrote:
I just had a glance. AFAIK, that is nothing do with RDDs. It’s a pickler used to
serialize and deserialize the python code.
On Aug 23, 2016, at 2:23 PM, kant kodali < kanth...@gmail.com > wrote:
@Sean
well this makes sense but I
(link to apache/spark on github.com)
On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote:
@RK you may want to look more deeply if you are curious. the code starts from
here
(link to apache/spark on github.com)
and it goes here where it is
...@collectivei.com wrote:
Can you come up with your complete analysis? A snapshot of what you think the
code is doing. Maybe that would help us understand what exactly you were trying
to convey.
On Aug 23, 2016, at 4:21 PM, kant kodali < kanth...@gmail.com > wrote:
(link to apache/spark on github.com)
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is
almost a must in practice?
Hi Guys,
I am new to Spark but I am wondering how do I compute the difference given a
bidirectional stream of numbers using Spark Streaming? To put it more concretely,
say Bank A is sending money to Bank B and Bank B is sending money to Bank A
throughout the day, such that at any given time we want to
object in your dashboard code and receive the
data in realtime and update the dashboard. You can use Node.js in your dashboard
( socket.io ). I am sure there are other ways too.
Does that help?
Sivakumaran S
On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote:
so I would need to
it should be
On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote:
@Sivakumaran when you say create a websocket object in your Spark code, I assume
you meant a Spark "task" opening a websocket connection from one of the worker
machines to some Node.js server, in that cas
On 24 August 2016 at 21:54, kant kodali < kanth...@gmail.com > wrote:
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is
almost a must in practice?
format the data in the way your client
(dashboard) requires it and write it to the websocket.
Is your driver code in Python? The link Kevin has sent should start you off.
Regards,
Sivakumaran
On 25-Aug-2016, at 11:53 AM, kant kodali < kanth...@gmail.com > wrote:
Yes, for now it will be Spark Streaming.
twork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7
)
Regards,
Sivakumaran S
On 25-Aug-2016, at 8:09 PM, kant kodali < kanth...@gmail.com > wrote:
Your assumption is right (that's what I intend to do). My driver code will be in
Java. The link sent by Kevin is an API reference to w
I don't think many apps exist that don't read/write data.
The premise here is not just replication, but partitioning data across compute
resources. With a distributed file system, your big input exists across a bunch
of machines and you can send the work to the pieces of data.
On Thu, Aug 25,
also uses ZK for leader election. There seems to be some effort in
supporting etcd, but it's in progress:
https://issues.apache.org/jira/browse/MESOS-1806
On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote:
@Ofir @Sean very good points.
@Mike We don't use Kafka or
.
S3 or NFS will not be able to provide that.
On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least Mesos took the
initiative to provide alternatives to ZK. I am just really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797
The ZFS Linux port has become very stable these days, given that LLNL maintains
the Linux port and they also use it as a filesystem for their supercomputer (the
supercomputer is one of the top in the nation, from what I heard).
On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote:
How about
Hi guys,
Are there any instructions on how to setup spark with S3 on AWS?
Thanks!
On 25 Aug 2016, at 22:49, kant kodali < kanth...@gmail.com > wrote:
Yeah, so it seems like it's a work in progress. At the very least Mesos took the
initiative to provide alternatives to ZK. I am just really looking forward to
this.
https://issues.apache.org/jira/browse/MESOS-3797
I worr
Hi,
I am unable to start Spark slaves from my master node. When I run
./start-all.sh on my master node it brings up the master but fails for the
slaves, saying "permission denied public key" for the slaves. But I did add the
master's id_rsa.pub to my slaves' authorized_keys and I checked manually from
my m
hadoopConf.set("fs.s3.awsAccessKeyId", AccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", SecretKey)
var jobInput = sc.textFile("s3://path to bucket")
Thanks
On Fri, Aug 26, 2016 at 5:16 PM, kant kodali < kanth...@gmail.com > wrote:
Hi guys,
Are there any instructions on how to setup spark with S3 on AWS?
Thanks!
On 26 August 2016 at 12:58, kant kodali < kanth...@gmail.com > wrote:
@Steve your arguments make sense, however there is a good majority of people who
have extensive experience with ZooKeeper who prefer to avoid ZooKeeper, and
Fixed. I just had to log out and log back in to the master node for some reason.
On Fri, Aug 26, 2016 5:32 AM, kant kodali kanth...@gmail.com wrote:
Hi,
I am unable to start Spark slaves from my master node. When I run ./start-all.sh
on my master node it brings up the master but fails for the slaves
Is there an HTTP/2 (v2) endpoint for Spark Streaming?
Why HTTP/2? #curious
[1] http://bahir.apache.org/
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Fri, Aug 26, 2016 at 9:42 PM, kant kodali < kanth...@gmail.com > wrote:
es these days communicate using HTTP. HTTP2 for
better performance.
On Fri, Aug 26, 2016 2:47 PM, kant kodali kanth...@gmail.com wrote:
HTTP/2 for fully pipelined out-of-order execution. In other words, I should be able
to send multiple requests through the same TCP connection and, by out-of-order
execut
I see an example on how
I can tell spark cluster to use Cassandra for checkpointing and others if at
all.
On Fri, Aug 26, 2016 9:50 AM, Steve Loughran ste...@hortonworks.com wrote:
On 26 Aug 2016, at 12:58, kant kodali < kanth...@gmail.com > wrote:
@Steve your arguments make sense howev
I understand that I cannot use the Spark Streaming window operation without
checkpointing to HDFS, but without window operations I don't think we can do much
with Spark Streaming. So since it is very essential, can I use Cassandra as
distributed storage? If so, can I see an example of how I can tell sp
java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A
ReceiverSupervisor has not been attached to the receiver yet. Maybe you are
starting some computation in the receiver before the Receiver.onStart() has been
called.
import org.apache.spark.SparkConf; import org.apache.spark.st
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?
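For reference, a minimal sketch of the pattern the custom-receivers guide describes (a made-up receiver that only emits dummy strings). The ReceiverSupervisor is attached by Spark itself when the stream starts, so store() should only be called from onStart() onwards, typically from a thread spawned there:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Spawn a worker thread; onStart() itself must return quickly.
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("event at " + System.currentTimeMillis())  // safe: the supervisor is attached by now
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // The worker thread checks isStopped(), so nothing else to clean up here.
  }
}

// Hypothetical usage: val lines = ssc.receiverStream(new DummyReceiver())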
C'mon man, this is a no-brainer. Dynamically typed languages for large code bases or
large-scale distributed systems make absolutely no sense. I can write a 10-page
essay on why that wouldn't work so great. You might be wondering why Spark would
have it then? Well, probably because of its ease of use for ML
ssException complaining about non-matching
serialVersionUIDs. Shouldn't that be caused by different jars on executors and
driver?
On 03.09.2016 at 1:04 PM, "Tal Grynbaum" wrote:
My guess is that you're running out of memory somewhere. Try to increase the
driver memory and/or executor memory.
@Fridtjof you are right!
Changing it to this fixed it!
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
On Sat, Sep 3, 2016 12:30 PM
Hi Guys,
I am running my driver program on my local machine and my Spark cluster is on
AWS. The big question is I don't know what the right settings are to get around
this public and private IP thing on AWS. My spark-env.sh currently has the
following lines:
export SPARK_PUBLIC_DNS="52.44.36.2
Not sure how to fix this
or what I am missing?
Any help would be great. Thanks!
On Sat, Sep 3, 2016 5:39 PM, kant kodali kanth...@gmail.com
wrote:
Hi Guys,
I am running my driver program on my local machine and my spark cluster is on
AWS. The big question is I don't know what are the rig
Hi All,
I am trying to simplify how to frame my question, so below is my code. I see
that BAR gets printed but not FOO and I am not sure why. My batch interval
is 1 second (something I pass in when I create the Spark context). Any idea?
I have a bunch of events and I want to store the number of events
Hi Guys,
I have a bunch of data coming in to my Spark Streaming cluster from a message
queue (not Kafka). This message queue guarantees at-least-once delivery only,
so there is a potential that some of the messages that come in to the Spark
Streaming cluster are actually duplicates, and I am trying t
What is the difference between mini-batch vs real-time streaming in practice
(not theory)? In theory, I understand mini-batch is something that batches in
a given time frame whereas real-time streaming is more like doing something as
the data arrives, but my biggest question is why not have mini-batc
ion.
On 27 September 2016 at 08:12, kant kodali wrote:
What is the difference between mini-batch vs real-time streaming in practice
(not theory)? In theory, I understand mini-batch is something that batches in
a given time frame whereas real-time streaming is more like doing something as
the data
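For reference, the "mini-batch" in Spark Streaming is just the batch duration handed to the StreamingContext; a minimal sketch with a hypothetical socket source:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("mini-batch-example")  // master supplied by spark-submit
val ssc = new StreamingContext(conf, Seconds(1))  // every 1s of received data becomes one RDD

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
lines.count().print()  // runs once per 1-second batch

ssc.start()
ssc.awaitTermination()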
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStre
I am running locally so they all are on one host
On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com
wrote:
Are all spark and scala versions the same? By "all" I mean the master, worker
and driver instances.
How to disable or do minimal logging for the Apache Spark client driver program? I
couldn't find this information in the docs. By driver program I mean the Java
program where I initialize the Spark context. It produces a lot of INFO messages, but I
would like to know only when there is an error or an exception such
Are you
referring to Spark local mode? It is possible to also run Spark
applications in "distributed mode" (i.e. standalone, yarn or
mesos) just from the command line, however that will require
using spark's launcher interface and bundling your application in
a jar.
On Thu, Oct 6, 2016
On Wed, Oct 5, 2016 at 3:35 PM, kant kodali wrote:
I am running locally so they all are on one host
On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com
wrote:
Are all spark and scala versions the same? By "all" I mean the master, worker
and driver instances.
perfect! That fixes it all!
On Fri, Oct 7, 2016 1:29 AM, Denis Bolshakov bolshakov.de...@gmail.com
wrote:
You need to have spark-sql, now you are missing it.
On 7 Oct 2016 at 11:12, "kant kodali" wrote:
Here are the jar files on my classpath after doing a grep for
I am currently not using Spark Streaming. I have an ETL pipeline and I want to
just resubmit the job after it is done, like a typical cron job. Is that
possible?
"standalone" in the past.
--Jakob
On Thu, Oct 6, 2016 at 10:30 PM, kant kodali wrote:
Hi Jakob,
It is the biggest question for me too, since I seem to be on a different page
than everyone else whenever I say "I am also using spark standalone mode
and I don
I tried spanBy but it looks like there is a strange error happening no matter
which way I try, like the one described here for the Java solution:
http://qaoverflow.com/question/how-to-use-spanby-in-java/
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$Serializ
Hi Reynold,
Actually, I did that well before posting my question here.
Thanks, kant
On Sun, Oct 9, 2016 8:48 PM, Reynold Xin r...@databricks.com
wrote:
You should probably check with DataStax who build the Cassandra connector for
Spark.
On Sun, Oct 9, 2016 at 8:13 PM, kant kodali wrote
'META-INF/*.DSA'
zip64 true }
This successfully creates the jar but the error still persists.
On Sun, Oct 9, 2016 11:44 PM, Shixiong(Ryan) Zhu shixi...@databricks.com
wrote:
Seems the runtime Spark is different from the compiled one. You should mark the
Spark components "provided".
+1 Wooho I have the same problem. I have been trying hard to fix this.
On Mon, Oct 10, 2016 3:23 AM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com
wrote:
Hi,
If I change the parameter inside the setMaster() to "local", the program runs.
Is there something wrong with the cluster installa
d lead to the bug you're seeing. A common
>>> reason for a mismatch is if the SPARK_HOME environment variable is set.
>>> This will cause the spark-submit script to use the launcher determined by
>>> that environment variable, regardless of the directory from which it
Hi Guys,
My Spark Streaming client program works fine as long as the receiver
receives data, but say my receiver has no more data to receive for a few
hours (4-5 hours) and then it starts receiving data again; at that
point the Spark client program doesn't seem to process any data. It n
"dag-scheduler-event-loop" java.lang.OutOfMemoryError: unable to create
new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(
ForkJoinPool.java:1672)
at scala.co
Another thing I forgot to mention is that it happens after running for
several hours, say 4 to 5 hours. I am not sure why it is creating so many
threads. Any way to control them?
On Fri, Oct 28, 2016 at 12:47 PM, kant kodali wrote:
> "dag-scheduler-event-loop" java.lang.Out
On Sun, Oct 30, 2016 at 5:22 PM, Chan Chor Pang
wrote:
> /etc/security/limits.d/90-nproc.conf
>
Hi,
I am using Ubuntu 16.04 LTS. I have this directory /etc/security/limits.d/
but I don't have any files underneath it. This error happens after running
for 4 to 5 hours. I wonder if this is a GC is
> either way
> the JVM process will still not be able to create a new thread.
>
> btw the default limit for CentOS is 1024
>
>
> On 10/31/16 9:51 AM, kant kodali wrote:
>
>
> On Sun, Oct 30, 2016 at 5:22 PM, Chan Chor Pang
> wrote:
>
>> /etc/security/limits.d/90-nproc.con
when I did this
cat /proc/sys/kernel/pid_max
I got 32768
On Sun, Oct 30, 2016 at 6:36 PM, kant kodali wrote:
> I believe for ubuntu it is unlimited but I am not 100% sure (I just read
> somewhere online). I ran ulimit -a and this is what I get
>
> core file size (blocks,
> other user.
>
>
> On 10/31/16 10:38 AM, kant kodali wrote:
>
> when I did this
>
> cat /proc/sys/kernel/pid_max
>
> I got 32768
>
> On Sun, Oct 30, 2016 at 6:36 PM, kant kodali wrote:
>
>> I believe for ubuntu it is unlimited but I am not 100% sure (
still suspecting the user,
> as the user who runs spark-submit is not necessarily the owner of the JVM
> process
>
> can you make sure that when you "ps -ef | grep {your app id}" the PID is root?
> On 10/31/16 11:21 AM, kant kodali wrote:
>
> The java process is run by the root
when I do
ps -elfT | grep "spark-driver-program.jar" | wc -l
The result is around 32K. Why does it create so many threads and how can I
limit this?
m to not create so many?
On Mon, Oct 31, 2016 at 3:25 AM, Sean Owen wrote:
> ps -L [pid] is what shows threads. I am not sure this is counting what you
> think it does. My shell process has about a hundred threads, and I can't
> imagine why one would have thousands unless your app spa
> It would depend on your driver program. Do you spawn any threads in
> it? Could you share some more information on the driver program, spark
> version and your environment? It would greatly help others to help you
>
> On Mon, Oct 31, 2016 at 3:47 AM, kant kodali wrote:
> > The source of my p
I am also under the assumption that the *onStart* function of the Receiver is
only called once by Spark. Please correct me if I am wrong.
On Mon, Oct 31, 2016 at 11:35 AM, kant kodali wrote:
> My driver program runs a spark streaming job. And it spawns a thread by
> itself only
which types of threads are leaking?
>
> On Mon, Oct 31, 2016 at 11:50 AM, kant kodali wrote:
>
>> I am also under the assumption that the *onStart* function of the Receiver is
>> only called once by Spark. Please correct me if I am wrong.
>>
>> On Mon, Oct 31, 201
if the leak threads are in the driver side.
>
> Does it happen in the driver or executors?
>
> On Mon, Oct 31, 2016 at 12:20 PM, kant kodali wrote:
>
>> Hi Ryan,
>>
>> Ahh My Receiver.onStop method is currently empty.
>>
>> 1) I have a hard time seeing why th
> Could you use `jstack` to find out
> the name of the leaking threads?
>
> On Mon, Oct 31, 2016 at 12:35 PM, kant kodali wrote:
>
>> Hi Ryan,
>>
>> It happens on the driver side and I am running on a client mode (not the
>> cluster mode).
>>
>> Thanks!
>>
>
te:
> Have you tried to get the number of threads in a running process using `cat
> /proc/<pid>/status`?
>
> On Sun, Oct 30, 2016 at 11:04 PM, kant kodali wrote:
>
>> yes I did run ps -ef | grep "app_name" and it is root.
>>
>>
>>
>> On Sun, Oct 30, 20
Here is a UI of my thread dump.
http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng==
On Mon, Oct 31, 2016 at 10:32 PM, kant kodali wrote:
> Hi Vadim,
>
> Thank you so much this w
Here is a UI of my thread dump.
http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng==
On Mon, Oct 31, 2016 at 7:10 PM, kant kodali wrote:
> Hi Ryan,
>
> I think you are right. Thi
This question looks very similar to mine but I don't see any answer.
http://markmail.org/message/kkxhi5jjtwyadzxt
On Mon, Oct 31, 2016 at 11:24 PM, kant kodali wrote:
> Here is a UI of my thread dump.
>
> http://fastthread.io/my-thread-report.jsp?p=c2
doubly-ended queues in the
ForkJoinPool.
Thanks!
On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen wrote:
> Possibly https://issues.apache.org/jira/browse/SPARK-17396 ?
>
> On Tue, Nov 1, 2016 at 2:11 AM kant kodali wrote:
>
>> Hi Ryan,
>>
>> I think you are right. This ma
@Sean It looks like this problem can happen with other RDDs as well, not
just UnionRDD.
On Tue, Nov 1, 2016 at 2:52 AM, kant kodali wrote:
> Hi Sean,
>
> The comments seem very relevant although I am not sure if this pull
> request https://github.com/apache/spark/pull/14985 w
AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0
On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu wrote:
> Dstream "Window" uses "union" to combine multiple RDDs in one window into
> a single RDD.
>
> On Tue, Nov 1, 2016 at 2:59 AM kant kodali
Looks like upgrading to Spark 2.0.1 fixed it! The thread count now when I
do cat /proc/<pid>/status is about 84, as opposed to about 1000 in the span of 2
minutes on Spark 2.0.0.
On Tue, Nov 1, 2016 at 11:40 AM, Shixiong(Ryan) Zhu wrote:
> Yes, try 2.0.1!
>
> On Tue, Nov 1, 2016 at 11:25 AM, ka
Hi Guys,
I have a random idea and it would be great to receive some input.
Can we have an HTTP/2-based receiver for Spark Streaming? I am wondering why
not build microservices using Spark when needed? I can see it is not meant
for that but I like to think it can be possible. To be more concrete, h
I don't see a store() call in your receive().
Search for store() in here: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
On Wed, Nov 2, 2016 at 10:23 AM, Cassa L wrote:
> Hi,
> I am using spark 1.6. I wrote a custom receiver to read from WebSocket.
> But when I start my spa
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream =
  KafkaUtils.createDirectStream(ssc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
How to use Spark SQL to connect to Cassandra from spark-shell?
Any examples? I use Java 8.
Thanks!
kant
https://github.com/datastax/spark-cassandra-connector
>
>
> Yong
>
>
>
> ------
> *From:* kant kodali
> *Sent:* Friday, November 11, 2016 11:04 AM
> *To:* user @spark
> *Subject:* How to use Spark SQL to connect to Cassandra from Spark-Shell?
>
> H
https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics
On Fri, Nov 11, 2016 at 8:11 AM, kant kodali wrote:
> Hi,
>
> This is spark-cassandra-connector
> <https://github.com/datastax/spark-cassandra-connector> but I am looking
> more for
Wait, I cannot create CassandraSQLContext from spark-shell. Is this only for
enterprise versions?
Thanks!
On Fri, Nov 11, 2016 at 8:14 AM, kant kodali wrote:
> https://academy.datastax.com/courses/ds320-analytics-
> apache-spark/spark-sql-spark-sql-basics
>
> On Fri, Nov 11, 201
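In case it helps, a hedged sketch of reading a Cassandra table through the DataFrame API with the open-source spark-cassandra-connector (rather than CassandraSQLContext), assuming the connector package is on the spark-shell classpath; the host, keyspace, and table names below are made up and the package version is only illustrative:

// Start the shell with something like:
// ./spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 \
//   --conf spark.cassandra.connection.host=127.0.0.1

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "hello"))
  .load()

// Register it so plain Spark SQL can be used against it.
df.createOrReplaceTempView("hello")
spark.sql("select count(*) from hello").show()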
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets
"Spark SQL can automatically infer the schema of a JSON dataset and load it
as a DataFrame. This conversion can be done using SQLContext.read.json() on
either an RDD of String, or a JSON file."
val df = spark.sql("SELECT
How does predicate push down really help? in the following cases
val df1 = spark.sql("select * from users where age > 30")
vs
val df1 = spark.sql("select * from users")
df.filter("age > 30")
robably better) example would be something like having two tables
> A and B which are joined by some common key. Then a filtering is done on
> the key. Moving the filter to be before the join would probably make
> everything faster, as filter is a faster operation than a join.
>
time taken by the
> filter itself.
>
>
>
> BTW. You can see the differences between the original plan and the
> optimized plan by calling explain(true) on the dataframe. This would show
> you what was parsed, how the optimization worked and what was physically
> run.
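A minimal sketch of that suggestion, reusing the hypothetical users table from earlier in this thread:

val df = spark.sql("select * from users").filter("age > 30")
// Prints the parsed, analyzed, optimized and physical plans; with predicate
// pushdown the age filter should show up inside the data source scan.
df.explain(true)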
Thanks for the effort and clear explanation.
On Thu, Nov 17, 2016 at 12:07 AM, kant kodali wrote:
> Yes, that's how I understood it with your first email as well, but the key
> thing here sounds like some data sources may not have operators such as
> filter and so on, in which case Spark St
Which parts in the diagram above are executed by DataSource connectors and
which parts are executed by Tungsten? Or, to put it another way, in which
phase in the diagram above does Tungsten leverage the DataSource
connectors (such as, say, the Cassandra connector)?
My understanding so far is that con
e mapping your JSON payloads to
> tractable data structures will depend on business logic.
>
> The strategy of pulling out a blob into its own RDD and feeding it into the
> JSON loader should work for any data source once you have your data
> strategy figured out.
>
> On Wed,
Yeah, I feel like this is a bug since you can't really modify the settings
once you are given a spark session or spark context. So the workaround
would be to use --conf. In your case it would be like this:
./spark-shell --conf spark.kryoserializer.buffer.max=1g
On Thu, Nov 17, 2016 at 1:59 PM, Ko
Hi All,
I would like to flatten JSON blobs into a DataFrame using Spark/Spark SQL
inside spark-shell.
val df = spark.sql("select body from test limit 3"); // body is a json
encoded blob column
val df2 = df.select(df("body").cast(StringType).as("body"))
when I do
df2.show // shows the 3 rows
This seems to work:
import org.apache.spark.sql._
val rdd = df2.rdd.map { case Row(j: String) => j }
spark.read.json(rdd).show()
However, I wonder if there is any inefficiency here, since I have to apply this
function to a billion rows.
How to expose spark-shell in production?
1) Should we expose it on master nodes or executor nodes?
2) Should we simply give access to those machines and the spark-shell binary?
What is the recommended way?
Thanks!
main/scala/org/apache/spark/sql/functions.scala#L2902>
> function that I think will do what you want.
>
> On Fri, Nov 18, 2016 at 2:29 AM, kant kodali wrote:
>
>> This seem to work
>>
>> import org.apache.spark.sql._
>> val rdd = df2.rdd.map { ca
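If the function being referred to is from_json (added around Spark 2.1; the truncated link above makes it hard to be certain), a sketch of the same flattening without the intermediate RDD, with a made-up schema, might look like this:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical schema for the JSON blobs in the "body" column.
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

// Parse the JSON blob column directly, without a detour through an RDD[String].
val parsed = df2.select(from_json($"body", schema).as("body"))
parsed.select("body.id", "body.amount").show()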
On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust
wrote:
> The first release candidate should be coming out this week. You can
> subscribe to the dev list if you want to follow the release schedule.
>
> On Mon, Nov 21, 2016 at 9:34 PM, kant kodali wrote:
>
>> Hi Michael,
>&
Hi All,
Spark Shell doesn't seem to use the Spark workers but Spark Submit does. I had
the workers' IPs listed under the conf/slaves file.
I am trying to count the number of rows in Cassandra using spark-shell, so I do
the following on the Spark master:
val df = spark.sql("SELECT test from hello") // This has abo