Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via
spark-submit) will begin its processing even though it apparently did not
get all of the requested resources, and it then runs very slowly.
Is there a way to force Spark/YARN to only begin when it has the full set
of
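One relevant knob, assuming Spark 1.2+ on YARN (an assumption; check your
version's docs for exact names and formats), is to make the scheduler wait
until a given fraction of the requested executors have registered before it
starts scheduling tasks:

    spark-submit ... \
      --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
      --conf spark.scheduler.maxRegisteredResourcesWaitingTime=60s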
Thanks, will try this out and get back...
On Tue, Jun 23, 2015 at 2:30 AM, Tathagata Das t...@databricks.com wrote:
Try adding the provided scope to the Spark dependencies, e.g.:

    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <scope>provided</scope>
    </dependency>
Are you using checkpointing?
I had a similar issue when recreating a streaming context from checkpoint
as broadcast variables are not checkpointed.
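One common workaround, sketched minimally in Scala (the map-shaped model and
the names are illustrative), is to keep the broadcast in a lazily initialized
singleton so the executors rebuild it after recovery instead of expecting it
to come back from the checkpoint:

    object ModelBroadcast {
      import org.apache.spark.SparkContext
      import org.apache.spark.broadcast.Broadcast

      @volatile private var instance: Broadcast[Map[String, Double]] = _

      // Rebuilds the broadcast lazily after a restore, instead of relying
      // on the checkpointed DStream graph to carry it.
      def getInstance(sc: SparkContext, model: => Map[String, Double]): Broadcast[Map[String, Double]] = {
        if (instance == null) synchronized {
          if (instance == null) instance = sc.broadcast(model)
        }
        instance
      }
    }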
On 23 Jun 2015 5:01 pm, Nipun Arora nipunarora2...@gmail.com wrote:
Hi,
I have a spark streaming application where I need to access a model saved
btw. just for reference I have added the code in a gist:
https://gist.github.com/nipunarora/ed987e45028250248edc
and a stackoverflow reference here:
http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming
On Tue, Jun 23, 2015 at 11:01 AM, Nipun
It all depends on what it is you need to do with the pages. If you’re just
going to be collecting them then it’s really not much different than a
groupByKey. If instead you’re looking to derive some other value from the
series of pages then you could potentially partition by user id and run a
I don't think I have explicitly check-pointed anywhere. Unless it's
internal in some interface, I don't believe the application is checkpointed.
Thanks for the suggestion though..
Nipun
On Tue, Jun 23, 2015 at 11:05 AM, Benjamin Fradet benjamin.fra...@gmail.com
wrote:
Are you using
Hi,
I have a spark streaming application where I need to access a model saved
in a HashMap.
I have *no problems running the same code with broadcast variables in the
local installation.* However, I get a *null pointer exception* when I
deploy it on my spark test cluster.
I have stored a
Thanks! The solution:
https://gist.github.com/dokipen/018a1deeab668efdf455
On Mon, Jun 22, 2015 at 4:33 PM Davies Liu dav...@databricks.com wrote:
Right now, we cannot figure out which column you referenced in
`select` if there are multiple columns with the same name in the joined
DataFrame.
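One common way around the ambiguity, sketched in Scala (illustrative names,
assuming Spark 1.4's alias support), is to alias each side of the join and
select through the aliases:

    import sqlContext.implicits._  // for the $"..." column syntax
    val left  = df1.as("l")
    val right = df2.as("r")
    // Qualify the ambiguous columns through the aliases
    val joined = left.join(right, $"l.id" === $"r.id")
      .select($"l.name", $"r.name")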
Hi,
I'm running a program in Spark 1.4 where several Spark Streaming contexts
are created from the same Spark context. As pointed out in
https://spark.apache.org/docs/latest/streaming-programming-guide.html, each
Spark Streaming context is stopped before creating the next Spark Streaming
context. The
Thanks. Yes, unfortunately, they all need to be grouped. I guess I can
partition the records by user id. However, I have millions of users; do you
think partitioning by user id will help?
Jianguo
On Mon, Jun 22, 2015 at 6:28 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
You’re right
Thank you Tathagata,
It is great to know about this issue, but our problem is a little bit
different. We have 3 nodes in our Spark cluster, and when the Zookeeper
leader dies, the Spark master gets shut down and remains down, but a new
master gets elected and loads the UI. I think if the problem
Does that issue happen only in the Python DSL?
On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote:
Thanks! The solution:
https://gist.github.com/dokipen/018a1deeab668efdf455
On Mon, Jun 22, 2015 at 4:33 PM Davies Liu dav...@databricks.com wrote:
Right now, we cannot figure out which
I logged this Jira this morning:
https://issues.apache.org/jira/browse/SPARK-8566
I'm curious if any of the cognoscenti can advise as to a likely cause of
the problem?
I found the error so just posting on the list.
It seems broadcast variables cannot be declared static.
If you do, you get a null pointer exception.
Thanks
Nipun
On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora nipunarora2...@gmail.com
wrote:
btw. just for reference I have added the code in a
I'm wondering if there is a real benefit to splitting my memory in two for
the datanode/workers.
Datanodes and the OS need memory to perform their business. I suppose there
could be a loss of performance if they came to compete for memory with the
worker(s).
Any opinion? :-)
just a heads up, i was doing some basic coding using DataFrame, Row,
StructType, etc. and i ended up with deadlocks in my sbt tests due to the
usage of ScalaReflectionLock.synchronized in the spark sql code.
the issue went away when i changed my tests to run consecutively...
I've only tried it in python
On Tue, Jun 23, 2015 at 12:16 PM Ignacio Blasco elnopin...@gmail.com
wrote:
Does that issue happen only in the Python DSL?
On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote:
Thanks! The solution:
https://gist.github.com/dokipen/018a1deeab668efdf455
On
64GB in parquet could be many billions of rows because of the columnar
compression. And count distinct by itself is an expensive operation. This
is not just on Spark, even on Presto/Impala, you would see performance dip
with count distincts. And the cluster is not that powerful either.
The one
So the situation is the following: I've got a spray server with a spark
context available (fair scheduling in cluster mode, via spark-submit). There
are some http endpoints which call into spark rdds, collecting information
from accumulo / hdfs / etc (using rdds). I noticed that there is a sort of
Hi,
can you detail the symptom further? Was it that only 12 requests were
serviced and the other 440 timed out? I don't think that Spark is well
suited for this kind of workload, or at least the way it is being
presented. How long does a single request take Spark to complete?
Even with fair
Mind filing a JIRA?
On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote:
just a heads up, i was doing some basic coding using DataFrame, Row,
StructType, etc. and i ended up with deadlocks in my sbt tests due to the
usage of
ScalaReflectionLock.synchronized in the spark
Hi folks, I have been using Spark against an external Metastore service
which runs Hive with CDH 4.6.
In Spark 1.2, I was able to successfully connect by building with the
following:
./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0
-Phive-thriftserver -Phive-0.12.0
I see that in
Hi,
Not sure if this will fit your needs, but if you are trying to
collect+chart some metrics specific to your app, yet want to correlate them
with what's going on in Spark, maybe Spark's performance numbers, you may
want to send your custom metrics to SPM, so they can be
You must not include spark-core and spark-streaming in the assembly. They
are already present in the installation, and the presence of multiple
versions of spark may throw off the classloaders in weird ways. So build the
assembly marking those dependencies as scope=provided.
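If the assembly is built with sbt rather than Maven, the equivalent
(version numbers illustrative) would be:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided"
    )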
On Tue, Jun 23,
Hi,
I work at a large financial institution in New York. We're looking into
Spark and trying to learn more about the deployment/use cases for real-time
analytics with Spark. When would it be better to deploy standalone Spark
versus Spark on top of a more comprehensive data management layer
Aaah, this could potentially be a major issue, as it may prevent metrics from
a restarted streaming context from being published. Can you make it a JIRA?
TD
On Tue, Jun 23, 2015 at 7:59 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I'm running a program in Spark 1.4 where
hi
While using spark streaming (1.2) with kafka, I am getting the below error:
receivers are getting killed, but jobs still get scheduled at each stream
interval.
15/06/23 18:42:35 WARN TaskSetManager: Lost task 0.1 in stage 18.0 (TID 82,
ip(XX)): java.io.IOException: Failed to connect to
Yes, this is a known behavior. Some static state is not serialized as part
of a task.
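A minimal Scala sketch of the failing vs. working shape (sc, modelMap and
rdd are assumed to exist; names are illustrative):

    // Fails on a cluster: a broadcast stored in a singleton ("static")
    // field is not shipped with the task, so executors may see null.
    object Holder {
      var model: org.apache.spark.broadcast.Broadcast[Map[String, Double]] = _
    }

    // Works: capture the broadcast handle in a local val so it is
    // serialized into the task closure.
    val bcModel = sc.broadcast(modelMap)
    rdd.map(x => bcModel.value.getOrElse(x, 0.0))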
On Tue, Jun 23, 2015 at 10:24 AM, Nipun Arora nipunarora2...@gmail.com
wrote:
I found the error so just posting on the list.
It seems broadcast variables cannot be declared static.
If you do, you get a null
probably there are already running jobs there.
in addition, memory is also a resource, so if you are running 1 application
that took all your memory and then you are trying to run another
application that asks for memory the cluster doesn't have, then the second
app won't run.
so why are u
This should give an accurate count for each batch, though for getting the rate
you have to make sure that your streaming app is stable, that is, batches
are processed as fast as they are received (the scheduling delay in the spark
streaming UI is approx 0).
TD
On Tue, Jun 23, 2015 at 2:49 AM, anshu
Why are you mixing spark versions between streaming and core??
Your core is 1.2.0 and streaming is 1.4.0.
On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora shushantaror...@gmail.com
wrote:
It throws an exception for WriteAheadLogUtils after excluding the core and
streaming jars.
Exception in thread
Hi,
So I've done this Node-centered accumulator, and I've written a small
piece about it:
http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/
Hope it can help someone
Guillaume
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com
There should be no difference, assuming you don't use the intermediately
stored rdd values you are creating (rdd1, rdd2) for anything else. In the
first example it still creates these intermediate rdd objects; you are
just using them implicitly and not storing the value.
It's also worth
It throws an exception for WriteAheadLogUtils after excluding the core and
streaming jars.
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/util/WriteAheadLogUtils$
at
org.apache.spark.streaming.kafka.KafkaUtils$.createStream(KafkaUtils.scala:84)
at
Hello,
I am trying to use the new Kafka consumer KafkaUtils.createDirectStream,
but I am having some issues making it work.
I have tried different versions of Spark v1.4.0 and branch-1.4 #8d6e363 and
I am still getting the same strange exception ClassNotFoundException:
$line49.$read$$iwC$$i
I have 30G per machine.
This is the first (and only) job I'm trying to submit. So it's weird that
for --total-executor-cores=20 it works, and for --total-executor-cores=4 it
doesn't.
On Tue, Jun 23, 2015 at 10:46 PM, Igor Berman igor.ber...@gmail.com wrote:
probably there are already running
Thanks a lot. It worked after keeping all the versions the same: 1.2.0.
On Wed, Jun 24, 2015 at 2:24 AM, Tathagata Das t...@databricks.com wrote:
Why are you mixing spark versions between streaming and core??
Your core is 1.2.0 and streaming is 1.4.0.
On Tue, Jun 23, 2015 at 1:32 PM, Shushant Arora
Hey,
I’m trying to run kafka spark streaming using mesos with the following example:
sc.stop
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import
The $line49 in the exception refers to a line of the spark shell.
Have you tried it from an actual assembled job with spark-submit?
On Tue, Jun 23, 2015 at 3:48 PM, syepes sye...@gmail.com wrote:
Hello,
I am trying to use the new Kafka consumer
KafkaUtils.createDirectStream
but I am
Hi All,
What is the difference between the two forms below, in terms of execution on
a cluster with 1 or more worker nodes?
rdd.map(...).map(...)...map(..)
vs
val rdd1 = rdd.map(...)
val rdd2 = rdd1.map(...)
val rdd3 = rdd2.map(...)
Thanks,
Ashish
It seems that it doesn't happen in the Scala API. Not exactly the same as in
python, but pretty close.
https://gist.github.com/elnopintan/675968d2e4be68958df8
2015-06-23 23:11 GMT+02:00 Davies Liu dav...@databricks.com:
I think it also happens in the DataFrames API of all languages.
On Tue, Jun 23,
I think it also happens in the DataFrames API of all languages.
On Tue, Jun 23, 2015 at 9:16 AM, Ignacio Blasco elnopin...@gmail.com wrote:
Does that issue happen only in the Python DSL?
On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote:
Thanks! The solution:
I cannot. I've already limited the number of cores to 10, so it gets 5
executors with 2 cores each...
On Tue, 23.06.2015 at 13:45, Akhil Das ak...@sigmoidanalytics.com
wrote:
Use *spark.cores.max* to limit the CPU per job, then you can easily
accommodate your third job also.
Thanks
Hi!
I want to integrate flume with spark streaming. I want to know which Flume
sink types are supported by spark streaming. I saw one example using
avroSink.
Thanks
In the case that memory cannot hold all the cached RDDs, the BlockManager
will evict older blocks to make room for new RDD blocks.
Hope that helps.
2015-06-24 13:22 GMT+08:00 bit1...@163.com bit1...@163.com:
I am kind of confused about when a cached RDD will unpersist its data. I
know we
I am kind of confused about when a cached RDD will unpersist its data. I know
we can explicitly unpersist it with RDD.unpersist, but can it be unpersisted
automatically by the spark framework?
Thanks.
bit1...@163.com
to give a bit more data on what I'm trying to get -
I have many tasks I want to run in parallel, so I want each task to take a
small part of the cluster (only a limited part of my 20 cores in the
cluster).
I have important tasks that I want to get 10 cores, and I have small
tasks that I want
I am invoking it from the Java application by creating the SparkContext.
On Tue, Jun 23, 2015 at 12:17 PM, Tathagata Das t...@databricks.com wrote:
How are you adding that to the classpath? Through spark-submit or
otherwise?
On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri
How are you adding that to the classpath? Through spark-submit or otherwise?
On Mon, Jun 22, 2015 at 5:02 PM, Murthy Chelankuri kmurt...@gmail.com
wrote:
Yes, I have the producer in the classpath, and I am using it in standalone
mode.
Sent from my iPhone
On 23-Jun-2015, at 3:31 am, Tathagata
So you have Kafka in the classpath of your Java application, where you are
creating the SparkContext with the spark standalone master URL, right?
The recommended way of submitting spark applications to any cluster is
using spark-submit. See
queue stream does not support driver checkpoint recovery, since the RDDs in
the queue are arbitrarily generated by the user and it's hard for Spark
Streaming to keep track of the data in the RDDs (that's necessary for
recovering from a checkpoint). Anyway, queue stream is meant for testing and
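For reference, a minimal Scala sketch of what queue stream is for (ssc is
assumed to be an existing StreamingContext):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val queue = mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(queue)          // fine for tests
    queue += ssc.sparkContext.makeRDD(1 to 100)
    // but the queued RDDs cannot be recovered from a checkpoint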
Why don't you do a normal .saveAsTextFiles?
Thanks
Best Regards
On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla anshushuk...@gmail.com
wrote:
Thanx for the reply!!
YES, either it should write on any machine of the cluster, or can you please
help me with how to do this. Previously i was
Hi All,
I was trying to store a trained model on the local hard disk. I am able to
save it using the save() function. While trying to retrieve the stored model
using the load() function, I end up with the following error. Kindly help me
on this.
scala> val sameModel =
hi,
I'm running a spark standalone cluster with 5 slaves, each of which has 4
cores. When I run a job with the following configuration:
/root/spark/bin/spark-submit -v --total-executor-cores 20
--executor-memory 22g --executor-cores 4 --class
com.windward.spark.apps.MyApp --name dev-app
Maybe this is a known issue with spark streaming and master web ui. Disable
event logging, and it should be fine.
https://issues.apache.org/jira/browse/SPARK-6270
On Mon, Jun 22, 2015 at 8:54 AM, scar scar scar0...@gmail.com wrote:
Sorry I was on vacation for a few days. Yes, it is on. This is
yes, in spark standalone mode with the master URL.
Jars are copied to the executor and the application is running fine, but it's
failing at some point when kafka is trying to load the classes using some
reflection mechanisms for loading the Encoder and Partitioner classes.
Here are my findings so
Looks like a hostname conflict to me.
15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves
to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface
eth0)
15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
Can you paste
Maybe while producing the messages, you can make it a KeyedMessage with
the timestamp as the key, and on the consumer end you can easily read the
key (which will be the timestamp) from the message. If the network is fast
enough, then i think there would only be a small millisecond lag.
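A sketch with the old 0.8 producer API (the producer, topic, and payload
are assumed/illustrative):

    import kafka.producer.KeyedMessage
    // the key carries the send time; the consumer reads it back to measure lag
    producer.send(new KeyedMessage[String, String](
      "events", System.currentTimeMillis.toString, payload))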
Thanks
Best
Hi,
I am using spark 1.3.1 and have 2 receivers.
On the web UI, I can only see the total records received by these 2
receivers combined, but I can't figure out the records received by each
individual receiver. Not sure whether that information is shown on the UI in
spark 1.4.
bit1...@163.com
This could be because of some subtle change in the classloaders used by
executors. I think there have been issues in the past with libraries that
use Class.forName to find classes by reflection. Because the executors load
classes dynamically using custom class loaders, libraries that use
Hi Samsudhin,
If possible, can you please provide part of the code? Or perhaps try with
the unit test in RandomForestSuite to see if the issue repros.
Regards,
yuhao
-----Original Message-----
From: samsudhin [mailto:samsud...@pigstick.com]
Sent: Tuesday, June 23, 2015 2:14 PM
To:
Well, you could say that (the Stage information) is an ASCII representation
of the WebUI (running on port 4040). Since you set local[4] you have 4
threads for your computation, and since you have 2 receivers, you are
left with 2 threads to process ((0 + 2) -- this 2 is your 2 threads). And
the
Did you happen to try this?

    JavaPairRDD<Integer, String> hadoopFile = sc.hadoopFile(
        "/sigmoid", DataInputFormat.class, LongWritable.class,
        Text.class);
Thanks
Best Regards
On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote:
Hello, everyone! I'm new to spark.
Hi,
I have a spark streaming application that runs locally with two receivers; a
code snippet is as follows:
conf.setMaster("local[4]")
//RPC Log Streaming
val rpcStream = KafkaUtils.createStream[String, String, StringDecoder,
StringDecoder](ssc, consumerParams, topicRPC,
Hi, Akhil,
Thank you for the explanation!
bit1...@163.com
From: Akhil Das
Date: 2015-06-23 16:29
To: bit1...@163.com
CC: user
Subject: Re: What does [Stage 0: (0 + 2) / 2] mean on the console
Well, you could say that (the Stage information) is an ASCII representation
of the WebUI (running on port
Awesome, thanks Pawan – for now I’ll give spark-notebook a go until
Zeppelin catches up to Spark 1.4 (and when Zeppelin has a binary release –
my PC doesn’t seem too happy about building a Node.js app from source).
Thanks for the detailed instructions!!
*From:* pawan kumar
I have set up a small standalone cluster: 5 nodes, every node has 5GB of
memory and 8 cores. As you can see, a node doesn't have much RAM.
I have 2 streaming apps; the first one is configured to use 3GB of memory
per node and the second one uses 2GB per node.
My problem is that the smaller app could easily run
Use *spark.cores.max* to limit the CPU per job, then you can easily
accommodate your third job also.
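For example (values illustrative):

    import org.apache.spark.SparkConf
    // in the application, before creating the SparkContext:
    val conf = new SparkConf().set("spark.cores.max", "6")
    // or on the command line:
    //   spark-submit --conf spark.cores.max=6 ...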
Thanks
Best Regards
On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com wrote:
I have set up a small standalone cluster: 5 nodes, every node has 5GB of
memory and 8 cores. As you
I am calculating the input rate using the following logic.
And i think this foreachRDD is always running on the driver (printlns are
seen on the driver).
1 - Is there any other way to do this at a lower cost?
2 - Will this give me the correct count for the rate?
//code -
inputStream.foreachRDD(new
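A hedged Scala reconstruction of that counting pattern (inputStream is
assumed to be a DStream; the count runs on the cluster, only the println
happens on the driver):

    inputStream.foreachRDD { rdd =>
      val n = rdd.count()   // distributed count; only the number returns
      println(s"batch count: $n at ${System.currentTimeMillis}")
    }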
Thanks a lot,
because i just want to log the timestamp and a unique message id, not the
full RDD.
On Tue, Jun 23, 2015 at 12:41 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Why don't you do a normal .saveAsTextFiles?
Thanks
Best Regards
On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla
Yes, and typical needs are 100ms. Now imagine even 10 concurrent
requests. My experience has been that this approach won't come close to
scaling. The best you could probably do is async mini-batch
near-real-time scoring, pushing results to some store for retrieval,
which could be entirely suitable for
Thanks for the suggestions everyone, appreciate the advice.
I tried replacing DISTINCT for the nested GROUP BY, running on 1.4 instead
of 1.3, replacing the date casts with a between operation on the
corresponding long constants instead and changing COUNT(*) to COUNT(1).
None of these seem to
On 23 Jun 2015, at 00:09, Danny kont...@dannylinden.de wrote:
hi,
have you tested
s3://ww-sandbox/name_of_path/ instead of s3://ww-sandbox/name_of_path
+ make sure the bucket is there already. Hadoop s3 clients don't currently
handle that step
or have you tested adding your file
Probably your application crashed or was terminated without invoking the
stop method of the spark context - in such cases it doesn't create the empty
flag file which apparently tells the history server that it can safely show
the log data - simply go to some of the other dirs of the history server
https://spark.apache.org/docs/latest/streaming-flume-integration.html
Yep, avro sink is the correct one.
--
Ruslan Dautkhanov
On Tue, Jun 23, 2015 at 9:46 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi!
I want to integrate flume with spark streaming. I want to know which sink
Check the available resources you have (CPU cores, memory) on the master
web UI.
The log you see means the job can't get any resources.
On Wed, Jun 24, 2015 at 5:03 AM, Nizan Grauer ni...@windward.eu wrote:
I have 30G per machine.
This is the first (and only) job I'm trying to submit. So
One example is when you'd like to set up a jdbc connection for each partition
and share this connection across the records.
mapPartitions is much like the paradigm of a mapper in mapreduce. In the
mapper of mapreduce, you have a setup method to do any initialization stuff
before processing the
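A minimal Scala sketch of that pattern (the JDBC URL, query, and idRdd are
illustrative assumptions):

    import java.sql.DriverManager

    val labels = idRdd.mapPartitions { ids =>
      // one connection per partition, reused for every record in it
      val conn = DriverManager.getConnection("jdbc:postgresql://host/db")
      val stmt = conn.prepareStatement("SELECT label FROM lookup WHERE id = ?")
      val out = ids.map { id =>
        stmt.setLong(1, id)
        val rs = stmt.executeQuery()
        if (rs.next()) rs.getString(1) else "unknown"
      }.toList              // materialize before closing the connection
      stmt.close(); conn.close()
      out.iterator
    }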
I don't think this is the correct question. Spark can be deployed on
different cluster manager frameworks like standalone, yarn, and mesos.
Spark can't run without these cluster manager frameworks; that means spark
depends on a cluster manager framework.
And the data management layer is the upstream
Why do you want it to wait to start until all the resources are ready?
Making it start as early as possible should make it complete earlier and
increase the utilization of resources.
On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra arun.lut...@gmail.com wrote:
Sometimes if my Hortonworks yarn-enabled cluster
I don't think there is yarn-related stuff to access in spark. Spark doesn't
depend on yarn.
BTW, why do you want the yarn application id?
On Mon, Jun 22, 2015 at 11:45 PM, roy rp...@njit.edu wrote:
Hi,
Is there a way to get the Yarn application ID inside a spark application,
when running spark
Please run it in your own application and not in the spark shell. I see
that you are trying to stop the Spark context and create a new
StreamingContext. That will lead to unexpected issues, like the one you are
seeing.
Please make a standalone SBT/Maven app for Spark Streaming.
On Tue, Jun 23, 2015 at
If you change it to ```val numbers2 = numbers```, then it has the same problem.
On Tue, Jun 23, 2015 at 2:54 PM, Ignacio Blasco elnopin...@gmail.com wrote:
It seems that it doesn't happen in the Scala API. Not exactly the same as in
python, but pretty close.
I think one of the primary cases where mapPartitions is useful is when you
are going to be doing setup work that can be re-used while processing
each element; this way the setup work only needs to be done once per
partition (for example, creating an instance of jodatime).
Both map and
A rough estimate of the worst case memory requirement for driver is
about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put 2 at
the beginning because the previous centers are still in memory while
receiving new center updates. -Xiangrui
On Fri, Jun 19, 2015 at 9:02 AM, Rogers Jeffrey
This is on the wish list for Spark 1.5. Assuming that the items within
the same transaction are distinct, we can still follow FP-Growth's
steps:
1. find frequent items
2. filter transactions and keep only frequent items
3. do NOT order by frequency
4. use suffix to partition the transactions
Hi Kevin, I never did. I checked for free space in the root partition; I
don't think this was an issue. Now that 1.4 is officially out I'll probably
give it another shot.
On Jun 22, 2015 4:28 PM, Kevin Markey kevin.mar...@oracle.com wrote:
Matt: Did you ever resolve this issue? When running on a
It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
more information to guess what happened:
1. Could you share the ALS settings, e.g., number of blocks, rank and
number of iterations, as well as number of users/items in your
dataset?
2. If you monitor the progress in the WebUI,
Please check the input path to your test data, and call `.count()` and
see whether there are records in it. -Xiangrui
On Sat, Jun 20, 2015 at 9:23 PM, Gavin Yue yue.yuany...@gmail.com wrote:
Hey,
I am testing the StreamingLinearRegressionWithSGD following the tutorial.
It works, but I could
We have multinomial logistic regression implemented. For your case,
the model size is 500 * 300,000 = 150,000,000. MLlib's implementation
might not be able to handle it efficiently, we plan to have a more
scalable implementation in 1.5. However, it shouldn't give you an
array larger than MaxInt
Please see the current version of code for better documentation.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
Sincerely,
DB Tsai
--
Blog: https://www.dbtsai.com
PGP
I wrote a brief howto on building nested records in spark and storing them
in parquet here:
http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/
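For a flavor of the approach, a minimal Scala sketch (case classes and paths
are illustrative, not taken from the post): nested case classes become
nested groups in the Parquet schema.

    case class Page(url: String, hits: Int)
    case class UserRecord(user: String, pages: Seq[Page])

    val df = sqlContext.createDataFrame(Seq(
      UserRecord("alice", Seq(Page("/a", 3), Page("/b", 1)))))
    df.write.parquet("/tmp/nested")   // written as a nested Parquet file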
2015-06-23 16:12 GMT-07:00 Richard Catlin richard.m.cat...@gmail.com:
How do I create a DataFrame(SchemaRDD) with a nested array of Rows
Creating millions of temporary (immutable) objects is bad for
performance. It should be simple to do a micro-benchmark locally.
-Xiangrui
On Mon, Jun 22, 2015 at 7:25 PM, mzeltser mzelt...@gmail.com wrote:
Using StatCounter as an example, I'd like to understand if pure functional
implementation
Hi DB Tsai,
Thanks for your reply. I went through the source code of
LinearRegression.scala. The algorithm minimizes the squared error
L = 1/(2n) * ||A * weights - y||^2. I cannot match this with the elasticNet
loss function
found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which
is the
yes, I have two clusters, one standalone and another using Mesos
Sebastian YEPES
http://sebastian-yepes.com
On Wed, Jun 24, 2015 at 12:37 AM, drarse [via Apache Spark User List]
ml-node+s1001560n23457...@n3.nabble.com wrote:
Hi syepes,
Are you running the application in standalone mode?
Regards
How do I create a DataFrame (SchemaRDD) with a nested array of Rows in a
column? Is there an example? Will this be stored as a nested parquet file?
Thanks.
Richard Catlin
Thanks DB Tsai, it is very helpful.
Cheers,
Wei
2015-06-23 16:00 GMT-07:00 DB Tsai dbt...@dbtsai.com:
Please see the current version of code for better documentation.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
I have a Spark job that has 7 stages. The first 3 stages complete and the
fourth stage begins (joining two RDDs). This stage has multiple task
failures, all with the below exception.
Multiple tasks (100s of them) get the same exception on different hosts.
How can all the hosts suddenly stop responding
Hi syepes,
Are you running the application in standalone mode?
Regards
On 23/06/2015 22:48, syepes [via Apache Spark User List]
ml-node+s1001560n23456...@n3.nabble.com wrote:
Hello,
I am trying to use the new Kafka consumer
KafkaUtils.createDirectStream but I am having some issues making it
The regularization is handled after the objective function of data is
computed. See
https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
line 348 for L2.
For L1, it's handled by OWLQN, so you
I know when to use map(), but when should I use mapPartitions()?
Which is faster?
--
Deepak