Again, by Storm, you mean Storm Trident, correct?
On Wednesday, 17 June 2015 10:09 PM, Michael Segel
msegel_had...@hotmail.com wrote:
Actually the reverse.
Spark Streaming is really a micro-batch system where the smallest window is 1/2
a second (500ms). So for CEP, it's not really a
I don't think this is the same issue as it works just fine in pyspark
v1.3.1.
Are you aware of any workaround? I was hoping to start testing one of my
apps in Spark 1.4 and I use the CSV exports as a safety valve to easily
debug my data flow.
-Don
On Sun, Jun 14, 2015 at 7:18 PM, Burak Yavuz
I am trying to run a hive query from Spark code using a HiveContext object.
It was running fine earlier, but since Apache Sentry has been installed
the process is failing with this exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn,
Hi,
I am running this in Spark standalone mode. I find that when I examine the
web UI, a couple of bugs arise:
1. There is a discrepancy between the number denoting the duration of the
application when I run the history server and the number given by the web UI
(default address is master:8080). I
Is it possible to achieve serial batching with Spark Streaming?
Example:
I configure the Streaming Context for creating a batch every 3 seconds.
Processing of the batch #2 takes longer than 3 seconds and creates a
backlog of batches:
batch #1 takes 2s
batch #2 takes 10s
batch #3 takes 2s
batch
Hi John,
Did you also set spark.sql.planner.externalSort to true? You will probably
not see executors lost with this conf. For now, maybe you can manually split
the query into two parts, one for skewed keys and one for other records.
Then, you union the results of these two parts together.
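For illustration, a rough RDD-level sketch of that split-and-union idea (the
sample data, skewed key set, and aggregation below are placeholders, not your
actual query):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("skew-split"))

// Placeholder input and known-hot keys -- substitute your real data and keys.
val data: RDD[(String, Int)] = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("cold", 3)))
val skewedKeys = Set("hot")

// The same aggregation applied to both halves (here, a simple sum per key).
def aggregate(rdd: RDD[(String, Int)]): RDD[(String, Int)] = rdd.reduceByKey(_ + _)

val skewedPart = data.filter { case (k, _) => skewedKeys.contains(k) }
val otherPart  = data.filter { case (k, _) => !skewedKeys.contains(k) }

// Union the results of the two parts back together.
val result = aggregate(skewedPart).union(aggregate(otherPart))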
Thanks,
So, the second attempt of those tasks that failed with an NPE can complete and
the job eventually finished?
On Mon, Jun 15, 2015 at 10:37 PM, Night Wolf nightwolf...@gmail.com wrote:
Hey Yin,
Thanks for the link to the JIRA. I'll add details to it. But I'm able to
reproduce it, at least in the same
Hello shreesh,
That would be quite a challenge to understand.
A few things that I think should help estimate those numbers:
1) Understanding the cost of the individual transformations in the
application
E.g. a flatMap can be more expensive in memory than a map
2) The communication
OK, what was wrong was that spark-env did not have
HADOOP_CONF_DIR properly set to /etc/hadoop/conf/.
With that fixed, this issue is gone, but I can't seem to get Spark SQL
1.4.0 with Hive working on CDH 5.3 or 5.4:
Using this command line :
IPYTHON=1
This documentation is only for writes to an external system, but all the
counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow
to keep track of a running count) is exactly-once. When you write to a storage
system, no matter which streaming framework you use, you'll
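As a small illustration of the in-app counting case, a running count kept with
reduceByKeyAndWindow needs no extra plumbing (a minimal sketch, assuming a
hypothetical socket source of text lines):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("running-counts"), Seconds(3))
ssc.checkpoint("checkpoint")  // required for the inverse-function variant below

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Word counts over a 30s window sliding every 3s; these counts are maintained
// exactly-once inside the streaming app.
val counts = words.map(w => (w, 1L)).reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,  // add counts entering the window
  (a: Long, b: Long) => a - b,  // subtract counts leaving the window
  Seconds(30), Seconds(3))
counts.print()

ssc.start()
ssc.awaitTermination()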
For 1)
In standalone mode, you can increase the worker's resource allocation in
their local conf/spark-env.sh with the following variables:
SPARK_WORKER_CORES,
SPARK_WORKER_MEMORY
At application submit time, you can tune the number of resources allocated to
executors with --executor-cores and
Also, still for 1), in conf/spark-defaults.conf, you can give the following
arguments to tune the driver's resources:
spark.driver.cores
spark.driver.memory
Not sure if you can pass them at submit time, but it should be possible.
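As an aside, the executor-side settings can also be set from the application
itself via SparkConf (a sketch with placeholder values; spark.driver.memory is
the exception, since the driver JVM is already running by then, so set it in
spark-defaults.conf or with --driver-memory):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("resource-tuning-example")
  .set("spark.executor.memory", "8g")   // memory per executor
  .set("spark.cores.max", "16")         // cap on total cores on a standalone cluster

val sc = new SparkContext(conf)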
Hi Matei,
Ah, can't get more accurate than from the horse's mouth... If you don't
mind helping me understand it correctly..
From what I understand, Storm Trident does the following (when used with
Kafka):
1) Sit on Kafka Spout and create batches
2) Assign global sequential ID to the batches
3)
The major difference is that in Spark Streaming, there's no *need* for a
TridentState for state inside your computation. All the stateful operations
(reduceByWindow, updateStateByKey, etc) automatically handle exactly-once
processing, keeping updates in order, etc. Also, you don't need to run a
Hi! I would like to know what is the difference between the following
transformations when they are executed right before writing RDD to a file?
1. coalesce(1, shuffle = true)
2. coalesce(1, shuffle = false)
Code example:
val input = sc.textFile(inputFile)
val filtered =
Hi Gajan,
Please subscribe to our user mailing list, which is the best place to get
your questions answered. We don't have weighted instance support, but
it should be easy to add and we plan to do it in the next release
(1.5). Thanks for asking!
Best,
Xiangrui
On Wed, Jun 17, 2015 at 2:33 PM,
What's the size of this table? Is the data skewed (so that speculation
is probably triggered)?
Cheng
On 6/15/15 10:37 PM, Night Wolf wrote:
Hey Yin,
Thanks for the link to the JIRA. I'll add details to it. But I'm able
to reproduce it, at least in the same shell session, every time I do a
Does increasing executor memory fix the memory problem?
How many columns does the schema contain? Parquet can be super memory
consuming when writing wide tables.
Cheng
On 6/15/15 5:48 AM, Bipin Nag wrote:
Hi Davies,
I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and
save
We don't have R-like model summary in MLlib, but we plan to add some
in 1.5. Please watch https://issues.apache.org/jira/browse/SPARK-7674.
-Xiangrui
On Thu, May 28, 2015 at 3:47 PM, rafac rafaelme...@hotmail.com wrote:
I have a simple problem:
I got the mean number of people at one place by
Hi, here's how to get a parallel search pipeline:
package org.apache.spark.ml.pipeline
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql._
class ParralelGridSearchPipeline extends Pipeline {
override def fit(dataset:
Hi,
I'm running Spark-1.4.0 on Mesos. I have been trying to read a file from a
MapR cluster but have not had much success with it. I tried 2 versions of
Apache Spark (with and without Hadoop).
I can get to the spark-shell in the with-hadoop version, but still
can't access maprfs[2]. Without-Hadoop
That is a bug, which will be fixed in
https://github.com/apache/spark/pull/6622. I disabled Model.copy
because models usually don't have a default constructor, and hence
the default Params.copy implementation won't work. Unfortunately, due
to insufficient test coverage, StringIndexModel.copy is
There is no plan at this time. We haven't reached 100% coverage on
user-facing API in PySpark yet, which would have higher priority.
-Xiangrui
On Sun, Jun 7, 2015 at 1:42 AM, martingoodson martingood...@gmail.com wrote:
Am I right in thinking that Python mllib does not contain the optimization
We don't have it in MLlib. The closest would be the ChiSqSelector,
which works for categorical data. -Xiangrui
On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov dautkha...@gmail.com wrote:
What would be closest equivalent in MLLib to Oracle Data Miner's Attribute
Importance mining function?
LabeledPoint is used for both classification and regression, where the label
type is Double for simplicity. So in BinaryClassificationMetrics, we still
use Double for labels. We compute the confusion matrix at each threshold
internally, but this is not exposed to users (
You need to build the Spark assembly with your modification and deploy it
to the cluster.
Sincerely,
DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar raghav0110...@gmail.com wrote:
Hi all,
Is there any way of running a Spark job programmatically on a YARN cluster
without using the spark-submit script?
I cannot include Spark jars in my Java application (due to dependency
conflicts and other reasons), so I'll be shipping the Spark assembly uber jar
(spark-assembly-1.3.1-hadoop2.3.0.jar)
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira h...@inesctec.pt wrote:
Hi,
I am currently experimenting with linear regression (SGD) (Spark + MLlib,
ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
do this (for now) by an exhaustive grid search of the step size and
Please follow the code examples from the user guide:
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark.
-Xiangrui
On Tue, May 26, 2015 at 12:34 AM, Yasemin Kaya godo...@gmail.com wrote:
Hi,
In CF
String path = "data/mllib/als/test.data";
JavaRDD<String>
Try to grant read/execute access through Sentry.
On 18 Jun 2015 05:47, Nitin kak nitinkak...@gmail.com wrote:
I am trying to run a hive query from Spark code using a HiveContext object.
It was running fine earlier, but since Apache Sentry has been installed
the process is failing with
In 1.3, we added some model save/load support in Parquet format. You
can use Parquet's C++ library (https://github.com/Parquet/parquet-cpp)
to load the data back. -Xiangrui
On Wed, Jun 10, 2015 at 12:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Hope Swig and JNA might help for accessing
Hi Matt,
If you place your jars on HDFS in a public location, YARN will cache them
on each node after the first download. You can also use the
spark.executor.extraClassPath config to point to them.
-Sandy
On Wed, Jun 17, 2015 at 4:47 PM, Sweeney, Matt mswee...@fourv.com wrote:
Hi folks,
With Sentry, only the hive user has read/write/execute permission on
the subdirectories of the warehouse. All users get translated to hive
when interacting with HiveServer2. But I think HiveContext is bypassing
HiveServer2.
On Wednesday, June 17, 2015, ayan guha guha.a...@gmail.com wrote:
I’ve implemented this in the suggested manner. When I build Spark and attach
the new spark-core jar to my eclipse project, I am able to use the new method.
In order to conduct the experiments I need to launch my app on a cluster. I am
using EC2. When I setup my master and slaves using the EC2
I'm running Spark on Amazon EMR using their install bootstrap.
My Scala code had println("message here") statements - where can I find the
output of these statements?
I changed my code to use log4j - my log.info and log.error output is nowhere to
be found.
I've checked /mnt/var/log/hadoop/steps
Because we don't have random access to the records, sampling still needs
to go through the records sequentially. It does save some computation,
which is perhaps noticeable only if you have data cached in memory.
Different random seeds are used for trees. -Xiangrui
On Wed, Jun 3, 2015 at 4:40 PM,
Hi Hafiz,
As Ewan mentioned, the path is the path to the S3 files unloaded from
Redshift. This is a more scalable way to get a large amount of data
from Redshift than via JDBC. I'd recommend using the SQL API instead
of the Hadoop API (https://github.com/databricks/spark-redshift).
Best,
Yes. You can apply HashingTF on your input stream and then use
StreamingKMeans for training and prediction. -Xiangrui
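A minimal sketch of that pipeline (assuming a hypothetical socket source where
each line is one document):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("streaming-kmeans"), Seconds(5))

// Hash each incoming document into a fixed-size term-frequency vector.
val tf = new HashingTF(1000)
val docs = ssc.socketTextStream("localhost", 9999).map(_.split(" ").toSeq)
val vectors = docs.map(doc => tf.transform(doc))

// Train a streaming k-means model on the hashed vectors and predict cluster ids.
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(1000, 0.0)

model.trainOn(vectors)
model.predictOn(vectors).print()

ssc.start()
ssc.awaitTermination()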
On Mon, Jun 8, 2015 at 11:05 AM, Ruslan Dautkhanov dautkha...@gmail.com wrote:
Hello,
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
would Feature
Hi,
I'm running Spark on Mesos and trying to read a file from a MapR cluster but
have not had much success with that. I tried 2 versions of Apache Spark (with and
without Hadoop).
I can get to the spark-shell in the with-hadoop version, but still can't
access maprfs. Without-Hadoop version bails out with
You can try hashing to control the feature dimension. MLlib's k-means
implementation can handle sparse data efficiently if the number of
features is not huge. -Xiangrui
On Tue, Jun 16, 2015 at 2:44 PM, Rex X dnsr...@gmail.com wrote:
Hi Sujit,
That's a good point. But 1-hot encoding will make
That sounds like a bug. Could you create a JIRA and ping Yin Huai
(cc'ed). -Xiangrui
On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I am trying out 1.4.0 and noticed there are some differences in behavior with
Timestamp between 1.3.1 and 1.4.0.
In 1.3.1, I
Hi folks,
I’m looking to deploy spark on YARN and I have read through the docs
(https://spark.apache.org/docs/latest/running-on-yarn.html). One question that
I still have is if there is an alternate means of including your own app jars
as opposed to the process in the “Adding Other Jars”
This is implemented in MLlib:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41.
-Xiangrui
On Wed, Jun 10, 2015 at 1:53 PM, erisa erisa...@gmail.com wrote:
Hi,
I am a Spark newbie, and trying to solve the same problem, and
not being able to read from Kafka using multiple nodes
Kafka is plenty capable of doing this, by clustering together multiple
consumer instances into a consumer group.
If your topic is sufficiently partitioned, the consumer group can consume
the topic in a parallelized fashion.
If it isn't, you
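In Spark Streaming terms, with the receiver-based Kafka API that parallelism
looks roughly like this (a sketch; the ZooKeeper quorum, group id, topic name,
and receiver count are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-group"), Seconds(2))

// Four receivers joining the same consumer group, so a sufficiently partitioned
// topic is split across them and consumed in parallel.
val streams = (1 to 4).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))
}
val unified = ssc.union(streams)

unified.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()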
You should add spark-mllib_2.10 as a dependency instead of declaring
it as the artifactId. And always use the same version for spark-core
and spark-mllib. I saw you used 1.3.0 for spark-core but 1.4.0 for
spark-mllib, which is not guaranteed to work. If you set the scope to
provided, mllib jar
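For what it's worth, the sbt equivalent of that setup (same version for both
artifacts, both marked provided) would look roughly like:

// build.sbt -- a sketch; pick whichever single Spark version you standardize on
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided"
)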
With Sentry, only the hive user has read/write/execute permission on
the subdirectories of the warehouse. All users get translated to hive
when interacting with HiveServer2. But I think HiveContext is bypassing
HiveServer2.
On Wednesday, June 17, 2015, ayan guha guha.a...@gmail.com wrote:
all of them.
Sincerely,
DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
On Wed, Jun 17, 2015 at 5:15 PM, Raghav Shankar raghav0110...@gmail.com wrote:
So, I would add the assembly jar to just the master, or would I have
So, I would add the assembly jar to just the master, or would I have to add
it to all the slaves/workers too?
Thanks,
Raghav
On Jun 17, 2015, at 5:13 PM, DB Tsai dbt...@dbtsai.com wrote:
You need to build the Spark assembly with your modification and deploy it
to the cluster.
Sincerely,
So I've seen in the documentation that (after the overhead memory is
subtracted), the memory allocations of each executor are as follows (assume
default settings):
60% for cache
40% for tasks to process data
Reading about how Spark implements shuffling, I've also seen it say 20% of
executor
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and redundant
shipping of essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again
This is a
Thanks for reporting this. Would you mind helping create a JIRA for this?
On 6/16/15 2:25 AM, patcharee wrote:
I found if I move the partitioned columns in schemaString and in Row
to the end of the sequence, then it works correctly...
On 16. juni 2015 11:14, patcharee wrote:
Hi,
I am
I am trying to run a hive query from Spark code using a HiveContext object. It
was running fine earlier, but since Apache Sentry has been installed
the process is failing with this exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn,
The default behavior should be that batch X + 1 starts processing only
after batch X completes. If you are using Spark 1.4.0, could you show us a
screenshot of the streaming tab, especially the list of batches? And could
you also tell us if you are setting any SparkConf configurations?
On Wed,
To add more information beyond what Matei said and answer the original
question, here are other things to consider when comparing between Spark
Streaming and Storm.
* Unified programming model and semantics - On most occasions you have to
process the same data again in batch jobs. If you have two
Hi there!
It seems like you have Read/Execute access permission (and no
update/insert/delete access). What operation are you performing?
Ajay
On Jun 17, 2015, at 5:24 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am trying to run a hive query from Spark code using a HiveContext object. It
This looks really awesome.
On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie jie.hu...@intel.com wrote:
Hi All
We are happy to announce Performance portal for Apache Spark
http://01org.github.io/sparkscore/ !
The Performance Portal for Apache Spark provides performance data on the
Spark
This is not an independent programmatic way of running a Spark job on a YARN
cluster.
That example demonstrates running in *Yarn-client* mode, and it will also be
dependent on Jetty. Users writing Spark programs do not want to depend on
that.
I found the SparkLauncher class introduced in the Spark 1.4 release
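For reference, a minimal sketch of using it (paths, class names, and arguments
below are placeholders; note that SparkLauncher itself spawns a spark-submit
child process under the hood):

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                       // placeholder Spark home
  .setAppResource("/path/to/my-app-assembly.jar")   // placeholder app jar
  .setMainClass("com.example.MyApp")                // placeholder main class
  .setMaster("yarn-cluster")
  .addAppArgs("arg1")
  .launch()                                         // returns a java.lang.Process

process.waitFor()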
OK, solved. Looks like breathing the Spark Summit SFO air for 3 days helped
a lot!
Piping the 7 million records to local disk still runs out of memory, so I piped
the results into another Hive table. I can live with that :-)
/opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e use aers; create
Hi,
Looking at my executor logs, the submitted application jar is
transmitted to each executor?
Why does Spark do the above? To my understanding, the tasks to be run
are already serialized with TaskDescription.
Regards.
-
Hi,
Is there any way in Spark Streaming to keep data across multiple
micro-batches? Like in a HashMap or something?
Can anyone make suggestions on how to keep data across iterations where
each iteration is an RDD being processed in JavaDStream?
This is especially the case when I am trying to
Hi,
Is there any way in Spark Streaming to keep data across multiple
micro-batches? Like in a HashMap or something?
Can anyone make suggestions on how to keep data across iterations where
each iteration is an RDD being processed in JavaDStream?
This is especially the case when I am trying to
I am having trouble using a UDF on a column of Vectors in PySpark which can
be illustrated here:
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors
FeatureRow =
Hello;
I am trying to get predictions after running the ALS model.
The model works fine. In the prediction/recommendation, I have about
30,000 products and 90 million users.
When I try to predict all, it fails.
I have been trying to formulate the problem as a Matrix multiplication where
I
Nick is right. I too have implemented it this way and it works just fine. In
my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would
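A rough sketch of that block-broadcast approach (names are illustrative; looping
over product blocks and merging the per-block top-k lists per user is left out
for brevity):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def scoreBlock(
    model: MatrixFactorizationModel,
    productBlock: Array[(Int, Array[Double])],
    k: Int) = {
  // Broadcast one block of product factors to all partitions of user factors.
  val block = model.userFeatures.context.broadcast(productBlock)
  model.userFeatures.mapPartitions { users =>
    users.map { case (userId, uFeat) =>
      val topK = block.value
        .map { case (productId, pFeat) =>
          // Plain dot product; a real implementation would use a BLAS call here.
          (productId, uFeat.zip(pFeat).map { case (a, b) => a * b }.sum)
        }
        .sortBy(-_._2)
        .take(k)
      (userId, topK)
    }
  }
}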
Can someone help? Thank you!
From: Haopu Wang
Sent: Monday, June 15, 2015 3:36 PM
To: user; d...@spark.apache.org
Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125
I use the attached program to test checkpoint. It's quite simple.
When I run
Done.
https://issues.apache.org/jira/browse/SPARK-8420
Justin
On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:
That sounds like a bug. Could you create a JIRA and ping Yin Huai
(cc'ed). -Xiangrui
On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io
Hi, just answered in your other thread as well...
Depending on your requirements, you can look at the updateStateByKey API
From: Nipun Arora
Date: Wednesday, June 17, 2015 at 10:51 PM
To: user@spark.apache.org
Subject: Iterative Programming by keeping data across
Depending on your requirements, you can look at the updateStateByKey API
From: Nipun Arora
Date: Wednesday, June 17, 2015 at 10:48 PM
To: user@spark.apache.org
Subject: no subject
Hi,
Is there any way in Spark Streaming to keep data across multiple micro-batches?
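As a minimal sketch of that API (assuming a hypothetical comma-separated
key,value socket feed), a running sum per key carried across micro-batches
would look like:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("state"), Seconds(2))
ssc.checkpoint("checkpoint")  // updateStateByKey requires checkpointing

val events = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(parts => (parts(0), parts(1).toLong))

// The Option is None the first time a key is seen, Some(previous) afterwards.
val running = events.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)
}
running.print()

ssc.start()
ssc.awaitTermination()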
Sean,
Spark on YARN (https://spark.apache.org/docs/latest/running-on-yarn.html)
follows the logging construct of YARN.
If you are using cluster deployment mode on YARN (master=yarn-cluster), then the
logging performed in the driver (your code) would be picked up by YARN’s logs
in the
I actually talk about this exact thing in a blog post here:
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
Keep in mind, you're actually doing a ton of math. Even with proper caching
and use of broadcast variables this will
I have an RDD which is a list of lists,
and another RDD which is a list of pairs.
There are no duplicates in the inner lists of the first RDD and
no duplicates in the pairs from the second RDD.
I am trying to check if any pair of the second RDD is present in any list of
I have multiple input paths which each contain data that need to be mapped
in a slightly different way into a common data structure. My approach boils
down to:
RDD<T> rdd = null;
for (Configuration conf : configurations) {
RDD<T> nextRdd = loadFromConfiguration(conf);
rdd = (rdd == null) ?
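A compact Scala rendering of the same fold-with-union pattern (the configuration
type and loader function are stand-ins for whatever does the per-source mapping
in your code):

import org.apache.spark.rdd.RDD

// Assumes at least one configuration; each loader maps its source into the
// common element type T, and the results are unioned into a single RDD.
def loadAll[C, T](configurations: Seq[C])(load: C => RDD[T]): RDD[T] =
  configurations.map(load).reduce((a, b) => a.union(b))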
Hi guys,
Running with a Parquet-backed table in Hive 'dim_promo_date_curr_p' which has
the following data:
scala> sqlContext.sql("select * from pz.dim_promo_date_curr_p").show(3)
15/06/18 00:53:21 INFO ParseDriver: Parsing command: select * from
pz.dim_promo_date_curr_p
15/06/18 00:53:21 INFO
Hi,
I have a spark streaming program running for ~ 25hrs. When I check the
Streaming UI tab. I found the “Waiting batches” is 144. But the “scheduling
delay” is 0. I am a bit confused.
If the “waiting batches” is 144, does that mean many batches are waiting in the
queue to be processed? If this is
Hi Silvio,
Thanks for your response.
I should clarify. I would like to do updates on a structure iteratively. I
am not sure if updateStateByKey meets my criteria.
In the current situation, I can run some map reduce tasks and generate a
JavaPairDStream<Key, Value>, after this my algorithm is
Hi Elkhan,
There are a couple of ways to do this.
1) Spark-jobserver is a popular web server that is used to submit spark jobs.
https://github.com/spark-jobserver/spark-jobserver
2) Spark-submit script sets the classpath for the job. Bypassing
TaskDescription only serializes the jar path, not the jar content. Multiple
tasks can run on the same executor. The executor will check whether the jar has
been fetched each time a task is launched. If so, it won't fetch it
again.
Serializing only the jar path prevents serializing the jar multiple times
An example of being able to do this is provided in the Spark Jetty Server
project [1]
[1] https://github.com/calrissian/spark-jetty-server
On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi all,
Is there any way running Spark job in programmatic way on Yarn
I have an RDD with more than 1000 elements. I have to form the combinations
of elements.
I tried to use the cartesian transformation and then filter them, but it is
failing with an EOF error. Is there any other way to do the same using
partitions?
I am using PySpark.
One issue is that you broadcast the product vectors and then do a dot product
one-by-one with the user vector.
You should try forming a matrix of the item vectors and doing the dot product
as a matrix-vector multiply which will make things a lot faster.
Another optimisation that is
Thank you Xiangrui.
Oracle's attribute importance mining function has a target variable.
Attribute importance is a supervised function that ranks attributes
according to their significance in predicting a target.
MLlib's ChiSqSelector does not have a target variable.
--
Ruslan Dautkhanov
My use case is below.
We are going to receive a lot of events as a stream (basically a Kafka stream)
and then we need to process and compute.
Consider you have a phone contract with ATT, and every call / sms / data
usage you do is an event; it then needs to calculate your bill on a real-time
basis, so
I think
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
might shed some light on the behaviour you’re seeing.
Mark
From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermedate stage will be cached automatically ?
Here's
version
We are on DSE 4.7. (Cassandra 2.1) and spark 1.2.1
cqlsh
select * from site_users
returns fast, subsecond, only 3 rows
Can you show some code for how you're doing the reads?
dse beeline
!connect ...
select * from site_users
--table has 3 rows, several columns in each row. Spark runs 769
I am a student of telecommunications engineering and this year I worked with
Spark. It is a world that I like, and I want to know whether there are jobs in
this area.
Thanks for all
Regards
1.4.0 resolves the problem.
The total number of classes loaded for an updateStateByKey over Int and String
types does not increase.
The total number of classes loaded for an updateStateByKey over case classes
does increase over time, but
the processing remains stable. Both memory consumption and CPU load remain
Can you show some code for how you're doing the reads? Have you successfully
read other stuff from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues or are you trying to
figure out the right way to do a read).
What version of Spark and
Can you try repartitioning the RDD after creating the (K, V) pairs? And also,
when calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too.
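A small sketch of both suggestions together (key/value types and the partition
count are placeholders):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("join-partitions"))

// Stand-ins for RDD A and RDD B from the question.
val rddA = sc.parallelize(Seq((1, "a"), (2, "b")))
val rddB = sc.parallelize(Seq((1, "x"), (3, "y")))

val numPartitions = 200

// Repartition after building the (K, V) pairs, then pass an explicit partition
// count to the join so its shuffle uses that many partitions.
val partitionedA = rddA.partitionBy(new HashPartitioner(numPartitions))
val joined = partitionedA.join(rddB, numPartitions)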
Thanks
Best Regards
On Wed, Jun 17, 2015 at 12:15 PM, Al M alasdair.mcbr...@gmail.com wrote:
I have 2 RDDs I want to Join. We will call them RDD A and RDD
Hi Nathan,
Thanks a lot for the detailed report, especially the information about
non-consecutive part numbers. It's confirmed to be a race condition bug,
and I just filed https://issues.apache.org/jira/browse/SPARK-8406 to track
this. I will deliver a fix ASAP, and it will be included in 1.4.1.
I have finished training a MatrixFactorizationModel, and I want to load this
model in Spark Streaming.
I think it should work, but actually it does not. I don't know why; who can
help me?
I wrote code like this:
val ssc = new StreamingContext ...
val bestModel =
Whatever you write in bolts would be the logic you want to apply on your
events. In Spark, that logic would be coded in map() or similar such
transformations and/or actions. Spark doesn't enforce a structure for
capturing your processing logic like Storm does.
Regards
Sab
Probably overloading the
Thanks Himanshu and RahulKumar!
The databricks forum post was extremely useful. It is great to see an
article that clearly details how and when shuffles are cleaned up.
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (the receiver rate limit can alleviate this to a
certain point, but it's far from ideal)
There is also no exactly-once semantics. (updateStateByKey can achieve
Hi, all
I have a question about Spark accessing HBase in yarn-cluster mode on a
Kerberos YARN cluster. Is distributing the keytab to each NodeManager the only
way to enable Spark to access HBase?
It seems that Spark doesn't provide a delegation token like an MR job does, am
I right?
When you say Storm, did you mean Storm with Trident or Storm?
My use case does not have simple transformations. There are complex events that
need to be generated by joining the incoming event streams.
Also, what do you mean by no back pressure?
On Wednesday, 17 June 2015 11:57 AM, Enno
Is there any good sample code in Java to implement *Implementing and
Using a Custom Actor-based Receiver*?
--
Thanks Regards,
Anshu Shukla
Hey,
I noticed that my code spends hours with `generateTreeString` even though the
actual dag/dataframe execution takes seconds.
I’m running a query that grows exponentially in the number of iterations when
evaluated without caching,
but should be linear when caching previous results.
E.g.
I guess both. In terms of syntax, I was comparing it with Trident.
If you are joining, Spark Streaming actually does offer windowed joins out
of the box. We couldn't use this though, as our event streams can grow
out of sync, so we had to implement something on top of Storm. If your
event streams
Hi, can somebody suggest a way to reduce the number of tasks?
2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com:
Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, and
each of them has a Spark worker.
The problem is that Spark runs 869 tasks to read 3 lines:
Hi,
I downloaded the source from the Downloads page and ran the
make-distribution.sh script.
# ./make-distribution.sh --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests
clean package
The script has “-x” set at the beginning.
++ /tmp/a/spark-1.4.0/build/mvn help:evaluate