I figured that out, and these are my findings:
- It enters an infinite loop when there's a duplicate partition id.
- It enters an infinite loop when the partition id starts from 1 rather
than 0.
Something like this piece of code can reproduce it: (in getPartitions())
val
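For illustration only, a minimal Scala sketch (hypothetical class and names, not the actual code from this thread) of the kind of getPartitions mismatch described above, where the indices start at 1 instead of 0:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical custom RDD: the indices below should be 0 and 1, but start at 1
// (they could also repeat), which is the mismatch that makes downstream code loop.
class MyPartition(val index: Int) extends Partition

class MyBrokenRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  override protected def getPartitions: Array[Partition] =
    Array(new MyPartition(1), new MyPartition(2)) // bug: should be indices 0 and 1
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}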
Can you look in the worker logs and see what's going wrong?
Thanks
Best Regards
On Wed, Aug 12, 2015 at 9:53 PM, Nupur Kumar (BLOOMBERG/ 731 LEX)
nkumar...@bloomberg.net wrote:
Hello,
I am doing this for the first time so feel free to let me know/forward
this to where it needs to be if not
Hi,
I have a cluster of 1 master and 2 slaves. I'm running a Spark Streaming job on
the master and I want to utilize all nodes in my cluster. I had specified some
parameters like driver memory and executor memory in my code. When I
give --deploy-mode cluster --master yarn-cluster in my spark-submit, it
In case anyone runs into this issue in the future, we got it working: the
following variable must be set on the edge node:
export
PYSPARK_PYTHON=/your/path/to/whatever/python/you/want/to/run/bin/python
I didn't realize that variable gets passed to every worker node. All I saw
when searching for
I've got a Spark application running on a host with a 64-character FQDN. When
running with Spark master local[*] I get the following error. Note, the hostname
should be
ip-10-248-0-177.us-west-2.compute.internaldna.corp.adaptivebiotech.com but the
last 6 characters are missing. The same
You need to add that jar to the classpath. While submitting the job, you
can use --jars, --driver-class-path etc. configurations to add the jar. Apart
from that, if you are running the job as a standalone application, then you
can use sc.addJar to add the jar (which will ship this jar into all the executors).
Please notice that 'jars: null'
I don't know why you put '///', but I would suggest just using normal
absolute paths.
dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld
--jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar
/home/missingmerch/dse.jar
Hi all,
I want to write a Spark Streaming program that listens to Kafka for a list
of topics.
The list of topics that I want to consume is stored in a DB and might
change dynamically. I plan to periodically refresh this list of topics in
the Spark Streaming app.
My question is: is it possible to
Hi,
I am using Spark 1.4 and have run into an issue.
I am trying to use the aggregate function:
JavaRDD<String> rdd = someRdd;
HashMap<Long, TypeA> zeroValue = new HashMap<Long, TypeA>();
// add initial key-value pair for zeroValue
rdd.aggregate(zeroValue,
new
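For comparison, a minimal Scala sketch of the same aggregate shape (the element type, the counting logic and sc being in scope are assumptions made for illustration):

val rdd = sc.parallelize(Seq("a", "bb", "ccc"))
val zeroValue = Map.empty[Long, Int]

val counts = rdd.aggregate(zeroValue)(
  // seqOp: fold one element into the per-partition accumulator
  (acc, s) => acc + (s.length.toLong -> (acc.getOrElse(s.length.toLong, 0) + 1)),
  // combOp: merge the accumulators of two partitions
  (a, b) => (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0) + b.getOrElse(k, 0))).toMap
)
// counts: Map(1 -> 1, 2 -> 1, 3 -> 1)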
-dev
If you want to guarantee the side effects happen you should use foreach or
foreachPartition. A `take`, for example, might only evaluate a subset of
the partitions until it finds enough results.
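A tiny illustration (sendToExternalSystem is a made-up helper):

// `take` may evaluate only a few partitions, so side effects in the map are
// not guaranteed to run for every element:
rdd.map { x => sendToExternalSystem(x); x }.take(10)

// To guarantee the side effect runs everywhere, use foreach / foreachPartition:
rdd.foreachPartition { iter =>
  iter.foreach(sendToExternalSystem)
}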
On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov fathers...@list.ru wrote:
Hi!
I’d like to
Hi,
It worked after removing that line. Thank you for the response and the fix.
Thanks Regards, Meethu M
On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote:
For the record: https://github.com/apache/spark/pull/8147
https://issues.apache.org/jira/browse/SPARK-9916
Yes, this file is available at this path on the same machine where I'm
running Spark. Later I moved the spark-1.4.1 folder to all other machines
in my cluster, but I'm still facing the same issue.
Thanks,
https://in.linkedin.com/in/ramkumarcs31
On Thu, Aug 13, 2015 at 1:17 PM, Akhil Das
Hi Tim,
An option like spark.mesos.executor.max to cap the number of executors per
node/application would be very useful. However, an option like
spark.mesos.executor.num
to specify the desired number of executors per node would provide even
better control.
Thanks,
Ajay
On Wed, Aug
A chain of map and flatmap does not cause any
serialization-deserialization.
On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info
wrote:
Hello everyone,
I am wondering what the effect of serialization is within a stage.
My understanding of Spark as an execution engine is
While submitting the job, you can use --jars, --driver-class-path etc.
configurations to add the jar. Apart from that, if you are running the
job as a standalone application, then you can use the sc.addJar option
to add the jar (which will ship this jar into all the executors)
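For example, a minimal sketch of the standalone-application route (the app name and jar path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("addJarExample")
val sc = new SparkContext(conf)

// Ships the jar to every executor so its classes are available on their classpath.
sc.addJar("/home/missingmerch/postgresql-9.4-1201.jdbc41.jar")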
Regards,
Anish
On
Yes, there certainly is, so long as Eclipse has the right plugins and so on
to run Scala programs. You're really asking two questions: (1) can I use a
modern IDE to develop Spark apps, and (2) can we easily unit test Spark
Streaming apps.
The answer is yes to both...
Regarding your IDE:
I like
I have a pretty complex nested structure with several levels. So in order to
create it I use the SQLContext.createDataFrame method and provide specific Rows
with specific StructTypes, both of which I build myself.
To build a Row I iterate over my values and literally build a Row.
List<Object>
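For reference, a minimal sketch of that pattern (the field names and types are invented; sc and sqlContext are assumed to be in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val innerSchema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", IntegerType)))
val schema = StructType(Seq(
  StructField("flatcolumn", StringType),
  StructField("nested", innerSchema)))

// Rows are built by hand to match the schema, nesting a Row inside a Row.
val rows = Seq(Row("a", Row("x", 1)), Row("b", Row("y", 2)))
val df = sqlContext.createDataFrame(sc.parallelize(rows), schema)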
I am looking to decide what is best for my production grade spark
application(s).
YARN
=
1. YARN supports security. When Spark is run over YARN the communication
between processes can use secure authentication through Kerberos.
2. Spark standalone cluster can only run Spark jobs and
Hi,
I was working with the non-reliable receiver version of Spark-Kafka streaming,
i.e.
KafkaUtils.createStream..., where for testing purposes I was getting data at
a constant rate from Kafka and it was acting as expected.
But when there was an exponential amount of data in Kafka, my program started
crashing saying
I would recommend this spark package for your unit testing needs (
http://spark-packages.org/package/holdenk/spark-testing-base).
Best,
Burak
On Thu, Aug 13, 2015 at 5:51 AM, jay vyas jayunit100.apa...@gmail.com
wrote:
yes there certainly is, so long as eclipse has the right plugins and so on
Hello everyone,
in the new Kafka Direct API, what are the benefits of setting a value for
*spark.streaming.maxRatePerPartition*?
In my case, I have 2-second batches consuming ~15k tuples from a topic
split into 48 partitions (4 workers, 16 total cores).
Is there any particular value I should
The DataFrame issue has been fixed in Spark 1.5. Refer to SPARK-7990
https://issues.apache.org/jira/browse/SPARK-7990 and Stackoverflow: Spark
specify multiple column conditions for dataframe join
http://stackoverflow.com/a/31889190/1344789.
On Tue, Apr 28, 2015 at 12:55 PM, Ali Bajwa
All you'd need to do is *transform* the rdd before writing it, e.g. using
the .map function.
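A rough sketch of that, assuming the spark-cassandra-connector is on the classpath (the keyspace, table and ETL step are placeholders):

import com.datastax.spark.connector._

rdd
  .map(record => enrichRecord(record))          // hypothetical per-record ETL/DB check
  .saveToCassandra("my_keyspace", "my_table")   // elements must map to the table's columns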
On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi All,
I have a question about writing an RDD to Cassandra. Instead of writing the entire
RDD to Cassandra, I want to write
Hi,
I want to know how you coalesce the partitions to one to improve the performance.
Thanks
On 11 August 2015 at 23:31, Al M wrote:
I am using DataFrames with Spark 1.4.1. I really like DataFrames but the
partitioning makes no sense to me.
I am loading lots of very small files and joining them together.
The current Kafka stream implementation assumes the set of topics doesn't
change during operation.
You could either take a crack at writing a subclass that does what you
need; stop/start; or if your batch duration isn't too small, you could run
it as a series of RDDs (using the existing
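As a rough sketch of the "series of RDDs" idea in Scala (the offset lookup is a hypothetical helper you would drive from your DB, and kafkaParams is assumed to be in scope):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Hypothetical helper: reload the current topic list and offsets from your DB.
val ranges: Array[OffsetRange] = loadOffsetRangesFromDb()

// Build a plain RDD over exactly those offsets and process it as one batch.
val batch = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)
batch.foreach(println)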
Thanks a lot guys, that's exactly what I hoped for :-).
Cheers,
Mark
2015-08-13 6:35 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com:
A chain of map and flatmap does not cause any
serialization-deserialization.
On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info
wrote:
Serialization only occurs intra-stage, when you are using Python, and as
far as I know, only in the first stage, when reading the data and passing
it to the Python interpreter the first time.
Multiple operations are just chains of simple *map* and *flatMap* operators
at task level on simple Scala
Hi Todd,
We have not got a chance to update it. We will update it after 1.5 release.
Thanks,
Yin
On Thu, Aug 13, 2015 at 6:49 AM, Todd bit1...@163.com wrote:
Hi,
I got a question about the spark-sql-perf project by Databricks at
https://github.com/databricks/spark-sql-perf/
The
Hi All,
I have a question about writing an RDD to Cassandra. Instead of writing the entire
RDD to Cassandra, I want to write individual statements into Cassandra
because there is a need to perform ETL on each message (which requires
checking with the DB).
How could I insert statements individually?
Hi Philip,
I have the following requirement -
I read streams of data from various partitions of a Kafka topic. Then
I union the DStreams and apply a hash partitioner so messages with the same key
go into a single partition of an RDD, which is of course handled by a
single thread. This way we
Never mind me, I've found an email to this list from Raghavendra Pandey which
got me what I needed
val nestedCol = struct(df("nested2.column1"), df("nested2.column2"),
df("flatcolumn"))
val df2 = df.select(df("nested1"), nestedCol as "nested2")
Thanks,
Ewan
From: Ewan Leith
Sent: 13 August 2015 15:44
Hi,
Please let me know if I am missing anything in the mail below, so I can get the
issue fixed.
Regards,
Satish Chandra
On Wed, Aug 12, 2015 at 6:59 PM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi,
The below mentioned code works fine in the Spark shell, but when
the same is
Hello,
Any idea on why this is happening?
Thanks
Naga
-- Forwarded message --
From: Naga Vij nvbuc...@gmail.com
Date: Wed, Aug 12, 2015 at 5:47 PM
Subject: - Spark 1.4.1 - run-example SparkPi - Failure ...
To: user@spark.apache.org
Hi,
I am evaluating Spark 1.4.1
Any idea on
Hi
I've been at this problem for a few days now and wasn't able to solve it.
I'm hoping that I'm missing something that you don't!
I'm trying to run a simple python application on a 2-node-cluster I set up
in standalone mode. A master and a worker, whereas the master also takes on
the role of a
Has anyone used withColumn (or another method) to add a column to an existing
nested dataframe?
If I call:
df.withColumn("nested.newcolumn", df("oldcolumn"))
then it just creates the new column with a '.' in its name, not under the
nested structure.
Thanks,
Ewan
I have this call trying to save to HDFS 2.6:
wordCounts.saveAsNewAPIHadoopFiles(prefix, txt);
but I am getting the following:
java.lang.RuntimeException: class scala.runtime.Nothing$ not
org.apache.hadoop.mapreduce.OutputFormat
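A possible Scala sketch of the fuller overload, assuming wordCounts is a DStream of (String, Long) pairs (the path is a placeholder); naming the key, value and output-format classes explicitly is one common way to avoid the OutputFormat being inferred as Nothing:

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

wordCounts.saveAsNewAPIHadoopFiles(
  "hdfs:///user/spark/wordcounts", "txt",
  classOf[String], classOf[Long],
  classOf[TextOutputFormat[String, Long]])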
So you need some state between messages in a partition. You can use
mapPartitions or foreachPartition, which allow you to write code to process
an entire partition.
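A generic sketch of that shape (the connection type and insert call are placeholders, not a specific driver API):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = createDbConnection()        // hypothetical: opened once per partition
    var previous: Option[String] = None    // state carried between messages
    records.foreach { msg =>
      // check against the DB / previous message, then write this record individually
      conn.insert(msg)
      previous = Some(msg)
    }
    conn.close()
  }
}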
On Thu, Aug 13, 2015 at 11:48 AM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi Philip,
I have the following requirement -
I
What are the ideas around a Spark cluster for streaming purposes?
What is better: standalone, Mesos, or YARN?
Please share cluster details, size of data, and type of processing.
(multiple processing points) (architecture or similar)
I see folks using YARN cluster for streaming purposes.
Regards,
I am training a boosted trees model on a couple million input samples (with
around 300 features) and am noticing that the input size of each stage is
increasing each iteration. For each new tree, the first step seems to be
building the decision tree metadata, which does a .count() on the input
When deploying a Spark Streaming application I want to be able to retrieve
the latest Kafka offsets that were processed by the pipeline, and create
my Kafka direct streams from those offsets. Because the checkpoint
directory isn't guaranteed to be compatible between job deployments, I
don't want
Not that I have any answer at this point, but I was discussing this
exact same problem with Johannes today. An input size of ~20K records
was growing each iteration by ~15M records. I could not see why on a
first look.
@jkbradley I know it's not much info but does that ring any bells? I
think
It's just to limit the maximum number of records a given executor needs to
deal with in a given batch.
Typical usage would be if you're starting a stream from the beginning of a
kafka log, or after a long downtime, and don't want ALL of the messages in
the first batch.
On Thu, Aug 13, 2015 at
Is this an artifact of a recent change? Does this not show up in any of the
tests or benchmarks?
On Thu, Aug 13, 2015 at 2:33 PM, Sean Owen so...@cloudera.com wrote:
Not that I have any answer at this point, but I was discussing this
exact same problem with Johannes today. An input size of
Thanks Michael for the answer. I will watch the project and hope the update
will be coming soon, :-)
At 2015-08-14 02:13:32, Michael Armbrust mich...@databricks.com wrote:
Hey sorry, I've been doing a bunch of refactoring on this project. Most of the
data generation was a huge hack (it was
Access the offsets using HasOffsetRanges, save them in your datastore,
provide them as the fromOffsets argument when starting the stream.
See https://github.com/koeninger/kafka-exactly-once
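Roughly, the pattern looks like this in Scala for the 1.x direct API (stream is the existing direct stream; the datastore helpers are placeholders):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// 1) While the stream runs, capture each batch's offset ranges and persist them.
var offsetRanges = Array.empty[OffsetRange]
stream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  // ... process rdd ...
  saveOffsets(offsetRanges)                // hypothetical datastore write
}

// 2) On redeploy, read the saved offsets back and start the stream from them.
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()   // hypothetical read
val resumed = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))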
On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey sjdur...@gmail.com wrote:
When deploying a spark
Hi Guys,
We need to do some state checkpointing (an RDD that's updated using
updateStateByKey). We would like finer control over the serialization.
Also, this would allow us to do schema evolution in the deserialization
code when we need to modify the structure of the classes associated with
the
Oh I see, you are defining your own RDD Partition types, and you had a
bug where partition.index did not line up with the partition's slot in
rdd.getPartitions. Is that correct?
On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I figured that out, And these are my
I am exploring utilizing Spark as my dataset is becoming more and more
difficult to manage and analyze. I'd appreciate if anyone could provide
feedback on the following questions for me:
* I am especially interested in training large datasets using machine
learning algorithms.
o
Hey sorry, I've been doing a bunch of refactoring on this project. Most of
the data generation was a huge hack (it was done before we supported
partitioning natively) and used some private APIs that don't exist
anymore. As a result, while doing the regression tests for 1.5 I deleted a
bunch of
I'm using spark-1.4.1, compiled against CDH5.3.2. When I use
ALS.trainImplicit to build a model, I got this error with rank=40 and
iterations=30.
It worked for (rank=10, iterations=10) and (rank=20, iterations=20).
What was wrong with (rank=40, iterations=30)?
15/08/13 01:16:40 INFO
Hi All,
After creating a direct stream like below:
val events = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
I would like to convert the above stream into data frames, so that I could
run hive queries over it. Could anyone
It is strange that there are always two tasks slower than the others, and the
corresponding partitions' data is larger, no matter how many partitions there are.
Executor ID Address Task Time Shuffle Read Size /
Records
1 slave129.vsvs.com:56691 16 s1 99.5 MB /
Any idea anyone?
On Fri, Aug 14, 2015 at 10:11 AM, Mohit Durgapal durgapalmo...@gmail.com
wrote:
Hi All,
After creating a direct stream like below:
val events = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
I would
Hi Everyone,
Which one should work faster (coalesce or repartition) if I need to reduce the
number of partitions from 5000 to 3 before saving the RDD with saveAsTextFile?
The total data size is about 400MB on disk in text format.
Thank you
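For context, the two calls differ mainly in whether they shuffle: coalesce(3) merges existing partitions without a full shuffle, while repartition(3) always performs one (paths are placeholders):

rdd.coalesce(3).saveAsTextFile("hdfs:///out/coalesced")        // no full shuffle
rdd.repartition(3).saveAsTextFile("hdfs:///out/repartitioned") // full shuffle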
Hi,
I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines,
for a workload similar to TPCH (analytical queries with multiple/multi-way
large joins and aggregations). Each machine has 12GB of Memory and 4 cores.
My total data size is 150GB, stored in HDFS (stored as Hive tables),
I was able to get this working by using an alternative method; however, I
only see 0-byte files in Hadoop. I've verified that the output does exist
in the logs, but it's missing from HDFS.
On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I have this call trying to
Hello, I am new to Spark. I am looking for a matrix inverse and multiplication
solution. I did a quick search and found a couple of solutions, but my
requirements are:
- large matrices (up to 2 million x 2 million)
- need to support the complex double data type
- preferably in Java
There is one post
The code and error didn't go through.
Mind sending them again?
Which Spark release are you using ?
On Thu, Aug 13, 2015 at 6:17 PM, dizzy5112 dave.zee...@gmail.com wrote:
the code below works perfectly on both cluster and local modes
but when i try to create a graph in cluster mode (it works
Oh forgot to note using the Scala REPL for this.
Hi,
I would like to ask whether there are slides, blogs or videos on how Spark SQL
is implemented, i.e. the process or the whole picture of what happens when
Spark SQL executes code. Thanks!
You can look under Developer Track:
https://spark-summit.org/2015/#day-1
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly
old)
Catalyst design:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit
FYI
On Thu, Aug
I'm observing an unusual situation where my step duration increases as I add
further executors to my cluster. My algorithm is fully data-parallelizable into
a map phase, followed by a reduce step at the end that amounts to matrix
addition. So I've kicked off a cluster of, say, 100 executors with 4
The code below works perfectly in both cluster and local modes,
but when I try to create a graph in cluster mode (it works in local mode)
I get the following error:
Any help appreciated
If --jars doesn't work,
try --conf spark.executor.extraClassPath=path-to-jar
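For example (reusing the paths from the earlier command in this thread; adjust for your setup):

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld \
  --conf spark.executor.extraClassPath=/home/missingmerch/postgresql-9.4-1201.jdbc41.jar \
  --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar \
  /home/missingmerch/dse.jar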
Tim,
The ability to specify fine-grained configuration could be useful for many
reasons. Let's take an example of a node with 32 cores. First of all, as
per my understanding, having 5 executors each with 6 cores will almost
always perform better than having a single executor with 30 cores.
Hello!
I know there is an LRU eviction policy for the in-memory cached partitions, but
unfortunately I cannot find anything regarding the eviction policy for RDDs
persisted on disk. Is there any?
Does Spark assume that it has infinite disk available and never evicts anything?
Even then, if it's
Hi,
I got a question about the spark-sql-perf project by Databricks at
https://github.com/databricks/spark-sql-perf/
The Tables.scala
(https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala)
and BigData
Are you running on Mesos, YARN or standalone? If you're on Mesos, are you
using coarse-grained or fine-grained mode?
On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian arav...@cs.wisc.edu wrote:
I’m observing an unusual situation where my step duration increases as I
add further executors to my
Hi, may I know the performance difference between the rdd.join function and the
Spark SQL join operation? If I want to join several big RDDs, how should I decide
which one I should use? What are the factors to consider here? Thanks!
I'd say Spark will be faster in this case, because it avoids storing
intermediate data to disk after map and before reduce tasks.
It'll be faster even if you use a Combiner (I'd assume Pig is able to figure that
out).
Hard to say how much faster, as it'll depend on the disks available (SSD vs SSHD vs