If you want a long-running application, then go with Spark Streaming (which
kind of blocks your resources). On the other hand, if you use a job server,
then you can actually use the resources (CPUs) for other jobs when
your DB job is not using them.
Thanks
Best Regards
On Sun, Jul 5, 2015 at
Hi,
I am using the TeraSort benchmark from ehiggs's branch,
https://github.com/ehiggs/spark-terasort. Then I noticed that in
TeraSort.scala it is using the Kryo serializer, so I made a small change from
org.apache.spark.serializer.KryoSerializer to
org.apache.spark.serializer.JavaSerializer.
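The change amounts to swapping the serializer class name in the Spark conf.
A minimal sketch (the conf key is the standard spark.serializer; whether
TeraSort sets it in exactly this way is an assumption):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TeraSort")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")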
Looks like it spent more time writing/transferring the 40GB of shuffle
when you used Kryo. And surprisingly, JavaSerializer has 700MB of shuffle?
Thanks
Best Regards
On Sun, Jul 5, 2015 at 12:01 PM, Gavin Liu ilovesonsofanar...@gmail.com
wrote:
Hi,
I am using TeraSort benchmark from
I have a very simple application, where I am initializing the SparkContext
and using the context. The problem happens with both Spark 1.3.1 and 1.4.0;
Scala 2.10.4; Java 1.7.0_79
Full Program
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
Hi Ayan,
How continuous is your workload? As Akhil points out, with streaming
you'll give up at least one core for receiving and will need at most one
more core for processing. Unless you're running on something like Mesos,
this means that those cores are dedicated to your app and can't be
Thanks Akhil. In case I go with Spark Streaming, I guess I have to
implement a custom receiver, and Spark Streaming will call this receiver
every batch interval, is that correct? Any gotchas you see in this plan?
TIA... Best, Ayan
On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das ak...@sigmoidanalytics.com
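For reference, a receiver is not called once per batch interval; it runs
continuously, and whatever it store()s is sliced into batches by Spark. A
minimal sketch of such a receiver, with the CDC polling left as a
hypothetical fetchChanges():

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CdcReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    // onStart must not block, so poll on a separate thread.
    new Thread("cdc-poller") {
      override def run(): Unit = {
        while (!isStopped()) {
          fetchChanges().foreach(record => store(record))
          Thread.sleep(60 * 1000) // hypothetical 1-minute poll interval
        }
      }
    }.start()
  }
  override def onStop(): Unit = {} // the polling thread exits via isStopped()
  private def fetchChanges(): Seq[String] = Seq.empty // placeholder for the CDC read
}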
Sorry for the silly question. I'm fairly new to Spark.
Because of the cleanup log messages, I didn't see the scala> prompt, so I
thought it was still working on something. If I pressed Enter, I got
disconnected. I finally tried typing the variable name, which actually
worked.
Hi
Thanks for the reply. Here is my situation: I have a DB which enables
synchronous CDC; think of this as a DB trigger which writes changed values
to a table as soon as something changes in the production table. My job
will need to pick up the data as soon as it arrives, which can be every 1
min
Hi,
I am trying to integrate Drools rules API with Spark so that the solution
could solve few CEP centric use cases.
When I read data from a local file (simple FileWriter - readLine()), I see
that all my rules are reliably fired and every time I get the results as
expected. I have tested with
Are you seeing this after the app has already been running for some time,
or just at the beginning? Generally, registration should only occur once
initially, and a timeout would be due to the master not being accessible.
Try telnetting to the master IP/port from the machine on which the driver
will
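A programmatic stand-in for that telnet check, as a sketch (the host name
is hypothetical; 7077 is the standalone master's default port):

import java.net.{InetSocketAddress, Socket}

val socket = new Socket()
try socket.connect(new InetSocketAddress("spark-master", 7077), 5000) // 5s timeout
finally socket.close()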
Please note -- I am trying to run this with sbt run or spark-submit --
getting the same errors in both.
Since I am in stand-alone mode, I assume I need not start the Spark
master; am I right?
I realize this is probably a basic setup issue, but am unable to get past
it. Any help will be
Scala used to run on .NET
http://www.scala-lang.org/old/node/10299
--
Ruslan Dautkhanov
On Thu, Jul 2, 2015 at 1:26 PM, pedro ski.rodrig...@gmail.com wrote:
You might try using .pipe() and installing your .NET program as a binary
across the cluster (or using addFile). It's not ideal to pipe
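A sketch of the pipe() approach (the binary name and paths are
hypothetical; the binary must exist at that path on every worker, e.g.
shipped via sc.addFile):

// Each partition's elements are written line-by-line to the external
// program's stdin, and its stdout lines become the resulting RDD.
val piped = sc.textFile("hdfs:///input/data.txt").pipe("./my-dotnet-tool")
piped.saveAsTextFile("hdfs:///output")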
That code doesn't appear to be registering classes with Kryo, which means the
fully-qualified classname is stored with every Kryo record. The Spark
documentation has more on this:
https://spark.apache.org/docs/latest/tuning.html#data-serialization
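The registration the tuning guide describes looks like this (a sketch;
MyRecord stands in for the job's actual record classes):

import org.apache.spark.SparkConf

case class MyRecord(id: Long, payload: String) // hypothetical record type

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord])) // stores a class id, not the full name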
Regards,
Will
On July 5, 2015, at 2:31 AM,
If it is indeed a reactive use case, then Spark Streaming would be a good
choice.
One approach worth considering: is it possible to receive a message via
Kafka (or some other queue)? That wouldn't need any polling, and you could
use standard consumers. If polling isn't an issue, then writing a
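A sketch of the Kafka route using the Spark 1.3+ direct stream (broker and
topic names are hypothetical, sc is an existing SparkContext, and the
spark-streaming-kafka artifact must be on the classpath):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
// The CDC trigger (or a bridge process) publishes each change to Kafka;
// the job consumes it with no polling.
val changes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("cdc-changes"))
changes.map(_._2).foreachRDD(rdd => rdd.foreach(println))
ssc.start()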
It's not possible to specify YARN RM parameters on the command line at
spark-submit time. You have to specify all resources that are available on
your cluster to YARN upfront. If you want to limit the amount of resources
available for your Spark job, consider using YARN dynamic resource pools
instead
Hi
Apache Flink outperforms Apache Spark in processing machine learning graph
algorithms and relational queries but not in batch processing!
The results were published in the proceedings of the 18th International
Conference, Business Information Systems 2015, Poznań, Poland, June 24-26,
2015.
Unfortunately, afaik that project is long dead.
It'd be an interesting project to create an intermediary protocol, perhaps
using something that nearly everything these days understands (unfortunately [!]
that might be JavaScript). For example, instead of pickling language
constructs, it might be
You can also find more info here:
http://tachyon-project.org/master/Running-Spark-on-Tachyon.html
Hope this helps.
Haoyuan
On Tue, Jun 30, 2015 at 11:28 PM, Himanshu Mehra
himanshumehra@gmail.com wrote:
Hi neprasad,
You should give the Tachyon system a try, or any other in-memory DB.
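A sketch of what that looks like from Spark, following the Tachyon-era
docs linked above (the master host/port and paths are hypothetical):

// Point Hadoop's FileSystem API at Tachyon, then read/write tachyon:// paths.
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
val rdd = sc.textFile("tachyon://tachyon-master:19998/data/input.txt")
rdd.saveAsTextFile("tachyon://tachyon-master:19998/data/output")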
IronPython shares only its syntax with Python, at best. It is a scripting
language within the .NET framework; many applications embed it for
scripting the application itself. This won't work for you. You can use
pipes, or write your Spark jobs in Java/Scala/R and submit them via your
.NET
Sam:
bq. where would one set this timeout?
With the following, it would be relatively easy to see which conf to
change:
[SPARK-6980] [CORE] Akka timeout exceptions indicate which conf controls
them (RPC Layer)
FYI
On Sun, Jul 5, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
Usually
hi,
We're running Spark 1.4.0 on EC2, with 6 machines, 4 cores each. We're
trying to run an application with a given number of total-executor-cores,
but we want it to run on as few machines as possible (e.g. with
total-executor-cores=4 we'd want a single machine; with
total-executor-cores=12, we'll
There was no mention of the versions of Flink and Spark used in the
benchmarking, and the size of the cluster is quite small.
Cheers
On Sun, Jul 5, 2015 at 10:24 AM, Slim Baltagi sbalt...@gmail.com wrote:
Hi
Apache Flink outperforms Apache Spark in processing machine learning
graph
algorithms and
Usually this message means that the test was starting some process, like a
Spark master, and it never started; the eventual error is a timeout. You
have to dig into the test and logs to catch the real reason.
On Sun, Jul 5, 2015 at 9:23 PM, SamRoberts samueli.robe...@yahoo.com wrote:
I'm trying to use SparkSQL to efficiently query structured data from
datasets in S3. The data is naturally partitioned by date, so I've laid it
out in S3 as follows:
s3://bucket/dataset/dt=2015-07-05/
s3://bucket/dataset/dt=2015-07-04/
s3://bucket/dataset/dt=2015-07-03/
etc.
In each directory,
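A sketch of how that layout pays off, assuming the files are Parquet:
partition discovery exposes dt as a column, so filtering on it prunes
whole date directories instead of scanning the full dataset:

val df = sqlContext.read.parquet("s3://bucket/dataset/")
df.filter(df("dt") >= "2015-07-03").count() // reads only the matching dt= dirs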
Joe Duffy, director of engineering on Microsoft's compiler team, made a
comment about investigating F# type providers for Spark.
https://twitter.com/xjoeduffyx/status/614076012372955136
From: Ashic Mahtab <as...@live.com>
Sent: Sunday, July 5, 2015 1:29 PM
To: Ruslan
Also, it's not clear where the 1 millisecond timeout is coming from. Can
someone explain -- and if it's a legitimate timeout problem, where would
one set this timeout?
Hi Prakhar,
How about you check the following web resources related to image
processing using Spark? They are all listed in my Big Data Knowledge Base,
http://www.SparkBigData.com:
1. Scaling Up Fast: Real-time Image Processing and Analytics using Spark -
Kevin Mader (ETH Zurich) [VIDEO]
We have hit the same issue in the spark shell when registering a temp
table. We observed it happening with those who had JDK 6; the problem went
away after installing JDK 8. This was only for the tutorial materials,
which were about loading a Parquet file.
Regards
Andy
On Sat, Jul 4, 2015 at 2:54 AM,
I had run into the same problem, where everything was working swimmingly
with Spark 1.3.1. When I switched to Spark 1.4, either upgrading to Java 8
(from Java 7) or knocking up the PermGen size solved my issue.
HTH!
On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
Hi guys,
I just read the paper too. There is not much information in the benchmark
regarding why Flink is faster than Spark for data-science-type workloads.
It is very difficult to generalize the conclusion of a benchmark, from my
point of view. How much experience the author has with Spark is
Hi all!
I am trying to read records from offset 100 to 110 from a table using the
following piece of code:
import scala.collection.mutable.HashMap
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val options = new HashMap[String, String]()
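A sketch of where this heads, using the partitionColumn advice from later
in this thread (the URL, driver, and table/column names are hypothetical).
Note that lowerBound/upperBound only control how reads are split across
partitions of partitionColumn; they do not filter rows, so reading just
offsets 100 to 110 needs a WHERE clause pushed into dbtable:

options.put("url", "jdbc:mysql://dbhost:3306/mydb")
options.put("driver", "com.mysql.jdbc.Driver")
options.put("dbtable", "(SELECT * FROM mytable WHERE id BETWEEN 100 AND 110) AS t")
val df = sqlContext.read.format("jdbc").options(options.toMap).load()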
The file is at
https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1
The command was included in the gist
SPARK_REPL_OPTS="-XX:MaxPermSize=256m" \
spark-1.4.0-bin-hadoop2.6/bin/spark-shell \
  --packages com.databricks:spark-csv_2.10:1.0.3 \
  --driver-memory 4g
Hello,
I'm migrating some RDD-based code to using DataFrames. We've seen massive
speedups so far!
One of the operations in the old code creates an array of the values for
each key, as follows:
val collatedRDD = valuesRDD
  .mapValues(value => Array(value))
  .reduceByKey((array1, array2) => array1 ++ array2)
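For the DataFrame side of the migration, the same collation can be
expressed with collect_list; a hedged sketch with hypothetical column
names (collect_list needs Hive support in the 1.x line and is built in
from Spark 2.0):

import org.apache.spark.sql.functions.collect_list

val collatedDF = valuesDF.groupBy("key").agg(collect_list("value"))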
I think you should specify partitionColumn like below, and the column type
should be numeric. It works for my case.
options.put("partitionColumn", "revision");
Thanks,
Manohar
From: Hafiz Mujadid [via Apache Spark User List]
[mailto:ml-node+s1001560n23635...@n3.nabble.com]
Sent: Monday,
thanks
On Mon, Jul 6, 2015 at 10:46 AM, Manohar753 [via Apache Spark User List]
ml-node+s1001560n23637...@n3.nabble.com wrote:
I think you should specify partitionColumn like below, and the column type
should be numeric. It works for my case.
options.put("partitionColumn", "revision");
Hi,
I tried to use it for one table created in Spark, but the results are all
empty. I want to get the metadata for the table; what are the other options?
Thanks
+-------------+
| result      |
+-------------+
| # col_name  |
|
Maybe Flink benefits from some of the points they outline here:
http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Probably re-running the benchmarks against the 1.5/Tungsten line would
close the gap a bit (or a lot), with Spark moving towards a similar style
of off-heap memory management,
Yin,
With 512MB of PermGen, the process still hung and had to be kill -9ed.
At 1GB, the spark-shell-associated processes stopped hanging and started
exiting with
scala> println(dfCount.first.getLong(0))
15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040) called with curMem=0,
I have a long-lived Spark application running on YARN.
On some nodes, it tries to write to the shuffle path in the shuffle map
task, but the root path /search/hadoop10/yarn_local/usercache/spark/ was
deleted, so the task fails. So every time a shuffle map task runs on this
node, it was
Thanks.. tried local[*] -- it didn't help.
I agree that it is something to do with the SparkContext..
Nodes cloud10141049104.wd.nm.nop.sogou-op.org and
cloud101417770.wd.nm.ss.nop.sogou-op.org failed too many times. I want to
know whether a node can be taken offline automatically when it has failed
too many times.
2015-07-06 12:25 GMT+08:00 Tao Li litao.bupt...@gmail.com:
I have a long live spark application running on
One more data point -- sbt seems to have a bigger problem with this than
spark-submit. With spark-submit, I am able to get it to run several times,
while sbt fails most of the time (or more recently all the time).
Can you provide the result set you are using and specify how you
integrated the Drools engine?
Drools is basically based on a large shared memory. Hence, if you have
several tasks in Spark, they end up using different shared-memory areas.
A full integration of Drools requires some sophisticated
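One pattern that follows: build the rules session inside each task rather
than on the driver, so every task owns its working memory. A sketch
assuming Drools 6's KIE classpath API (inputRDD and the fact handling are
hypothetical):

import org.kie.api.KieServices

val fired = inputRDD.mapPartitions { records =>
  // One stateless session per partition; sessions are not serializable,
  // so they must be created on the executor, not shipped from the driver.
  val session = KieServices.Factory.get().getKieClasspathContainer.newStatelessKieSession()
  records.map { record =>
    session.execute(record) // fires all rules matching this fact
    record
  }
}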
I have never seen an issue like this. Setting the PermGen size to 256m
should solve the problem. Can you send me your test file and the command
used to launch the spark shell or your application?
Thanks,
Yin
On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote:
Yin,
With 512Mb
Hi,
I am trying to run an ETL job on Spark which involves an expensive shuffle
operation. Basically, I require a self-join to be performed on a Spark
DataFrame RDD. The job runs fine for around 15 hours, and when the stage
(which performs the self-join) is about to complete, I get a
*java.io.IOException:
Sim,
Can you increase the PermGen size? Please let me know what your setting is
when the problem disappears.
Thanks,
Yin
On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote:
I had run into the same problem where everything was working swimmingly
with Spark 1.3.1. When I