Hi List,
We recently tried running Spark on Mesos, but we encountered a fatal
error: the mesos-master process continuously consumes memory and is finally
killed by the OOM killer. This only happens when a Spark job
(fine-grained mode) is running.
We finally root caused the
What operation are you performing before the saveAsTextFile? If you
are doing a groupBy/sortBy/mapPartitions/reduceByKey operation, then you can
specify the number of partitions. We were facing these kinds of problems, and
specifying the correct number of partitions solved the issue.
Thanks
Best Regards
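As an illustration of the suggestion above, a minimal sketch of passing an explicit partition count to reduceByKey before saving (the RDD name and the count of 200 are assumptions, not from the thread):

val counts = words.map(w => (w, 1))
  .reduceByKey(_ + _, 200) // explicit number of partitions for the shuffle
counts.saveAsTextFile("hdfs:///tmp/counts") // output path is illustrative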
There's no way to avoid a shuffle, since the first and last elements
of each partition need to be computed together with the others, but I wonder
if there is a way to do a minimal shuffle.
On Thu, Aug 21, 2014 at 6:13 PM, cjwang c...@cjwang.us wrote:
One way is to do zipWithIndex on the RDD. Then use
Because map-reduce tasks like join will save shuffle data to disk. So the
only difference between the caching and no-caching versions is:
.map { case (x, (n, i)) => (x, n) }
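A minimal sketch of the zipWithIndex idea being described, pairing each element with its successor (assumes an already-sorted RDD named rdd; all names are illustrative):

val indexed = rdd.zipWithIndex().map { case (x, i) => (i, x) }
val shifted = indexed.map { case (i, x) => (i - 1, x) } // key each element to its predecessor's index
val withNext = indexed.join(shifted) // (i, (elem, nextElem)), a comparatively small join-shuffle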
-
Thanks,
Nieyuan
You can check out this pull request: https://github.com/apache/spark/pull/476
LDA is on the roadmap for the 1.2 release, hopefully we will officially support
it then!
Best,
Burak
- Original Message -
From: Denny Lee denny.g@gmail.com
To: user@spark.apache.org
Sent: Thursday, August
Why don't you directly use the DStream created as output of the windowing process?
Any reason
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Aug 21, 2014 at 8:38 PM, Josh J joshjd...@gmail.com wrote:
Hi,
I
Hi
The following code gives me 'Task not serializable:
java.io.NotSerializableException: scala.collection.mutable.ArrayOps$ofInt'
var x = sc.parallelize(Array(1,2,3,4,5,6,7,8,9),3)
var iter = Array(5).toIterator
var value = 5
var value2 = iter.next
x.map( q => q*value).collect //Line 1, it works.
Hi,
Hopefully a simple question, though: is there an example of where to save
the output of countByWindow? I would like to save the results to external
storage (Kafka or Redis). The examples show only stream.print().
Thanks,
Josh
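A minimal sketch of saving windowed output via foreachRDD, assuming an existing DStream[String] named lines, the usual streaming imports, and a checkpoint directory (the storage call is a placeholder, not a real Kafka/Redis client):

val counts = lines.countByWindow(Seconds(30), Seconds(10)) // needs ssc.checkpoint(...)
counts.foreachRDD { rdd =>
  rdd.collect().foreach { c =>
    println(c) // placeholder: replace with a Kafka producer send or a Redis SET
  }
}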
Hello Team,
I was just trying to install Spark on my Windows Server 2012 machine and use it
in my project, but unfortunately I could not find any documentation for it.
Please let me know if we have drafted anything for Spark users on Windows. I am
really in need of it as we are using
Hi all,
1. In Spark standalone mode, when a client submits an application, where does
the driver program run? On the client or on the master?
2. Is standalone mode reliable? Can it be used in production?
taoist...@gmail.com
I am using PySpark with IPython notebook.
data = sc.parallelize(range(1000), 10)
#successful
data.map(lambda x: x+1).collect()
#Error
data.count()
Something similar: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html
But it does not
I'm running pyspark with Python 2.7.8 under Virtualenv
System Python Version: Python 2.6.x
Hi all,
Somehow related to this question and this data structure: what is the best
way to extract features using names instead of positions? Of course, it is
first necessary to store the names in some way...
Thanks in advance
--
View this message in context:
Hi everyone
I back-ported kinesis-asl to Spark 1.0.2 and ran a quick test on my local
machine. It seems to be working fine, but I keep getting the following
warnings. I am not sure what they mean and whether they are something to worry
about or not.
2014-08-22 15:53:43,803 [pool-1-thread-7] WARN
Does anyone know a way to do this?
I tried sorting the data and writing an auto-increment function,
but since the computation runs in parallel the result is wrong.
Is there any way? Please reply.
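For what it's worth, a minimal sketch of the usual parallel-safe alternative to a hand-rolled counter, using zipWithIndex (sample data is illustrative):

val data = sc.parallelize(Seq("a", "b", "c", "d"), 2)
val numbered = data.zipWithIndex() // stable, contiguous indices despite parallel execution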
Hi everyone
Sorry about the noob question, but I am struggling to understand ways to
create DStreams in Spark. Here is my understanding based on what I could
gather from documentation and studying Spark code (as well as some hunch).
Please correct me if I am wrong.
1. In most cases, one would
Hi all,
I have a Spark cluster of 30 machines, 16GB / 8 cores each, running in
standalone mode. Previously my application was working well (several
RDDs, the largest being around 50G).
When I started processing larger amounts of data (RDDs of 100G), my app
is losing executors. I'm currently
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work
correctly?
Hi everyone!
Spark has now set Snappy as the default compression codec in
spark-1.1.0-SNAPSHOT.
So if I want to run a shuffle job, do I have to install Snappy on Linux?
Hello all,
I am new to Spark and I want to analyze a csv file using Spark on my local
machine. The csv file contains an airline database, and I want to get a few
descriptive statistics (e.g. maximum of one column, mean, standard deviation of
a column, etc.) for my file. I am reading the file using
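A minimal sketch of the kind of per-column statistics being asked about, assuming a numeric column at index 2 and no header row (both assumptions):

val col = sc.textFile("airline.csv").map(_.split(",")(2).toDouble)
val s = col.stats() // one pass: count, mean, stdev, max, min
println(s"max=${s.max} mean=${s.mean} stdev=${s.stdev}")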
Folks,
I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the
more functional TypeTag option.
I have some code that needs TypeTag functionality, and I don't know if a
TypeTag can be converted to a ClassTag.
Mohit.
Hi Calvin,
When you say "until all the memory in the cluster is allocated and the job
gets killed", do you know what's going on? Spark apps should never be
killed for requesting / using too many resources. Any associated error
message?
Unfortunately there are no tools currently for tweaking the
Hi Sankar,
You need to create an external table in order to specify the location of
data (i.e. using CREATE EXTERNAL TABLE user1 LOCATION). You can take
a look at this page
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable
for
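A hedged sketch of the kind of DDL being described, issued through a HiveContext (schema, row format, and location are all illustrative):

hiveContext.hql(
  """CREATE EXTERNAL TABLE user1 (id INT, name STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION '/user/hive/external/user1'""".stripMargin)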
Hello Yin,
I have tried the create external table command as well. I get the same error.
Please help me to find the root cause.
Thanks and Regards,
Sankar S.
On Friday, 22 August 2014, 22:43, Yin Huai huaiyin@gmail.com wrote:
Hi Sankar,
You need to create an external table in
Hello Yin,
Forgot to mention one thing: the same query works fine in Hive and Shark.
Thanks and Regards,
Sankar S.
On , S Malligarjunan smalligarju...@yahoo.com wrote:
Hello Yin,
I have tried the create external table command as well. I get the same error.
Please help me to find the
It would be nice if an RDD that was massaged by OrderedRDDFunctions could know
its neighbors.
This is probably a bit ridiculous, but I'm wondering if it's possible
to use Scala libraries in a Python module? The Cassandra connector
here https://github.com/datastax/spark-cassandra-connector is in
Scala; would I need a Python version of that library to use it from
Python Spark?
Personally I have no
Hi,
I am having this FetchFailed issue when the driver is about to collect about
2.5M lines of short strings (about 10 characters each) from a YARN
cluster with 400 nodes:
*14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage
0.0 (TID 1228, aaa.xxx.com):
Hello Sankar,
ADD JAR in SQL is not supported at the moment. We are working on it
(https://issues.apache.org/jira/browse/SPARK-2219). For now, can you try
SparkContext.addJar, or use --jars your-jar when launching the spark shell?
Thanks,
Yin
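For example (the jar path is illustrative):

sc.addJar("/path/to/your-udf.jar") // from application code
// or at launch: bin/spark-shell --jars /path/to/your-udf.jar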
On Fri, Aug 22, 2014 at 2:01 PM, S Malligarjunan
Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?
Thanks
Arpan
Hi All,
I have a set of 1000k workers of a company, each with different attributes
associated with them. I would like, at any time, to be able to report on their
current state and to update the reports every 5 seconds.
Spark Streaming allows you to receive events about the workers' state changes
and process them.
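A minimal sketch of that pattern with updateStateByKey, assuming a DStream of (workerId, event) pairs named events, a 5-second batch interval, and a checkpoint directory (all names are assumptions):

val state = events.updateStateByKey[String] { (newEvents: Seq[String], old: Option[String]) =>
  newEvents.lastOption.orElse(old) // keep the most recent event as the current state
}
state.print() // or write the report to external storage each batch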
I am using Spark's Thrift server to connect to Hive and use JDBC to issue
queries. Is there a way to cache a table in Spark by using a JDBC call?
Thanks,
Ken
I thought the fix had been pushed to the Apache master, ref. commit
"[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my
previous email was based on my own build of the Apache master, which turned
out not to be working yet.
Marcelo: please correct me if I got that commit wrong.
Thanks,
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed the issue
seems related to Hadoop1. When switching to using
spark-1.0.2-bin-hadoop*2*, the issue disappears.
This is all that I see related to spark.MapOutputTrackerMaster in the master
logs after OOME
14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread
[spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: Java heap space
Is it possible to connect to the thrift server using an ODBC client
(ODBC-JDBC)?
My thrift server is built from branch-1.0-jdbc using Hive 0.13.1
On 08/22/2014 04:32 PM, Arpan Ghosh wrote:
Is there any way to control the ordering of values for each key during a
groupByKey() operation? Is there some sort of implicit ordering in place
already?
Thanks
Arpan
There's no implicit ordering in place. The same holds for the order of
keys,
As far as I know, only YARN mode can set --num-executors. Someone showed that
setting more executors performs better than setting only 1 or 2 executors
with large memory and many cores. See
http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-vs-num-executors-td9878.html
Why
Hi Xiangrui,
You can refer to An Introduction to Statistical Learning with Applications in
R; there are many standard hypothesis tests to do regarding linear
regression and logistic regression. They should be implemented first;
then we will list some other tests, which are also
You can kv.mapValues(sorted), but that's definitely less efficient than
sorting during the groupBy.
You could try using combineByKey directly with heapq; a hedged completion of
the snippet follows (sort each per-key heap at the end, e.g. .mapValues(sorted)):
from heapq import heappush

def createCombiner(x):
    return [x]

def mergeValues(xs, x):
    heappush(xs, x)
    return xs

def mergeCombiners(xs, ys):  # hedged completion, not in the original mail
    for y in ys:
        heappush(xs, y)
    return xs
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still
somewhat experimental at the time, so we decided not to use them. If you want,
though, you can probably design other APIs that pass a TypeTag around (e.g.
make a method that takes an RDD[T] but also requires an implicit
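A rough sketch of the kind of API being described (names are illustrative):

import scala.reflect.runtime.universe.TypeTag
def describe[T](rdd: org.apache.spark.rdd.RDD[T])(implicit tt: TypeTag[T]): String =
  s"RDD of ${tt.tpe}" // the TypeTag travels with the call, alongside the RDD's own ClassTag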
You should be able to just download / unzip a Spark release and run it on a
Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd.
The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on
Windows, but you can launch a standalone cluster manually using
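For reference, a hedged sketch of launching a master and worker by hand with the deploy classes (host and port are illustrative):

bin\spark-class.cmd org.apache.spark.deploy.master.Master
bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://masterhost:7077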
Anyone know why I would see this in a bunch of executor logs? Is it just
classic overloading of the cluster network, OOM, or something else? If
anyone's seen this before, what do I need to tune to make some headway here?
Thanks,
Victor
Caused by: org.apache.spark.FetchFailedException: Fetch
I think it depends on your job. From my personal experience running TB-scale
data, Spark got connection-loss failures when I used a big JVM with large
memory, but with more executors with small memory it ran very smoothly. I was
running Spark on YARN.
Thanks.
Zhan Zhang
On Aug 21, 2014, at 3:42 PM,
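For reference, a hedged example of the many-small-executors shape being described, on YARN (all numbers and the jar name are illustrative):

bin/spark-submit --master yarn-client --num-executors 32 \
  --executor-memory 4g --executor-cores 2 your-app.jar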