Update: as expected, switching to Kryo merely delays the inevitable. Does
anyone have experience controlling memory consumption while processing
(e.g. writing out) imbalanced partitions?
On 09-Aug-2014 10:41 am, Bharath Ravi Kumar reachb...@gmail.com wrote:
Our prototype application reads a 20GB
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick should
work.
Sent from my Google Nexus 5
On Aug 9, 2014 11:00 AM, Kevin James Matzen kmat...@cs.cornell.edu
wrote:
I have a related question. With Hadoop, I would do the same thing for
non-serializable objects and setup().
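A minimal sketch of the singleton trick mentioned above, assuming a hypothetical non-serializable Parser class (all names here are made up for illustration):

import org.apache.spark.rdd.RDD

// Hypothetical non-serializable resource, standing in for whatever setup() would build in Hadoop.
class Parser {
  def parse(line: String): String = line.toUpperCase
}

// A Scala object is a per-JVM singleton: it is never shipped with the task closure,
// and the lazy val is initialized once per executor, on first use.
object ParserHolder {
  lazy val parser = new Parser
}

def parseAll(lines: RDD[String]): RDD[String] =
  lines.mapPartitions { iter =>
    val p = ParserHolder.parser // resolved on the executor, much like Hadoop's setup()
    iter.map(p.parse)
  }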
Hi,
I am new to Spark and am just going through all the different features and
integration projects, so this could be a very naive question.
I have a requirement where I want to access data stored in another
application. It would be nice if I could share the Spark worker node inside the
same JVM. From one of the
This should work:
jobs.saveAsTextFile("file:////home/hysom/testing")
Note the 4 slashes: it's really 3 slashes + the absolute path.
This should be mentioned in the docs though; I only remember that from
having seen it somewhere else.
The output folder, here testing, will be created and must therefore
Hi all,
I am playing with Docker, trying to create a Spark cluster with Docker
containers.
Since the Spark master, worker, and driver all need to reach each other, I
configured a DNS server and set the hostname and domain name of each node.
But when the Spark master starts up, it seems to be using the hostname
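For reference, a hedged sketch of pinning the addresses the driver advertises via the standard spark.driver.* configuration properties (the hostnames are placeholders assumed to resolve through the containers' DNS server):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical hostnames served by the containers' DNS server.
val conf = new SparkConf()
  .setMaster("spark://spark-master.cluster.local:7077")
  .setAppName("docker-smoke-test")
  .set("spark.driver.host", "spark-driver.cluster.local") // address the executors use to reach the driver
  .set("spark.driver.port", "7001")                        // fixed port so the container can expose it

val sc = new SparkContext(conf)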
Currently the SQL dialect provided by Spark SQL only supports a set of the most
frequently used constructs and doesn't support DDL and DML operations. In
the long run, we'd like to replace it with a full-featured SQL-92
implementation.
On Sat, Aug 9, 2014 at 8:11 AM, Sathish Kumaran Vairavelu
Hi Jenny, does this issue only happen when running Spark SQL with YARN in
your environment?
On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao linlin200...@gmail.com wrote:
Hi,
I am able to run my HQL query in yarn-cluster mode when connecting to the
default Hive metastore defined in
There is a Docker script for Spark 0.9 in the Spark git repository.
Yours respectfully, Xuefeng Wu 吴雪峰
On Aug 10, 2014, at 8:27 PM, 诺铁 noty...@gmail.com wrote:
Hi all,
I am playing with Docker, trying to create a Spark cluster with Docker
containers.
Since the Spark master, worker, and driver all need to reach each
Hi,
I am using a Spark/Scala setup to train a distributed SVM. For training the SVM I
am using files in LIBSVM format. I want to partition a file into a fixed
number of partitions, with each partition having an equal number
of data points (assume that the number of data points in the file is exactly
I have a CDH5.0.3 cluster with Hive tables written in Parquet.
The tables have the DeprecatedParquetInputFormat on their metadata, and
when I try to select from one using Spark SQL, it blows up with a stack
trace like this:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
As far as I can tell, the method was removed after 0.12.0 in the fix
for HIVE-5223
(https://github.com/apache/hive/commit/4059a32f34633dcef1550fdef07d9f9e044c722c#diff-948cc2a95809f584eb030e2b57be3993),
and that fix was back-ported in its entirety to 5.0.0+:
If the file is big enough, you can try MLUtils.loadLibSVMFile with a
minPartitions argument. This doesn't shuffle data but it might not
give you the exact number of partitions. If you want to have the exact
number, use RDD.repartition, which requires data shuffling. -Xiangrui
On Sun, Aug 10, 2014
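A short sketch of the approach described above, using MLUtils plus repartition (the file path and partition count are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("libsvm-partitions"))

// loadLibSVMFile splits on input blocks; an overload also takes a minPartitions hint,
// but neither guarantees an exact partition count.
val data = MLUtils.loadLibSVMFile(sc, "/path/to/train.libsvm")

// repartition shuffles the data but returns exactly the requested number of partitions.
val evenParts = data.repartition(16)
println(evenParts.partitions.length)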
Hi Sean,
Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to
5.1.0 will eventually happen but not immediately.
I've tried running the CDH spark-1.0 release and also building it from
source. This, unfortunately, goes into a whole other rathole of
dependencies. :-(
Eric
Hm, I was thinking that the issue is that Spark has to use a forked
hive-exec since hive-exec unfortunately includes a bunch of
dependencies it shouldn't. It forked Hive 0.12.0:
http://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/0.12.0
... and then I was thinking maybe CDH wasn't
Hello,
Out of curiosity, I tried to implement the following example in Java,
based on this site:
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
Unfortunately, I did not find a recent example of using a Twitter source
in Spark Streaming with Java.
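For what it's worth, a minimal sketch in Scala (the Java version goes through JavaStreamingContext and the same TwitterUtils class); it assumes the twitter4j OAuth keys have already been set as system properties:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Assumes twitter4j.oauth.consumerKey / consumerSecret / accessToken / accessTokenSecret
// have been set as JVM system properties before this point.
val conf = new SparkConf().setMaster("local[2]").setAppName("twitter-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Passing None tells the receiver to build OAuth credentials from the twitter4j system properties.
val tweets = TwitterUtils.createStream(ssc, None)
tweets.map(_.getText).print()

ssc.start()
ssc.awaitTermination()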
Yeah, that's what I feared. Unfortunately upgrades on very large production
clusters aren't a cheap way to find out what else is broken.
Perhaps I can create an RCFile table and sidestep parquet for now.
On Aug 10, 2014, at 1:45 PM, Sean Owen so...@cloudera.com wrote:
Hm, I was thinking
I imagine it's not the only instance of this kind of problem people
will ever encounter. Can you rebuild Spark with this particular
release of Hive?
Unfortunately the Hive APIs that we use change too much from release to
release to make this possible. There is a JIRA for compiling Spark SQL
Sounds like you need to use lateral view with explode
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView,
which is supported in Spark SQL's HiveContext.
On Sat, Aug 9, 2014 at 6:43 PM, Sathish Kumaran Vairavelu
vsathishkuma...@gmail.com wrote:
I have a simple JSON
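A hedged sketch of what that looks like from a HiveContext (the table and column names are invented; in the 1.0.x API the query goes through hql):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`

// Hypothetical table `people` with an array<string> column `phone_numbers`;
// LATERAL VIEW explode() emits one output row per array element.
val flattened = hiveContext.hql(
  """SELECT name, phone
    |FROM people
    |LATERAL VIEW explode(phone_numbers) numbers AS phone""".stripMargin)

flattened.collect().foreach(println)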
Thanks Michael, I can try that too.
I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla
about the CDH-DataBricks partnership, it'd be awesome if you guys were
somewhat more aligned, by which I mean that the DataBricks releases on Apache
that say for CDH5 would
In case the link to PR 1819 is broken, here it is:
https://github.com/apache/spark/pull/1819.
On Sun, Aug 10, 2014 at 5:56 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
Thanks Michael, I can try that too.
I know you guys aren't in sales/marketing (thank G-d), but given all the
hoopla
Is it possible to create custom transformations in Spark? For example, data
security transforms such as encrypt and decrypt. Ideally it's something one
would like to reuse across Spark Streaming, Spark SQL, and core Spark.
Thanks Tathagata!
I did mean using the transformation in the form of a UDF in Spark SQL. The
function I envision works on individual records, as you described.
On Fri, Aug 8, 2014 at 6:48 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
You can always define an arbitrary RDD-to-RDD
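A hedged sketch of that idea: keep the transform as a plain function over records, then reuse it via RDD.map in batch code and DStream.transform in streaming (the encrypt function here is a stand-in, not a real cipher):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Stand-in "encryption"; substitute a real cipher in practice.
def encrypt(record: String): String = record.reverse

// Reusable RDD-to-RDD transformation.
def encryptRDD(rdd: RDD[String]): RDD[String] = rdd.map(encrypt)

// Streaming reuse: apply the same RDD-to-RDD function to every batch.
def encryptStream(stream: DStream[String]): DStream[String] =
  stream.transform(encryptRDD _)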
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:
If I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH in
order to get DeprecatedParquetInputFormat, I find that there is an
incompatibility in the SerDeUtils class. Spark's Hive snapshot
Hi all,
I am trying to execute SQL like:
explain create table test as select a.key, b.value from src a inner join
src b on a.key=b.key;
But I got:
Physical execution plan:
ExistingRdd [], ParallelCollectionRDD[1] at parallelize at
SparkStrategies.scala:182
Does that mean an exception was raised
On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao bqcaom...@gmail.com wrote:
Hi There
I ran into a problem and can’t find a solution.
I was running bin/pyspark ../python/wordcount.py
you could use bin/spark-submit ../python/wordcount.py
The wordcount.py is here:
Hi Andy,
That is the case in Spark 1.0, yes. However, as of Spark 1.1, which is
coming out very soon, you will be able to run SVD on non-TS (non-tall-and-skinny) matrices.
If you try to apply the current algorithm to a matrix with more than 10,000
columns, you will overburden the master node, which has to compute a 10k
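For reference, a hedged sketch of the tall-and-skinny case as it works today (the data and k are illustrative; `sc` is an existing SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tall-and-skinny input: many rows, few columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))

val mat = new RowMatrix(rows)

// Top k = 2 singular values/vectors. The numCols x numCols Gramian is handled on the driver,
// which is why very wide matrices (e.g. more than ~10,000 columns) overload the master.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s) // singular values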
I have written some very simple Scala code, shown below; this code makes a Spark
job and writes something into Couchbase. But when I run this job on Spark, I
get an exception, and I am wondering how this happened.
One thing should be pointed out: if I don't use Spark, the code works.
Some other tips: