Re: OOM writing out sorted RDD

2014-08-10 Thread Bharath Ravi Kumar
Update: as expected, switching to Kryo merely delays the inevitable. Does anyone have experience controlling memory consumption while processing (e.g. writing out) imbalanced partitions? On 09-Aug-2014 10:41 am, Bharath Ravi Kumar reachb...@gmail.com wrote: Our prototype application reads a 20GB

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-10 Thread DB Tsai
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick should work. Sent from my Google Nexus 5 On Aug 9, 2014 11:00 AM, Kevin James Matzen kmat...@cs.cornell.edu wrote: I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup().
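A minimal sketch of the singleton trick discussed above, assuming the expensive, non-serializable resource can be built lazily inside each executor JVM (the class and object names here are hypothetical):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical non-serializable resource (e.g. a parser holding native state).
class HeavyParser {
  def parse(s: String): Int = s.length
}

// A Scala object's lazy val is initialized at most once per JVM, so every
// executor builds its own instance instead of having one serialized and
// shipped from the driver.
object ParserHolder {
  lazy val parser = new HeavyParser
}

object Example {
  def parseAll(lines: RDD[String]): RDD[Int] =
    lines.map(line => ParserHolder.parser.parse(line))
}
```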

Sharing memory across applications

2014-08-10 Thread Tushar Khairnar
Hi, I am new to Spark and just going through all the different features and integration projects, so this could be a very naive question. I have a requirement where I want to access data stored in another application. It would be nice if I could share the Spark worker node inside the same JVM. From one of the

Re: saveAsTextFile

2014-08-10 Thread durin
This should work: jobs.saveAsTextFile(file:////home/hysom/testing) Note the 4 slashes, it's really 3 slashes + the absolute path. This should be mentioned in the docs though, I only remember that from having seen it somewhere else. The output folder, here testing, will be created and must therefore
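For illustration, a short sketch of that call as it might look from the Spark shell (the path /home/hysom/testing is taken from the post; the 4-slash form is the file:/// scheme followed by the absolute path):

```scala
// From the Spark shell, where the SparkContext `sc` already exists.
val jobs = sc.parallelize(Seq("job1", "job2", "job3"))

// "file:" plus four slashes = the file:/// scheme followed by the absolute
// path /home/hysom/testing. The output directory is created by the call and
// must therefore not already exist.
jobs.saveAsTextFile("file:////home/hysom/testing")
```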

how to use SPARK_PUBLIC_DNS

2014-08-10 Thread 诺铁
Hi all, I am playing with Docker, trying to create a Spark cluster with Docker containers. Since the Spark master, worker, and driver all need to reach each other, I configured a DNS server and set the hostname and domain name of each node. But when the Spark master starts up, it seems to be using the hostname

Re: Spark SQL dialect

2014-08-10 Thread Cheng Lian
Currently the SQL dialect provided by Spark SQL only supports a set of the most frequently used constructs and doesn't support DDL and DML operations. In the long run, we'd like to replace it with a full-featured SQL-92 implementation. On Sat, Aug 9, 2014 at 8:11 AM, Sathish Kumaran Vairavelu

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-10 Thread Cheng Lian
Hi Jenny, does this issue only happen when running Spark SQL with YARN in your environment? On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao linlin200...@gmail.com wrote: Hi, I am able to run my hql query on yarn cluster mode when connecting to the default hive metastore defined in

Re: how to use SPARK_PUBLIC_DNS

2014-08-10 Thread Xuefeng Wu
There is a Docker script for Spark 0.9 in the Spark git repo. Yours, Xuefeng Wu (吴雪峰) On Aug 10, 2014, at 8:27 PM, 诺铁 noty...@gmail.com wrote: hi, all, I am playing with docker, trying to create a spark cluster with docker containers. since spark master, worker, driver all need to visit each

Partitioning a libsvm format file

2014-08-10 Thread ayandas84
Hi, I am using the Spark/Scala stack to train a distributed SVM. For training the SVM I am using files in LIBSVM format. I want to partition a file into a fixed number of partitions, with each partition having an equal number of data points (assume that the number of data points in the file is exactly

CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
I have a CDH5.0.3 cluster with Hive tables written in Parquet. The tables have the DeprecatedParquetInputFormat on their metadata, and when I try to select from one using Spark SQL, it blows up with a stack trace like this: java.lang.RuntimeException: java.lang.ClassNotFoundException:

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Sean Owen
As far as I can tell, the method was removed after 0.12.0 in the fix for HIVE-5223 (https://github.com/apache/hive/commit/4059a32f34633dcef1550fdef07d9f9e044c722c#diff-948cc2a95809f584eb030e2b57be3993), and that fix was back-ported in its entirety to 5.0.0+:

Re: Partitioning a libsvm format file

2014-08-10 Thread Xiangrui Meng
If the file is big enough, you can try MLUtils.loadLibSVMFile with a minPartitions argument. This doesn't shuffle data but it might not give you the exact number of partitions. If you want to have the exact number, use RDD.repartition, which requires data shuffling. -Xiangrui On Sun, Aug 10, 2014
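A sketch of the two options Xiangrui describes, as they might look from the Spark shell; the exact loadLibSVMFile overloads vary between MLlib versions, so treat the argument list as approximate, and the file path is made up:

```scala
import org.apache.spark.mllib.util.MLUtils

// Option 1: pass a minPartitions hint at load time (no shuffle, but the
// resulting partition count is not guaranteed to be exact). The arguments
// here are path, numFeatures (-1 = infer) and minPartitions.
val hinted = MLUtils.loadLibSVMFile(sc, "data/sample.libsvm", -1, 8)

// Option 2: force an exact number of partitions with a shuffle.
val exact = MLUtils.loadLibSVMFile(sc, "data/sample.libsvm").repartition(8)
```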

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
Hi Sean, Thanks for the reply. I'm on CDH 5.0.3 and upgrading the whole cluster to 5.1.0 will eventually happen, but not immediately. I've tried running the CDH spark-1.0 release and also building it from source. This, unfortunately, goes into a whole other rathole of dependencies. :-( Eric

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Sean Owen
Hm, I was thinking that the issue is that Spark has to use a forked hive-exec since hive-exec unfortunately includes a bunch of dependencies it shouldn't. It forked Hive 0.12.0: http://mvnrepository.com/artifact/org.spark-project.hive/hive-exec/0.12.0 ... and then I was thinking maybe CDH wasn't

SparkStreaming 0.9.0 / Java / Twitter issue

2014-08-10 Thread Jörn Franke
Hello, Out of curiosity, I am trying to implement the following example in Java, based on this site: http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html Unfortunately, I did not find a recent example for using a Twitter source in Spark Streaming with Java.
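For reference, a rough Scala sketch of the Twitter stream setup that the AMP Camp exercise is built around (the Java API mirrors it via JavaStreamingContext). This assumes the spark-streaming-twitter artifact is on the classpath and is only an approximation of the 0.9.0-era API:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterExample {
  def main(args: Array[String]): Unit = {
    // Requires Twitter OAuth credentials exposed as twitter4j system properties.
    val conf = new SparkConf().setAppName("twitter-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    // None = build the twitter4j Authorization from system properties.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```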

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
Yeah, that's what I feared. Unfortunately upgrades on very large production clusters aren't a cheap way to find out what else is broken. Perhaps I can create an RCFile table and sidestep parquet for now. On Aug 10, 2014, at 1:45 PM, Sean Owen so...@cloudera.com wrote: Hm, I was thinking

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Michael Armbrust
I imagine it's not the only instance of this kind of problem people will ever encounter. Can you rebuild Spark with this particular release of Hive? Unfortunately, the Hive APIs that we use change too much from release to release to make this possible. There is a JIRA for compiling Spark SQL

Re: Spark SQL JSON dataset query nested datastructures

2014-08-10 Thread Michael Armbrust
Sounds like you need to use lateral view with explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, which is supported in Spark SQL's HiveContext. On Sat, Aug 9, 2014 at 6:43 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: I have a simple JSON
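A hedged sketch of what that looks like from Scala, assuming a Spark-shell HiveContext and an imaginary table people with an array column schools (on Spark 1.0.x the call is hql; later versions use sql):

```scala
import org.apache.spark.sql.hive.HiveContext

// Build a HiveContext from the existing SparkContext `sc`.
val hiveContext = new HiveContext(sc)

// Flatten a nested array column with LATERAL VIEW explode. The table
// "people" and its array column "schools" are made-up names.
val flattened = hiveContext.hql(
  "SELECT name, school FROM people LATERAL VIEW explode(schools) s AS school")

flattened.collect().foreach(println)
```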

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
Thanks Michael, I can try that too. I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla about the CDH-DataBricks partnership, it'd be awesome if you guys were somewhat more aligned, by which I mean that the DataBricks releases on Apache that say for CDH5 would

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Yin Huai
In case the link to PR #1819 is broken, here it is: https://github.com/apache/spark/pull/1819. On Sun, Aug 10, 2014 at 5:56 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Thanks Michael, I can try that too. I know you guys aren't in sales/marketing (thank G-d), but given all the hoopla

Second Attempt: Custom transformations in Spark

2014-08-10 Thread Jeevak Kasarkod
Is it possible to create custom transformations in Spark? For example, data security transforms such as encrypt and decrypt. Ideally it's something one would like to reuse across Spark Streaming, Spark SQL and core Spark.

Re: Custom Transformations in Spark

2014-08-10 Thread Jeevak Kasarkod
Thanks Tathagata! I did mean using the transformation in the form of a UDF in Spark SQL. The function I envision works on individual records, as you described. On Fri, Aug 8, 2014 at 6:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: You can always define an arbitrary RDD-to-RDD
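A minimal sketch of the record-level approach described above, expressed as a plain RDD-to-RDD transformation so the same function can be reused from Spark core and, via DStream.transform, from Spark Streaming (the encrypt function here is a stand-in, not a real cipher):

```scala
import org.apache.spark.rdd.RDD

object SecureTransforms {
  // Stand-in for a real cipher; any per-record function can be plugged in.
  def encrypt(record: String): String = record.reverse

  // An arbitrary RDD-to-RDD transformation applying it to every record.
  // From Spark Streaming, the same logic can be reused with
  // dstream.transform(encryptRDD _).
  def encryptRDD(input: RDD[String]): RDD[String] = input.map(encrypt)
}
```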

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Eric Friedman
On Sun, Aug 10, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: if I try to add hive-exec-0.12.0-cdh5.0.3.jar to my SPARK_CLASSPATH, in order to get DeprecatedParquetInputFormat, I find out that there is an incompatibility in the SerDeUtils class. Spark's Hive snapshot

Explain throws exception in SparkSQL

2014-08-10 Thread Yu Gavin
Hi all, I am trying to execute SQL like: explain create table test as select a.key, b.value from src a inner join src b on a.key=b.key; But I got: Physical execution plan: ExistingRdd [], ParallelCollectionRDD[1] at parallelize at SparkStrategies.scala:182 Does that mean an exception was raised

Re: error with pyspark

2014-08-10 Thread Davies Liu
On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao bqcaom...@gmail.com wrote: Hi there, I ran into a problem and can't find a solution. I was running bin/pyspark ../python/wordcount.py You could use bin/spark-submit ../python/wordcount.py instead. The wordcount.py is here:

Re: Does MLlib in spark 1.0.2 only work for tall-and-skinny matrix?

2014-08-10 Thread Reza Zadeh
Hi Andy, That is the case in Spark 1.0, yes. However, as of Spark 1.1, which is coming out very soon, you will be able to run SVD on non-TS (tall-and-skinny) matrices. If you try to apply the current algorithm to a matrix with more than 10,000 columns, you will overburden the master node, which has to compute a 10k
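For context, a hedged sketch of the tall-and-skinny SVD call being discussed, as it might look from the Spark shell on a toy matrix (API approximately as of Spark 1.0.x):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tall-and-skinny: many rows, few columns (well under ~10,000 in Spark 1.0).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)))

val mat = new RowMatrix(rows)

// Top-2 singular values/vectors; part of the computation happens on the
// driver, which is why very wide matrices overload the master in 1.0.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)
```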

Exception when call couchbase sdk in Spark Job

2014-08-10 Thread sunchen
I have written some very simple Scala code, shown below; it creates a Spark job and writes something into Couchbase. But when I run this job on Spark, I get an exception, and I am wondering how this happened. One thing must be pointed out: if I don't use Spark, the code works. Some other tips: