Re: Breaking the previous large-scale sort record with Spark

2014-10-11 Thread Henry Saputra
Congrats to Reynold et al. for leading this effort! - Henry On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark

Re: How to do broadcast join in SparkSQL

2014-10-11 Thread Jianshi Huang
It works fine, thanks for the help Michael. Liancheng also told me a trick: using a subquery with LIMIT n. It works in the latest 1.2.0. BTW, it looks like the broadcast optimization won't be recognized if I do a left join instead of an inner join. Is that true? How can I make it work for left joins?
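
The broadcast join being discussed ships the small table to every executor and probes it with a local hash lookup. The mechanics, including why a left join differs from an inner join, can be sketched in plain Python (table contents here are invented for illustration, not from the thread):

```python
# Sketch of a broadcast (map-side) hash join: the small table becomes an
# in-memory hash map, probed once per row of the large table.
small = {1: "us", 2: "uk"}                        # small dimension table: id -> country
large = [(1, "alice"), (2, "bob"), (3, "carol")]  # large fact table: (id, name)

# Inner join: drop large-table rows with no match in the small table.
inner = [(uid, name, small[uid]) for uid, name in large if uid in small]

# Left join: keep every large-table row, filling the missing side with None.
left = [(uid, name, small.get(uid)) for uid, name in large]

print(inner)  # [(1, 'alice', 'us'), (2, 'bob', 'uk')]
print(left)   # [(1, 'alice', 'us'), (2, 'bob', 'uk'), (3, 'carol', None)]
```

Note the left join still only needs the small side in memory, which is why broadcasting remains applicable in principle even when the optimizer declines to use it.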

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-11 Thread Sean Owen
I suspect you do not actually need to change the number of partitions dynamically. Do you just have groupings of data to process? Use an RDD of (K,V) pairs and things like groupByKey. If you really have only 1000 unique keys, then yes, only half of the 2000 workers would get data in a phase that groups by
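
The grouping Sean describes can be illustrated without Spark: collecting (K,V) pairs by key means each distinct key yields exactly one group, so the number of non-empty groups, not the worker count, bounds the useful parallelism. A minimal plain-Python sketch:

```python
from collections import defaultdict

# Group (K, V) pairs by key, the way groupByKey gathers values per key.
# A key that appears in no pair simply produces no group, so partitions
# (workers) holding no keys receive no data.
pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

print(dict(groups))  # {'a': [1, 3], 'b': [2], 'c': [4]}
```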

Re: Spark SQL - Exception only when using cacheTable

2014-10-11 Thread Cheng Lian
How was the table created? Would you mind sharing the related code? It seems that the underlying type of the |customer_id| field is actually long, but the schema says it’s integer; basically it’s a type mismatch error. The first query succeeds because |SchemaRDD.count()| is translated to

Re: spark-sql failing for some tables in hive

2014-10-11 Thread Cheng Lian
Hmm, the details of the error didn't show in your mail... On 10/10/14 12:25 AM, sadhan wrote: We have a Hive deployment on which we tried running spark-sql. When we try to do describe table_name for some of the tables, spark-sql fails with this: while it works for some of the other tables.

Re: RDD size in memory - Array[String] vs. case classes

2014-10-11 Thread Sean Owen
Yes, of course. If your number is 123456, then this takes 4 bytes as an int. But as a String in a 64-bit JVM you have an 8-byte reference, 4-byte object overhead, a char count of 4 bytes, and 6 2-byte chars. Maybe more I'm not thinking of. On Sat, Oct 11, 2014 at 6:29 AM, Liam Clarke-Hutchinson
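
Sean's back-of-the-envelope arithmetic for storing "123456" as a String (his estimates, which deliberately ignore JVM-specific padding and pooling) works out as follows:

```python
# Byte cost of "123456" as a java.lang.String on a 64-bit JVM,
# per Sean's estimate (not exact for every JVM configuration):
reference  = 8      # pointer to the String object
object_hdr = 4      # object header overhead
char_count = 4      # int field holding the length
chars      = 6 * 2  # six UTF-16 chars at 2 bytes each

string_bytes = reference + object_hdr + char_count + chars
int_bytes = 4       # the same value as a primitive int

print(string_bytes, int_bytes)  # 28 vs 4: roughly a 7x difference
```

This is why case classes with primitive fields can be markedly smaller in memory than Array[String] rows of the same data.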

Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you Sean. I'll try to do it externally as you suggested, however, can you please give me some hints on how to do that? In fact, where can I find the 1.2 implementation you just mentioned? Thanks! On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen so...@cloudera.com wrote: Plain old SVMs don't

Fwd: how to find the sources for spark-project

2014-10-11 Thread Sadhan Sood
-- Forwarded message -- From: Sadhan Sood sadhan.s...@gmail.com Date: Sat, Oct 11, 2014 at 10:26 AM Subject: Re: how to find the sources for spark-project To: Stephen Boesch java...@gmail.com Thanks, I still didn't find it - is it under some particular branch? More specifically,

Re: how to find the sources for spark-project

2014-10-11 Thread Ted Yu
I found this on computer where I built Spark: $ jar tvf /homes/hortonzy/.m2/repository//org/spark-project/hive/hive-exec/0.13.1/hive-exec-0.13.1.jar | grep ParquetHiveSerDe 2228 Mon Jun 02 12:50:16 UTC 2014 org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe$1.class 1442 Mon Jun 02
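
Since a jar is just a zip archive, Ted's `jar tvf ... | grep` check can be reproduced without the JDK at all. The sketch below builds a throwaway in-memory archive (the class paths are stand-ins mimicking the ones in the thread) and filters its entry list:

```python
import io
import zipfile

# Build a tiny in-memory "jar" (jars are plain zip archives) with two
# fake class entries, then search its listing the way grep would.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.class", b"")
    jar.writestr("org/example/Other.class", b"")

with zipfile.ZipFile(buf) as jar:
    hits = [name for name in jar.namelist() if "ParquetHiveSerDe" in name]

print(hits)
```

The same `zipfile.ZipFile(path).namelist()` call works directly on a real hive-exec jar on disk.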

How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread arthur.hk.c...@gmail.com
Hi, My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 subquery in my Spark SQL, below are my sample table structures and a SQL that contains more than 1 subquery. Question 1: How to load a HIVE table into Scala/Spark? Question 2: How to implement a
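
The multi-subquery pattern Arthur is asking about is plain SQL, so it can be exercised against any engine. A minimal sketch using in-memory SQLite (table and column names invented for illustration, not from his schema) with one subquery in FROM and one in WHERE:

```python
import sqlite3

# A query with two subqueries: an inline view in FROM (per-customer totals)
# and a scalar subquery in WHERE (the overall average order amount).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10), ("a", 30), ("b", 20), ("c", 5)])

row = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT customer, SUM(amount) AS total
        FROM orders GROUP BY customer
    ) t
    WHERE t.total > (SELECT AVG(amount) FROM orders)
""").fetchone()

print(row[0])  # 2 customers whose total exceeds the average amount (16.25)
```

In Spark SQL the same shape applies once the Hive table is registered (e.g. via a HiveContext); whether a given subquery form parses depends on the Spark version in use.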

RE: Spark SQL parser bug?

2014-10-11 Thread Mohammed Guller
I tried even without the “T” and it still returns an empty result: scala> val sRdd = sqlContext.sql("select a from x where ts = '2012-01-01 00:00:00';") sRdd: org.apache.spark.sql.SchemaRDD = SchemaRDD[35] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == Project [a#0] ExistingRdd
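
One plausible source of the empty result (an assumption about this thread, not its confirmed diagnosis): if the column is compared as a string rather than a timestamp, the match is lexicographic and exact, so a stored `'T'` separator never equals a literal written with a space:

```python
# String comparison of timestamps is byte-wise: the ISO 'T' separator
# (0x54) differs from a space (0x20), so equality fails and ordering
# comparisons are skewed.
stored  = "2012-01-01T00:00:00"   # value as stored, with 'T' separator
literal = "2012-01-01 00:00:00"   # literal as written in the query

print(stored == literal)  # False: the separators differ at index 10
print(stored > literal)   # True: 'T' sorts after ' ' lexicographically
```

Casting both sides to a proper timestamp type sidesteps the formatting mismatch entirely.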

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi Spark! I don't quite understand the semantics of RDDs in a streaming context yet. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers, in the docs? I've hacked together a custom stream, which is being opened and is consuming data
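
The receiver pattern behind a custom input stream reduces to: a background thread pulls records from a source into a buffer, and the framework drains that buffer into batches (which Spark then turns into RDDs). A framework-free sketch of that shape, not the actual Spark Receiver API:

```python
import queue
import threading

# A "receiver" thread pushes records into a buffer; the "framework"
# later drains the buffer into a batch. In Spark Streaming, each such
# batch becomes one RDD of the DStream.
buffer = queue.Queue()

def receive():
    for record in ["a", "b", "c"]:  # stand-in for a real network source
        buffer.put(record)

t = threading.Thread(target=receive)
t.start()
t.join()  # wait for the receiver to finish before draining

batch = []
while not buffer.empty():
    batch.append(buffer.get())

print(batch)  # ['a', 'b', 'c']
```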

Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
It's true that it is an implementation detail, but it's a very important one to document, because it can change results depending on when I use take or collect. The issue I was running into was when the executor had a different operating system than the driver, and I was

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql If you're wondering how to connect Tableau to SparkSQL - here are the steps to connect Tableau to SparkSQL.

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread Ilya Ganelin
Because of how closures work in Scala, there is no support for nested map/RDD-based operations. Specifically, if you have Context a { Context b { } }, operations within context b, when distributed across nodes, will no longer have visibility of variables specific to context a, because
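
The standard workaround for the nesting restriction Ilya describes is to restructure the computation as a join (or broadcast the small side) instead of touching one distributed dataset from inside an operation on another. Sketched with plain Python pairs standing in for two keyed RDDs:

```python
from collections import defaultdict

# Two keyed datasets standing in for RDDs of (key, value) pairs.
a = [(1, "x"), (2, "y")]
b = [(1, "p"), (2, "q"), (2, "r")]

# Disallowed shape in Spark: iterating over dataset b inside a map over
# dataset a. Allowed equivalent: index one side by key and join.
b_by_key = defaultdict(list)
for k, v in b:
    b_by_key[k].append(v)

joined = [(k, va, vb) for k, va in a for vb in b_by_key[k]]
print(joined)  # [(1, 'x', 'p'), (2, 'y', 'q'), (2, 'y', 'r')]
```

In Spark itself this corresponds to `a.join(b)` on pair RDDs, which produces the same (key, valueA, valueB) combinations without any nested RDD access.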

Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Davies Liu
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915 On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas evan.sama...@gmail.com wrote: It's true that it is an implementation detail, but it's a very important one to document because it has the possibility of changing results