Re: A proposal for Spark 2.0

2015-11-11 Thread Sean Owen
On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > to the Spark community. A major release should not be very different from a > minor release and should not be gated based on new features. The main > purpose of a major release is an opportunity to fix things that are

Support for local disk columnar storage for DataFrames

2015-11-11 Thread Cristian O
Hi, I was wondering if there's any planned support for local disk columnar storage. This could be an extension of the in-memory columnar store, or possibly something similar to the recently added local checkpointing for RDDs. This could also have the added benefit of enabling iterative usage for

Re: A proposal for Spark 2.0

2015-11-11 Thread Jonathan Kelly
If Scala 2.12 will require Java 8 and we want to enable cross-compiling Spark against Scala 2.11 and 2.12, couldn't we just make Java 8 a requirement if you want to use Scala 2.12? On Wed, Nov 11, 2015 at 9:29 AM, Koert Kuipers wrote: > i would drop scala 2.10, but definitely

Re: A proposal for Spark 2.0

2015-11-11 Thread Zoltán Zvara
Hi, Reconsidering the execution model behind Streaming would be a good candidate here, as Spark will not be able to provide the low latency and sophisticated windowing semantics that more and more use-cases will require. Maybe relaxing the strict batch model would help a lot. (Mainly this would

Re: A proposal for Spark 2.0

2015-11-11 Thread Koert Kuipers
I would drop Scala 2.10, but definitely keep Java 7. Cross build for Scala 2.12 is great, but I don't know how that works with the Java 8 requirement; I don't want to make Java 8 mandatory. And probably stating the obvious, but a lot of APIs got polluted due to the binary compatibility requirement. Cleaning

Re: A proposal for Spark 2.0

2015-11-11 Thread Koert Kuipers
Good point about dropping <2.2 for Hadoop; you don't want to deal with protobuf 2.4, for example. On Wed, Nov 11, 2015 at 4:58 AM, Sean Owen wrote: > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin wrote: > > to the Spark community. A major release should

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them into a Seq and use "reduce(_ ++ _)". Best Regards, Shixiong Zhu 2015-11-11 10:21 GMT-08:00 Jakob Odersky : > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workaround,
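The reduce(_ ++ _) idiom above generalizes to any number of inputs. A minimal sketch of the pattern with plain Scala collections standing in for RDDs (an illustration only; on a cluster each element would be an sc.textFile(...) RDD):

```scala
// Combine an arbitrary number of sources with a single reduce.
// A plain Seq stands in for RDD[String] here; the idiom is identical.
val parts: Seq[Seq[String]] = Seq(
  Seq("line1", "line2"), // contents of "file1" (hypothetical)
  Seq("line3"),          // contents of "file2"
  Seq("line4", "line5")  // contents of "file3"
)
val combined: Seq[String] = parts.reduce(_ ++ _)
// combined holds all five lines, in input order
```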

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff, Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple RDDs. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")
val rdd = lines1 union lines2

regards, --Jakob On 11

Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Although users can use the HDFS glob syntax to support multiple inputs, sometimes it is not convenient to do that. Not sure why there's no API SparkContext#textFiles. It should be easy to implement. I'd love to create a ticket and contribute for that if there's no other consideration
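The glob workaround mentioned above leans on Hadoop's path globbing, e.g. sc.textFile("/data/logs/2015-11-*/part-*"). A rough illustration of how such a pattern matches, using Java's built-in PathMatcher as a stand-in (Hadoop's glob rules are similar but not identical, and all paths here are hypothetical):

```scala
import java.nio.file.{FileSystems, Paths}

// One glob pattern can select many input files at once.
val matcher = FileSystems.getDefault
  .getPathMatcher("glob:/data/logs/2015-11-*/part-*")

val hit  = matcher.matches(Paths.get("/data/logs/2015-11-11/part-00000"))
val miss = matcher.matches(Paths.get("/data/other/part-00000"))
// hit is true; miss is false, since '*' does not cross directory boundaries
```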

Re: A proposal for Spark 2.0

2015-11-11 Thread Tim Preece
Considering Spark 2.x will run for 2 years, would moving up to Scala 2.12 (pencilled in for Jan 2016) make any sense? Although that would then pre-req Java 8.

Map Tasks - Disk I/O

2015-11-11 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Hi Pradeep, >>> Looks like what I was suggesting doesn't work. :/ I guess you mean putting comma-separated paths into one string and passing it to the existing API (SparkContext#textFile). It should not work. I suggest creating a new API, SparkContext#textFiles, to accept an array of strings. I have already

Re: Proposal for SQL join optimization

2015-11-11 Thread Xiao Li
Hi Zhan, That sounds really interesting! Please @ me when you submit the PR. If possible, please also post the performance difference. Thanks, Xiao Li 2015-11-11 14:45 GMT-08:00 Zhan Zhang : > Hi Folks, > > I did some performance measurement based on TPC-H

Re: Support for local disk columnar storage for DataFrames

2015-11-11 Thread Reynold Xin
Thanks for the email. Can you explain what the difference is between this and existing formats such as Parquet/ORC? On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote: > Hi, > > I was wondering if there's any planned support for local disk columnar > storage. >

Re: Choreographing a Kryo update

2015-11-11 Thread Reynold Xin
We should consider this for Spark 2.0. On Wed, Nov 11, 2015 at 2:01 PM, Steve Loughran wrote: > > > Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on > the fixes in Hive and, as the APIs are incompatible, resulted in that > mutant

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
I know these workaround, but wouldn't it be more convenient and straightforward to use SparkContext#textFiles ? On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra wrote: > For more than a small number of files, you'd be better off using > SparkContext#union instead of

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In
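Mark's point about lineage depth can be sketched with a toy model (hypothetical classes, not Spark's actual plan nodes): chaining binary RDD#union builds a left-deep tree whose depth grows with the number of inputs, while a single n-ary SparkContext#union keeps the tree flat:

```scala
// Toy lineage model: Leaf ~ an input RDD, Union2 ~ RDD#union,
// UnionN ~ SparkContext#union over a whole Seq of RDDs.
sealed trait Lineage { def depth: Int }
case class Leaf(name: String) extends Lineage { val depth = 1 }
case class Union2(l: Lineage, r: Lineage) extends Lineage {
  val depth = 1 + math.max(l.depth, r.depth)
}
case class UnionN(children: Seq[Lineage]) extends Lineage {
  val depth = 1 + children.map(_.depth).max
}

val inputs = (1 to 100).map(i => Leaf(s"file$i"))
val chained = inputs.reduce[Lineage](Union2(_, _)) // lines1 union lines2 union ...
val flat    = UnionN(inputs)                       // sc.union(rdds)
// chained.depth grows linearly with the input count; flat.depth stays constant
```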

Re: A proposal for Spark 2.0

2015-11-11 Thread hitoshi
It looks like Chill is willing to upgrade their Kryo to 3.x if Spark and Hive will. As it is now, Spark, Chill, and Hive each have a Kryo jar, but it really can't be used because Kryo 2 can't serialize/deserialize some classes. Since Spark 2.0 is a major release, it really would be nice if we could resolve the Kryo issue.

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on the Spark side we just need to provide an API for that. On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota wrote: > IIRC, TextInputFormat supports an input path that is a comma separated > list. I
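A minimal sketch of what the proposed SparkContext#textFiles could look like (hypothetical; not an existing Spark API): since the underlying TextInputFormat accepts a comma-separated path list, the new method could simply join the paths and delegate to the existing textFile:

```scala
// Join input paths into the comma-separated form TextInputFormat expects.
def joinInputPaths(paths: Seq[String]): String = {
  require(paths.nonEmpty, "at least one input path is required")
  paths.mkString(",")
}

// Hypothetical wrapper, assuming a live SparkContext `sc`:
// def textFiles(paths: Seq[String]): RDD[String] =
//   sc.textFile(joinInputPaths(paths))

val joined = joinInputPaths(Seq("hdfs:///a/part-*", "hdfs:///b/part-*"))
// joined is the single string "hdfs:///a/part-*,hdfs:///b/part-*"
```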

Re: A proposal for Spark 2.0

2015-11-11 Thread hitoshi
Resending my earlier message because it wasn't accepted. I would like to add a proposal to upgrade jars when they do not break APIs and fix a bug. To be more specific, I would like to see Kryo upgraded from 2.21 to 3.x. Kryo 2.x has a bug (e.g. SPARK-7708) that is blocking its usage in

Re: A proposal for Spark 2.0

2015-11-11 Thread Matei Zaharia
I like the idea of popping out Tachyon to an optional component too to reduce the number of dependencies. In the future, it might even be useful to do this for Hadoop, but it requires too many API changes to be worth doing now. Regarding Scala 2.12, we should definitely support it eventually,

Choreographing a Kryo update

2015-11-11 Thread Steve Loughran
Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on the fixes in Hive and, as the APIs are incompatible, resulted in that mutant spark-project/hive JAR needed for the Hive 1.2.1 support But: updating it hasn't been an option, because Spark needs to be in sync with

Proposal for SQL join optimization

2015-11-11 Thread Zhan Zhang
Hi Folks, I did some performance measurement based on TPC-H recently, and want to bring up some performance issues I observed. Both are related to cartesian join. 1. CartesianProduct implementation. Currently CartesianProduct relies on RDD.cartesian, in which the computation is realized as
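For context on why cartesian join is a performance hazard, a plain-Scala illustration (not Spark's implementation) of the blow-up: the output size is the product of the two input sizes, so it grows quadratically when both sides grow:

```scala
// Cartesian product: every left row pairs with every right row.
val left  = (1 to 500).toSeq
val right = (1 to 400).toSeq
val pairs = for (l <- left; r <- right) yield (l, r)
// pairs.size == left.size * right.size, i.e. 500 * 400 = 200000
```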