Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
cores there will be some contention and so they >> remain active for longer. >> >> So I think this is a test case issue configuring the number of executors >> too high. >> >> On 15 September 2015 at 18:54, Reynold Xin wrote: >> >>> Maybe we ca

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
experiment with the page size calculation to see what effect it has. > > Cheers, > > > > On 16 September 2015 at 06:53, Reynold Xin wrote: > >> It is exactly the issue here, isn't it? >> >> We are using memory / N, where N should be the maximum number of acti
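
A back-of-envelope sketch of the memory / N calculation under discussion (all numbers here are assumptions, purely for illustration):

    // Execution memory is divided across N concurrently active tasks, so
    // overestimating N shrinks each task's share and can surface
    // "unable to acquire memory" failures in memory-hungry operators.
    val executionMemoryBytes = 4L * 1024 * 1024 * 1024   // assume 4 GB
    val maxActiveTasks = 32                              // assume 32 concurrent tasks
    val perTaskShare = executionMemoryBytes / maxActiveTasks
    println(s"per-task share: ${perTaskShare / (1024 * 1024)} MB")  // 128 MB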

Re: RDD API patterns

2015-09-16 Thread Reynold Xin
I'm not sure what we can do here. Nested RDDs are a pain to implement, support, and explain. The programming model is not well explored. Maybe a UDAF interface that allows going through the data twice? On Mon, Sep 14, 2015 at 4:36 PM, sim wrote: > I'd like to get some feedback on an API design

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread Reynold Xin
Thanks Shane and Jon for the heads up. On Wednesday, September 16, 2015, shane knapp wrote: > good morning, denizens of the aether! > > your hard working build system (and some associated infrastructure) > has been in need of some updates and housecleaning for quite a while > now. we will be sp

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Reynold Xin
.scala:66) >> at org.apache.spark.scheduler.Task.run(Task.scala:88) >> at >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) >>

Re: SparkR streaming source code

2015-09-16 Thread Reynold Xin
You should reach out to the speakers directly. On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote: > SparkR streaming is mentioned at about page 17 in below pdf, can anyone > share source code? (could not find it on GitHub) > > > > https://spark-summit.org/2015-east/wp-content/uploads/2015/03/S

Re: And.eval short circuiting

2015-09-16 Thread Reynold Xin
o the second filter. Even weirder is that if you call collect() after the > first filter you won't see nulls, and if you write the data to disk and > reread it, the NPE won't happen. > > It's bewildering! Is this the intended behavior? >

Re: New Spark json endpoints

2015-09-16 Thread Reynold Xin
Do we need to increment the version number if it is just strict additions? On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen wrote: > Just wanted to bring this email up again in case there were any thoughts. > Having all the information from the web UI accessible through a supported > json API is ver

Re: RDD: Execution and Scheduling

2015-09-17 Thread Reynold Xin
Your understanding is mostly correct. Replies inline. On Thu, Sep 17, 2015 at 5:23 AM, gsvic wrote: > After reading some parts of Spark source code I would like to make some > questions about RDD execution and scheduling. > > At first, please correct me if I am wrong at the following: > 1) The n

Re: And.eval short circuiting

2015-09-17 Thread Reynold Xin
er should not reorder the > filters for correctness. Please correct me if I have an incorrect > assumption about the guarantees of the optimizer. > > Is there a bug filed that tracks the change you suggested below, btw? I’d > like to follow the issue, if there’s one. > > Thanks,

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
Maybe we should add some inline comment explaining why it is ok for that message to be not serializable. On Thu, Sep 17, 2015 at 4:08 AM, Huangguowei wrote: > Thanks for your reply. I just want to do some monitoring, never mind! > > > > *From:* Shixiong Zhu [mailto:zsxw...@gmail.com] > *Sent:* 201

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Reynold Xin
ards, > Shixiong Zhu > > 2015-09-18 15:10 GMT+08:00 Reynold Xin : > >> Maybe we should add some inline comment explaining why it is ok for that >> message to be not serializable. >> >> >> On Thu, Sep 17, 2015 at 4:08 AM, Huangguowei >> wrote: >&

Re: One element per node

2015-09-18 Thread Reynold Xin
Use a global atomic boolean and return nothing from that partition if the boolean is true. Note that your result won't be deterministic. On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander wrote: Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical no
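
A minimal sketch of the suggestion, assuming "global" means one flag per executor JVM (a singleton object is initialized once per JVM, not per task) and an existing SparkContext sc:

    import java.util.concurrent.atomic.AtomicBoolean

    object OneElementFlag {
      val taken = new AtomicBoolean(false)
    }

    val rdd = sc.parallelize(1 to 100, 8)
    // Emit at most one element per executor JVM; which element survives
    // depends on task scheduling, so the result is not deterministic.
    val onePerExecutor = rdd.mapPartitions { iter =>
      if (iter.hasNext && OneElementFlag.taken.compareAndSet(false, true))
        Iterator.single(iter.next())
      else
        Iterator.empty
    }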

Re: One element per node

2015-09-18 Thread Reynold Xin
global long value and get the element on partition only if > someFunction(partitionId, globalLong)==true? Or by using some specific > partitioner that creates such partitionIds that can be decomposed into > nodeId and number of partitions per node? > > > > *From:* Reynold Xin [mai

Re: spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Reynold Xin
Maybe you have a hdfs-site.xml lying around somewhere? On Sat, Sep 19, 2015 at 9:14 AM, Madhu wrote: > I downloaded spark-1.5.0-bin-hadoop2.6.tgz recently and installed on > CentOS. > All my local Spark code works fine locally. > > For some odd reason, spark-shell doesn't work in local mode. >

Re: BUILD SYSTEM: fire and power event at UC berkeley's IST colo, jenkins offline

2015-09-19 Thread Reynold Xin
Great! Jon / Shane: Thanks for handling this. On Saturday, September 19, 2015, shane knapp wrote: > we're up and building! time for breakfast... :) > > https://amplab.cs.berkeley.edu/jenkins/ > > On Sat, Sep 19, 2015 at 7:35 AM, shane knapp > wrote: > > it was definitely one of our servers..

Re: RDD: Execution and Scheduling

2015-09-20 Thread Reynold Xin
On Sun, Sep 20, 2015 at 3:58 PM, gsvic wrote: > Concerning answers 1 and 2: > > 1) How Spark determines a node as a "slow node" and how slow is that? > There are two cases here: 1. If a node is busy (e.g. all slots are already occupied), the scheduler cannot schedule anything on it. See "Delay

Re: Join operation on DStreams

2015-09-20 Thread Reynold Xin
stream.map(record => (keyFunction(record), record)) For future reference, this question should go to the user list, not dev list. On Sun, Sep 20, 2015 at 11:47 PM, guoxu1231 wrote: > Hi Spark Experts, > > I'm trying to use join(otherStream, [numTasks]) on DStreams, and it > requires called o
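
Spelled out, the pattern looks like this (Event and keyOf are hypothetical, for illustration only): join() is defined on key-value DStreams, so each side must be keyed first.

    import org.apache.spark.streaming.dstream.DStream

    case class Event(userId: String, payload: String)
    def keyOf(e: Event): String = e.userId

    def joinByKey(left: DStream[Event], right: DStream[Event])
        : DStream[(String, (Event, Event))] = {
      val leftKeyed = left.map(e => (keyOf(e), e))    // DStream[(String, Event)]
      val rightKeyed = right.map(e => (keyOf(e), e))
      leftKeyed.join(rightKeyed)
    }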

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception is better or null is better depends on your use case. If you are debugging and want to find bugs with your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is dirt

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, which does spill (switching from hash-based aggregation to sort-based aggregation). On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote: > Hi everyone, > > I’m debugging some slowness and apparent memory pressure
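
For reference, a minimal way to inspect which aggregate operator was planned (assuming a Spark 1.5 sqlContext and an existing DataFrame df):

    import org.apache.spark.sql.functions.sum

    // Prints the physical plan; in 1.5 it should contain TungstenAggregate,
    // which switches from hash-based to sort-based aggregation (spilling)
    // under memory pressure.
    df.groupBy("key").agg(sum("value")).explain()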

Re: Why Filter return a DataFrame object in DataFrame.scala?

2015-09-22 Thread Reynold Xin
There is an implicit conversion in scope https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L153 /** * An implicit conversion function internal to this class for us to avoid doing * "new DataFrame(...)" everywhere. */ @inline pri
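
A self-contained analogue of the pattern (the types here are hypothetical stand-ins, not the actual Spark source): an implicit conversion in class scope turns the plan produced by each operator back into the wrapper type, so every method can be declared to return it without writing `new` everywhere.

    case class Plan(sql: String)

    class Frame(val plan: Plan) {
      @inline private implicit def toFrame(p: Plan): Frame = new Frame(p)

      // The body yields a Plan; the implicit conversion supplies the Frame.
      def filter(predicate: String): Frame =
        Plan(s"${plan.sql} WHERE $predicate")
    }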

[VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.1 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I forked a new thread for this. Please discuss NOTICE file related things there so it doesn't hijack this thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: > > Under your guidance, I would be happy to help compile a NOTICE f

[Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Reynold Xin
Richard, Thanks for bringing this up and this is a great point. Let's start another thread for it so we don't hijack the release thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: > > Under your guidance, I would be happy t

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I'm going to +1 this myself. Tested on my laptop. On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: > I forked a new thread for this. Please discuss NOTICE file related things > there so it doesn't hijack this thread. > > > On Thu, Sep 24, 2015 at 10:51 AM, Sean

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
getY work now (Thanks). Slower; saturates the CPU. A > non-scientific snapshot below. I know that this really has to be done more > rigorously, on a bigger machine, with more cores et al.. > [image: Inline image 1] [image: Inline image 2] > > On Thu, Sep 24, 2015 at 12:27 AM, Reynold X

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-27 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 17 +1 votes and 1 -1 vote. I will work on packaging this asap. +1: Reynold Xin* Sean Owen Hossein Falaki Xiangrui Meng* Krishna Sankar Joseph Bradley Sean McNamara* Luciano Resende Doug Balog Eugene Zhu

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Reynold Xin
You can pass the schema into json directly, can't you? On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith wrote: > Hi all, > > > > We really like the ability to infer a schema from JSON contained in an > RDD, but when we’re using Spark Streaming on small batches of data, we > sometimes find that Spark
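
A minimal sketch of the suggestion (the field names/types and the jsonRdd: RDD[String] input are assumptions): supplying an explicit schema skips inference on each streaming batch, and with it the type-conflict surprises.

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))

    val df = sqlContext.read.schema(schema).json(jsonRdd)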

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1 http://spark.apach

Re: [Build] repo1.maven.org: spark libs 1.5.0 for scala 2.10 poms are broken (404)

2015-10-02 Thread Reynold Xin
Both work for me. It's possible maven.org is having problems with some servers. On Fri, Oct 2, 2015 at 11:08 AM, Ted Yu wrote: > Andy: > 1.5.1 has been released. > > Maybe you can use this: > > https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.5.1/spark-streaming_2.10-1.5.1

Re: Python UDAFs

2015-10-02 Thread Reynold Xin
No, not yet. On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang wrote: > Hi, > > Is there a Python API for UDAFs? > > Thanks! > > Justin >

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
You can write the data to local hdfs (or local disk) and just load it from there. On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote: > Thanks for your suggestion Ted. > > Unfortunately at this point of time I cannot go beyond 1000 partitions. I > am writing this data to BigQuery and it has a limit of

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
t; Jegan > > On Mon, Oct 5, 2015 at 4:42 PM, Reynold Xin wrote: > >> You can write the data to local hdfs (or local disk) and just load it >> from there. >> >> >> On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote: >> >>> Thanks for your suggestion Ted. >

Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that comma is actually a legitimate character in file names, and as a result ... On Tuesday, October 6, 2015, Josh Rosen wrote: > Could someone please file a JIRA to track this? > https://issues.apache.org/jira/browse/SPARK > > On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers

multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is very inferior in terms of performance and robustness, and it is also hard to guarantee correctness of the implementation in some of the refactorings for Tungsten. Supporting a better version of it is possible in the future,

Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
(distinct colA, colB) from foo; On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin wrote: > The current implementation of multiple count distinct in a single query is > very inferior in terms of performance and robustness, and it is also hard > to guarantee correctness of the implementation in so

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "dev@spark.apache.org" To provide more context, if we do remove this feature, the following SQL query woul

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-08 Thread Reynold Xin
The problem only applies to the sbt build because it treats warnings as errors. @Iulian - how about we disable warnings -> errors for 2.11? That would seem better until we switch 2.11 to be the default build. On Thu, Oct 8, 2015 at 7:55 AM, Ted Yu wrote: > I tried building with Scala 2.11 on L
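
One way to express that in sbt (a sketch; the exact flags in Spark's build are an assumption):

    // Drop -Xfatal-warnings only when compiling against Scala 2.11.
    scalacOptions ++= {
      if (scalaBinaryVersion.value == "2.11") Seq.empty
      else Seq("-Xfatal-warnings")
    }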

Re: spark over drill

2015-10-08 Thread Reynold Xin
You probably saw that in a presentation given by the drill team. You should check with them on that. On Thu, Oct 8, 2015 at 11:51 AM, Pranay Tonpay wrote: > hi, > Is spark-drill integration already done? if yes, which spark version > supports it ... it was in the "upcoming list for 2015" is wh

a few major changes / improvements for Spark 1.6

2015-10-12 Thread Reynold Xin
Hi Spark devs, It is hard to track everything going on in Spark with so many pull requests and JIRA tickets. Below are 4 major improvements that will likely be in Spark 1.6. We have already done prototyping for all of them, and want feedback on their design. 1. SPARK-9850 Adaptive query executio

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new SparkContext(conf) >

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have clean schemas and can have duplicate column names. For the case of join, maybe a better solution is to ask for a left/right prefix/suffix in user code, similar to what Pandas does. On Wed,
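
A sketch of the Pandas-style suffix idea done in user code today (the column names are assumptions): rename both sides before joining so no output names collide.

    val left = df1.toDF(df1.columns.map(_ + "_x"): _*)
    val right = df2.toDF(df2.columns.map(_ + "_y"): _*)
    val joined = left.join(right, left("id_x") === right("id_y"))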

Re: Spark Implicit Functions

2015-10-16 Thread Reynold Xin
Thanks for sharing, Bill. On Fri, Oct 16, 2015 at 2:06 PM, Bill Bejeck wrote: > All, > > I just did a post on adding groupByKeyToList and groupByKeyUnique using > implicit classes. I thought it might be useful to someone. > > http://codingjunkie.net/learning-scala-implicits-with-spark/ > > Tha

flaky test "map stage submission with multiple shared stages and failures"

2015-10-17 Thread Reynold Xin
I just saw this happening: [info] - map stage submission with multiple shared stages and failures *** FAILED *** (566 milliseconds) [info] java.lang.IndexOutOfBoundsException: 2 [info] at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) [info] at scala.collection.

Re: MapStatus too large for drvier

2015-10-20 Thread Reynold Xin
How big is your driver heap size? And any reason why you'd need 200k map and 200k reduce tasks? On Mon, Oct 19, 2015 at 11:59 PM, yaoqin wrote: > Hi everyone, > > When I run a spark job that contains quite a lot of tasks (in my case > 200,000*200,000), the driver ran into an OOM mainly caused by t

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
Jerry - I think that's been fixed in 1.5.1. Do you still see it? On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > I disabled it because of the "Could not acquire 65536 bytes of memory". It > happens to fail the job. So for now, I'm not touching it. > > On Tue, Oct 20, 2015 at 4:48 PM, charmee

Fwd: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
With Jerry's permission, sending this back to the dev list to close the loop. -- Forwarded message -- From: Jerry Lam Date: Tue, Oct 20, 2015 at 3:54 PM Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ... To: Reynold Xin Yup, coarse grained mode works just

Re: Exception when using cosh

2015-10-21 Thread Reynold Xin
I think we made a mistake and forgot to register the function in the registry: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala Do you mind submitting a pull request to fix this? Should be a one-line change. I fi

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Reynold Xin
at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:56) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339) > > > On Tue, Oct 20, 2015 at 9:10 PM, Reynold Xin wrote: > >> With Jerry&#x

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread Reynold Xin
Why do you do a glom? It seems unnecessarily expensive to materialize each partition in memory. On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote: > Hi, spark community > I have an application which I try to migrate from MR to Spark. > It will do some calculations from Hive and output to h

[VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
kar > wrote: > >> Guys, >>The sc.version returns 1.5.1 in python and scala. Is anyone getting >> the same results ? Probably I am doing something wrong. >> Cheers >> >> >> On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin > > wrote: >> >>

Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try count(distinct columnName). In SQL, distinct is not part of the function name. On Tuesday, October 27, 2015, Shagun Sodhani wrote: > Oops seems I made a mistake. The error message is : Exception in thread > "main" org.apache.spark.sql.AnalysisException: undefined function > countDistinct > On
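
The working spellings, for reference (the table foo and DataFrame df are assumptions):

    // SQL: DISTINCT is a modifier inside the aggregate call.
    sqlContext.sql("SELECT count(DISTINCT a) FROM foo").show()

    // DataFrame API: here countDistinct is a function name.
    import org.apache.spark.sql.functions.countDistinct
    df.agg(countDistinct("a")).show()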

Re: Pickle Spark DataFrame

2015-10-28 Thread Reynold Xin
What are you trying to accomplish to pickle a Spark DataFrame? If your dataset is large, it doesn't make much sense to pickle it. If your dataset is small, maybe it's best to just pickle a Pandas dataframe. On Tue, Oct 27, 2015 at 9:47 PM, agg212 wrote: > Hi, I'd like to "pickle" a Spark DataFr

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
)""") >>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show() >>>>>> >>>>>> Cheers >>>>>> >>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sod

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
ctions$> > to > be treated as sql operators as well? I do see that these are mentioned as > Functions > available for DataFrame > <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html> > but > it would be great if you can clarify this. >

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
>>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) >>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) >>> at >>> org.apache.spark.rdd.MapPartitionsWithPreparati

Re: Unchecked contribution (JIRA and PR)

2015-11-03 Thread Reynold Xin
Sergio, Usually it takes a lot of effort to get something merged into Spark itself, especially for relatively new algorithms that might not have established itself yet. I will leave it to mllib maintainers to comment on the specifics of the individual algorithms proposed here. Just another genera

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
I don't think there is any special handling w.r.t. Tachyon vs in-heap caching. As a matter of fact, I think the current offheap caching implementation is pretty bad, because: 1. There is no namespace sharing in offheap mode 2. Similar to 1, you cannot recover the offheap memory once Spark driver o

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
am using tachyon for caching, if an executor is lost, then > that partition is lost for the purposes of spark? > > On Tue, Nov 3, 2015 at 5:53 PM Reynold Xin wrote: > >> I don't think there is any special handling w.r.t. Tachyon vs in-heap >> caching. As a matte

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Reynold Xin
contexts, > but where the notebooks can be idle for long periods of time while holding > onto cached rdds. > > On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin wrote: > >> It is lost unfortunately (although can be recomputed automatically). >> >> >> On Tue, Nov 3

[VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-03 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The r

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
ecutor, causing entire > stages to be retried. In fine-grained mode, only the task fails and > subsequently gets retried without taking out an entire stage or worse. > > On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin wrote: > >> If you are using Spark with Mesos fine grained mode, can yo

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data from in-memory binary f

Re: Sort Merge Join from the filesystem

2015-11-04 Thread Reynold Xin
It's not supported yet, and not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either. On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > (this is kind of a cross-post from the user list) > > Does Spark support doi

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? e.g. df1.join(broadcast(df2)) the broadcast function is in org.apache.spark.sql.functions On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has statistics of that t
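
The hint spelled out (the join key is an assumption):

    import org.apache.spark.sql.functions.broadcast

    // Forces a broadcast (map-side) join of df2 regardless of statistics.
    val joined = df1.join(broadcast(df2), "key")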

Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
n dataframe api. > > On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin wrote: > >> Can you use the broadcast hint? >> >> e.g. >> >> df1.join(broadcast(df2)) >> >> the broadcast function is in org.apache.spark.sql.functions >> >> >> >

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Reynold Xin
You can hack around this by constructing logical plans yourself and then creating a DataFrame in order to execute them. Note that this is all depending on internals of the framework and can break when Spark upgrades. On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska wrote: > I don't think a view wo

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the driver. I could use

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Reynold Xin
on using yesterday's code, > build locally. > > Regression running in Yarn Cluster mode against few internal ML ( logistic > regression, linear regression, random forest and statistic summary) as well > Mlib KMeans. all seems to work fine. > > Chester > > > On Tue,

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-08 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 14 +1 votes and no -1 vote. I will work on packaging this asap. +1: Jean-Baptiste Onofré Egor Pahomov Luc Bourlier Tom Graves* Chester Chen Michael Armbrust* Krishna Sankar Robin East Reynold Xin* Joseph Br

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.x users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 http://spark.apach

A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature request in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ... First - I want to propose a premise for how to think about S

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > > > 3. Assembly-free distribution of Spark: don’t require building an > enormous assembly jar in order to run Spark. > > Could you elaborate a bit on this? I'm not sure what an assembly-free > distribution mea

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
n 3 transition because the upgrade process was too > painful > >>> for too long. The Spark community will benefit greatly from our > explicitly > >>> looking to avoid a similar situation. > >>> > >>> > 3. Assembly-free distribution of

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
; be surprised if mapPartitionsWithContext was baked into a number of apps) >> and merit a little extra consideration. >> >> Maybe also obvious, but I think a migration guide with API equivlents and >> the like would be incredibly useful in easing the transition. >> >

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
be that way" category when > it comes time to discuss the Spark 3.0 release? I'd be fine if we don't > try at all to anticipate what is needed -- working from the premise that > being forced into a 3.x release earlier than we expect would be less > painful than trying to

Re: Choreographing a Kryo update

2015-11-11 Thread Reynold Xin
We should consider this for Spark 2.0. On Wed, Nov 11, 2015 at 2:01 PM, Steve Loughran wrote: > > > Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on > the fixes in Hive and, as the APIs are incompatible, resulted in that > mutant spark-project/hive JAR needed for the H

Re: Support for local disk columnar storage for DataFrames

2015-11-11 Thread Reynold Xin
Thanks for the email. Can you explain what the difference is between this and existing formats such as Parquet/ORC? On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote: > Hi, > > I was wondering if there's any planned support for local disk columnar > storage. > > This could be an extension of th

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
It only runs tests that are impacted by the change. E.g. if you only modify SQL, it won't run the core or streaming tests. On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu wrote: > Hi, > I noticed that SparkPullRequestBuilder completes much faster than maven > Jenkins build. > > From > https://amplab.c

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
ned and enabled again ? > > Cheers > > On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin wrote: > >> It only runs tests that are impacted by the change. E.g. if you only >> modify SQL, it won't run the core or streaming tests. >> >> >> On Fri, Nov 1

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
I actually tried to build a binary for 1.4.2 and wanted to start voting, but there was an issue with the release script that failed the jenkins job. Would be great to kick off a 1.4.2 release. On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee wrote: > Hi All, > > > I'm wondering if Spark 1.4.2 had be

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
In the interim, you can just build it off branch-1.4 if you want. On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin wrote: > I actually tried to build a binary for 1.4.2 and wanted to start voting, > but there was an issue with the release script that failed the jenkins job. > Would be

Re: Are map tasks spilling data to disk?

2015-11-15 Thread Reynold Xin
It depends on what the next operator is. If the next operator is just an aggregation, then no, the hash join won't write anything to disk. It will just stream the data through to the next operator. If the next operator is shuffle (exchange), then yes. On Sun, Nov 15, 2015 at 10:52 AM, gsvic wrote

Re: Support for local disk columnar storage for DataFrames

2015-11-15 Thread Reynold Xin
the compact columnar representation and Tungsten >> optimisations. >> >> I'm not quite sure if something like this can be achieved by other means >> or has been investigated before, hence why I'm looking for feedback here. >> >> While one could use extern

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path. On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote: > I would like to know if Hive on Spark uses or shares the execution code > with Spark SQL or DataFrames? > > More specifically, does Hive on Spark benefit from the changes made to > Spark SQL, project Tung

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make shuffle more robust. On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar wrote: > So does not benefit from Project Tungsten right? > > > On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin wrote: > >> It

Re: orc read issue n spark

2015-11-18 Thread Reynold Xin
What do you mean by "starts delay scheduling"? Are you saying it is no longer doing local reads? If that's the case you can increase the spark.locality.wait timeout. On Wednesday, November 18, 2015, Renu Yadav wrote: > Hi , > I am using spark 1.4.1 and saving orc file using > df.write.format("orc
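
For reference, a sketch of raising the locality wait (the value is an example):

    // How long the scheduler waits for a node-local slot before falling
    // back to less-local placement.
    val conf = new org.apache.spark.SparkConf().set("spark.locality.wait", "10s")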

Re: How to Add builtin geometry type to SparkSQL?

2015-11-18 Thread Reynold Xin
Have you looked into https://github.com/harsha2010/magellan ? On Wednesday, November 18, 2015, ddcd wrote: > Hi all, > > I'm considering adding geometry type to SparkSQL. > > I know that there is a project named sparkGIS > which is an add-on of sparkSQL. Th

Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Reynold Xin
I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think everybody is for that. https://issues.apache.org/jira/browse/SPARK-11807 Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to say, keep only Hadoop 2.6 and greater. What are the community's th

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Reynold Xin
OK I'm not exactly asking for a vote here :) I don't think we should look at it from only maintenance point of view -- because in that case the answer is clearly supporting as few versions as possible (or just rm -rf spark source code and call it a day). It is a tradeoff between the number of user

Re: Using spark MLlib without installing Spark

2015-11-21 Thread Reynold Xin
You can use MLlib and Spark directly without "installing anything". Just run Spark in local mode. On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski wrote: > Bowen, > > What Andy is doing in the notebook is a slightly different thing. He’s > using sbt to bring all spark jars (core, mllib, repl, wh
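
A minimal sketch of the local-mode approach (only the spark-core and spark-mllib jars on the classpath are assumed):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(
      new SparkConf().setAppName("local-mllib").setMaster("local[*]"))
    val points = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
    println(points.count())
    sc.stop()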

Re: Datasets on experimental dataframes?

2015-11-23 Thread Reynold Xin
The experimental tag is intended for user facing APIs. It has nothing to do with internal dependencies. On Monday, November 23, 2015, Jakob Odersky wrote: > Hi, > > datasets are being built upon the experimental DataFrame API, does this > mean DataFrames won't be experimental in the near future?

Re: why does shuffle in spark write shuffle data to disk by default?

2015-11-23 Thread Reynold Xin
I think for most jobs the bottleneck isn't in writing shuffle data to disk, since shuffle data needs to be "shuffled" and sent across the network. You can always use a ramdisk yourself. Requiring ramdisk by default would significantly complicate configuration and platform portability. On Mon, No
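
For completeness, a sketch of the ramdisk option (the mount path is an assumption):

    // Point shuffle/spill files at a tmpfs mount instead of regular disk.
    val conf = new org.apache.spark.SparkConf().set("spark.local.dir", "/mnt/ramdisk")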

Re: A proposal for Spark 2.0

2015-11-23 Thread Reynold Xin
t;> about. I think it may be going too far to deprecate it, but changing >>>>> emphasis is something that we might consider. The RDD API came well >>>>> before >>>>> DataFrames and DataSets, so programming guides, introductory how-to >>>>>

Re: A proposal for Spark 2.0

2015-11-25 Thread Reynold Xin
breaks compatibility in the move to 2.0 makes it much more >> difficult for them to make this transition. >> >> Using the same set of APIs also means that it will be easier to backport >> critical fixes to the 1.x line. >> >> It's not clear to me that avoiding

Re: A proposal for Spark 2.0

2015-11-26 Thread Reynold Xin
gt;> want to upgrade to the stable version of the Dataset / DataFrame API so >>> they don't need to do so twice. Requiring that they absorb all the other >>> ways that Spark breaks compatibility in the move to 2.0 makes it much more >>> difficult for them to make thi
