Re: And.eval short circuiting

2015-09-17 Thread Reynold Xin
Please file a ticket and cc me. Thanks. On Thu, Sep 17, 2015 at 11:20 PM, Mingyu Kim wrote: > That sounds good. I think the optimizer should not change the behavior of > execution and reordering the filters can easily result in errors as > exemplified below. I agree that the optimizer should no

Re: And.eval short circuiting

2015-09-17 Thread Mingyu Kim
That sounds good. I think the optimizer should not change the behavior of execution and reordering the filters can easily result in errors as exemplified below. I agree that the optimizer should not reorder the filters for correctness. Please correct me if I have an incorrect assumption about th
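
A minimal sketch of why filter order matters when a conjunction short-circuits. This is plain Scala, not Spark's And.eval or the optimizer; the Row type, the predicates, and the `and` helper are hypothetical names made up for illustration.

```scala
import scala.util.Try

object FilterOrderSketch {
  // Hypothetical row type, used only for this illustration.
  final case class Row(numerator: Int, denominator: Int)

  // Guard predicate: safe to evaluate on every row.
  val nonZero: Row => Boolean = r => r.denominator != 0

  // Throws ArithmeticException when the denominator is 0.
  val ratioAboveOne: Row => Boolean = r => r.numerator / r.denominator > 1

  // Conjunction that evaluates predicates left to right and stops at the
  // first one that returns false (Seq.forall short-circuits).
  def and(preds: Seq[Row => Boolean]): Row => Boolean =
    row => preds.forall(p => p(row))

  val rows: Seq[Row] = Seq(Row(4, 2), Row(1, 0), Row(1, 3))

  // Guard first: the division never sees a zero denominator.
  val guarded: Try[Seq[Row]] =
    Try(rows.filter(and(Seq(nonZero, ratioAboveOne))))

  // Same predicates, reordered: the division runs on Row(1, 0) and throws.
  val reordered: Try[Seq[Row]] =
    Try(rows.filter(and(Seq(ratioAboveOne, nonZero))))
}
```

Under these assumptions, `guarded` succeeds and `reordered` fails, even though both conjunctions contain the same predicates.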

Re: RDD: Execution and Scheduling

2015-09-17 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine that a node is a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and by which criteria? Could you please also include links to the source files for the two questions above? -- View this message in c

Re: JDBC Dialect tests

2015-09-17 Thread Luciano Resende
Thanks Reynold. Also, what is the status of the associated PR? Are we planning to merge it soon? This will help me with the Db2 dialect test framework using Docker. Thanks [1] https://github.com/apache/spark/pull/8101 On Mon, Sep 14, 2015 at 1:47 PM, Reynold Xin wrote: > SPARK-9818 you link t

Re: RDD API patterns

2015-09-17 Thread Debasish Das
RDD nesting can lead to recursive nesting... I would like to know the use case and why a join can't support it... You can always expose an API over an RDD and access it in another RDD's mapPartitions... Use an external data source like HBase, Cassandra, or Redis to support the API... For your case group by and th

Re: RDD: Execution and Scheduling

2015-09-17 Thread Reynold Xin
Your understanding is mostly correct. Replies inline. On Thu, Sep 17, 2015 at 5:23 AM, gsvic wrote: > After reading some parts of the Spark source code I would like to ask some > questions about RDD execution and scheduling. > > First, please correct me if I am wrong about the following: > 1) The n

Re: New Spark json endpoints

2015-09-17 Thread Kevin Chen
Thank you all for the feedback. I’ve created a corresponding JIRA ticket at https://issues.apache.org/jira/browse/SPARK-10565, updated with a summary of this thread. From: Mark Hamstra Date: Thursday, September 17, 2015 at 8:00 AM To: Imran Rashid Cc: Kevin Chen , "dev@spark.apache.org" , Ma

Re: [MLlib] BinaryLogisticRegressionSummary on test set

2015-09-17 Thread Feynman Liang
We have kept that private because we need to decide on a name for the method which evaluates on a test set (see the TODO comment); perhaps you could push for this to happen by creating a JIRA and pinging jkb

[MLlib] BinaryLogisticRegressionSummary on test set

2015-09-17 Thread Hao Ren
Working on spark.ml.classification.LogisticRegression.scala (Spark 1.5), it might be useful if we could create a summary for any given dataset, not just the training set. Currently, BinaryLogisticRegressionTrainingSummary is only created when the model is computed from the training set. As usual, we need to

Re: New Spark json endpoints

2015-09-17 Thread Mark Hamstra
While we're at it, adding endpoints that get results by jobGroup (cf. SparkContext#setJobGroup) instead of just for a single Job would also be very useful to some of us. On Thu, Sep 17, 2015 at 7:30 AM, Imran Rashid wrote: > Hi Kevin, > > I think it would be great if you added this. It never go

Re: New Spark json endpoints

2015-09-17 Thread Imran Rashid
Hi Kevin, I think it would be great if you added this. It never got added in the first place b/c the original PR was already pretty bloated, and we just never got back to this. I agree with Reynold -- you shouldn't need to increase the version for just adding new endpoints (or even adding new field

RDD: Execution and Scheduling

2015-09-17 Thread gsvic
After reading some parts of the Spark source code I would like to ask some questions about RDD execution and scheduling. First, please correct me if I am wrong about the following: 1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is repartitio

Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Huangguowei
Thanks for your reply. I just wanted to do some monitoring, never mind! From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: September 17, 2015 17:23 To: Huangguowei; dev@spark.apache.org Subject: Re: bug in Worker.scala, ExecutorRunner is not serializable RequestWorkerState is an internal message between Worker a

Re: QueueStream doesn't support checkpoint makes it difficult to do unit test

2015-09-17 Thread Bin Wang
Never mind. I've found a PR, and it has been merged: https://github.com/apache/spark/pull/8624/commits On Thu, Sep 17, 2015 at 4:50 PM, Bin Wang wrote: > I'm using Spark Streaming with updateStateByKey, which forces the use of > checkpointing. In my unit test, I create a queueStream to test with. But in Spark > 1.5, QueueStream wi

Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Shixiong Zhu
RequestWorkerState is an internal message between Worker and WorkerWebUI. Since they are in the same process, that's fine. Actually, these are not public APIs. Could you elaborate your use case? Best Regards, Shixiong Zhu 2015-09-17 16:36 GMT+08:00 Huangguowei : > > > Is it possible to get Execu
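
A minimal sketch of why a message holding a non-serializable object is harmless inside one process but fails once it has to be serialized for a remote receiver. It uses plain Java serialization; Runner and StateResponse are hypothetical stand-ins, not Spark's ExecutorRunner or WorkerStateResponse, and the real transport may use a different serializer.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

object SerializationSketch {
  // Hypothetical stand-in for a class that does not extend Serializable.
  class Runner(val id: String)

  // Hypothetical message type. Case classes are Serializable, but writing one
  // also writes every object it references.
  final case class StateResponse(host: String, runners: List[Runner])

  // Attempts Java serialization, which is what sending to another JVM
  // would require in this sketch.
  def serialize(obj: AnyRef): Try[Array[Byte]] = Try {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  // Passing the message by reference within the same JVM needs no serialization.
  def inProcess(msg: StateResponse): Int = msg.runners.size
}
```

In this sketch, `inProcess` works for any message, while `serialize` fails with a NotSerializableException as soon as the list contains a `Runner`.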

QueueStream doesn't support checkpoint makes it difficult to do unit test

2015-09-17 Thread Bin Wang
I'm using Spark Streaming with updateStateByKey, which forces the use of checkpointing. In my unit test, I create a queueStream to test with. But in Spark 1.5, QueueStream throws an exception when it is used with checkpointing, which makes unit testing difficult. Is there an option to disable this? Though I k

Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Huangguowei
Is it possible to get the executors' status while an application is running? From: Sean Owen [mailto:so...@cloudera.com] Sent: September 17, 2015 15:54 To: Huangguowei; Dev Subject: Re: bug in Worker.scala, ExecutorRunner is not serializable Did this cause an error for you? On Thu, Sep 17, 2015, 8:51 AM Huangguowei

Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Huangguowei
There is no error in the normal case. But if I ask the Worker through akkaUrl for the executors' status, it causes an exception. From: Sean Owen [mailto:so...@cloudera.com] Sent: September 17, 2015 15:54 To: Huangguowei; Dev Subject: Re: bug in Worker.scala, ExecutorRunner is not serializable Did this cause an err

Re: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Sean Owen
Did this cause an error for you? On Thu, Sep 17, 2015, 8:51 AM Huangguowei wrote: > > > In Worker.scala line 480: > > > > case RequestWorkerState => > > sender ! WorkerStateResponse(host, port, workerId, > executors.values.toList, > > finishedExecutors.values.toList, drivers.va

bug in Worker.scala, ExecutorRunner is not serializable

2015-09-17 Thread Huangguowei
In Worker.scala line 480: case RequestWorkerState => sender ! WorkerStateResponse(host, port, workerId, executors.values.toList, finishedExecutors.values.toList, drivers.values.toList, finishedDrivers.values.toList, activeMasterUrl, cores, memory, coresUsed, mem

how to send additional configuration to the RDD after it was lazily created

2015-09-17 Thread Gil Vernik
Hi, I have the following case, which I am not sure how to resolve. My code uses HadoopRDD and creates various RDDs on top of it (MapPartitionsRDD, and so on). After all RDDs were lazily created, my code "knows" some new information and I want the "compute" method of the HadoopRDD to be awar