Re: And.eval short circuiting

2015-09-18 Thread Mingyu Kim
That sounds good. I think the optimizer should not change the behavior of execution, and reordering the filters can easily result in errors, as exemplified below. I agree that the optimizer should not reorder the filters, for correctness. Please correct me if I have an incorrect assumption about
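The hazard under discussion can be sketched in plain Scala (a local illustration, not Catalyst code; the data and predicates are made up): a null guard only protects the right-hand conjunct if conjuncts are evaluated left to right, so reordering can turn a safe filter into a crashing one.

```scala
// Local sketch: why reordering conjunctive filters is not behavior-preserving.
val rows: Seq[(String, Int)] = Seq(("abc", 3), (null, 0))

// Safe: the null check guards the length call, relying on && short-circuiting.
val safe = rows.filter { case (s, _) => s != null && s.length > 2 }

// Reordered by a hypothetical optimizer -- would throw NullPointerException:
// rows.filter { case (s, _) => s.length > 2 && s != null }

assert(safe == Seq(("abc", 3)))
```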

Re: And.eval short circuiting

2015-09-18 Thread Reynold Xin
Please file a ticket and cc me. Thanks. On Thu, Sep 17, 2015 at 11:20 PM, Mingyu Kim wrote: > That sounds good. I think the optimizer should not change the behavior of > execution and reordering the filters can easily result in errors as > exemplified below. I agree that the

Re: And.eval short circuiting

2015-09-18 Thread Mingyu Kim
I filed SPARK-10703. Thanks! Mingyu From: Reynold Xin Date: Thursday, September 17, 2015 at 11:22 PM To: Mingyu Kim Cc: Zack Sampson, "dev@spark.apache.org", Peter Faiman, Matt Cheah, Michael Armbrust Subject: Re: And.eval short circuiting Please file a ticket and cc me. Thanks. On

Re: One element per node

2015-09-18 Thread Reynold Xin
Use a global atomic boolean and return nothing from that partition if the boolean is true. Note that your result won't be deterministic. On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander wrote: Thank you! How can I guarantee that I have only one element per executor (per
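A minimal local sketch of the atomic-boolean idea (the names are mine; on a real cluster the flag lives in each executor's JVM rather than being truly global, which is exactly why the result is nondeterministic):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Shared flag: the first partition to win the compare-and-set emits one element.
val taken = new AtomicBoolean(false)

def onePerFlag[T](part: Iterator[T]): Iterator[T] =
  if (part.hasNext && taken.compareAndSet(false, true)) Iterator(part.next())
  else Iterator.empty

// On a real RDD this would be:  rdd.mapPartitions(onePerFlag)
// Local simulation with two "partitions":
val local = Seq(Iterator(1, 2), Iterator(3, 4)).flatMap(onePerFlag)
assert(local == Seq(1))
```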

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Sounds interesting! Is it possible to make it deterministic by using global long value and get the element on partition only if someFunction(partitionId, globalLong)==true? Or by using some specific partitioner that creates such partitionIds that can be decomposed into nodeId and number of
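The deterministic variant Alexander proposes can be sketched locally like this (`someFunction` and the modulus are purely illustrative; the commented line shows how it would look with `mapPartitionsWithIndex` on a real RDD):

```scala
// Illustrative stand-in for someFunction(partitionId, globalLong):
// a pure predicate of the partition id, so the selection is deterministic.
def someFunction(partitionId: Int, globalLong: Long): Boolean =
  partitionId == (globalLong % 4).toInt

// On a real RDD:
// rdd.mapPartitionsWithIndex { (pid, it) =>
//   if (someFunction(pid, globalLong) && it.hasNext) Iterator(it.next())
//   else Iterator.empty
// }

// Local simulation with four "partitions":
val globalLong = 6L
val parts = Vector(Iterator("a"), Iterator("b"), Iterator("c"), Iterator("d"))
val picked = parts.zipWithIndex.flatMap { case (it, pid) =>
  if (someFunction(pid, globalLong) && it.hasNext) Iterator(it.next())
  else Iterator.empty
}
assert(picked == Vector("c"))  // 6 % 4 == 2, so partition 2 is chosen
```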

Re: One element per node

2015-09-18 Thread Feynman Liang
AFAIK the physical distribution is not exposed in the public API; the closest I can think of is `rdd.coalesce(numPhysicalNodes).mapPartitions(...` but this assumes that one partition exists per node On Fri, Sep 18, 2015 at 4:09 PM, Ulanov, Alexander wrote: > Thank
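A runnable local stand-in for that suggestion (`numPhysicalNodes` is an assumption you would have to supply, and as noted above `coalesce` does not guarantee one partition per node; the commented lines show the actual RDD form):

```scala
// On a real RDD, the suggestion above would be roughly:
// rdd.coalesce(numPhysicalNodes)
//    .mapPartitions(it => if (it.hasNext) Iterator(it.next()) else Iterator.empty)

// Local simulation: split the data into numPhysicalNodes buckets, take each head.
val numPhysicalNodes = 2  // assumed known for this sketch
val data = (1 to 10).toList
val partitions =
  data.grouped(math.ceil(data.size / numPhysicalNodes.toDouble).toInt).toList
val picked = partitions.flatMap(p => p.headOption)
assert(picked == List(1, 6))
```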

Re: [MLlib] BinaryLogisticRegressionSummary on test set

2015-09-18 Thread Hao Ren
Thank you for the reply. I have created a jira issue and pinged mengxr. Here is the link: https://issues.apache.org/jira/browse/SPARK-10691 I did not find jkbradley on jira. I saw he is on github. BTW, should I create a pull request removing the private modifier for further discussion?

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: One element per node

2015-09-18 Thread Reynold Xin
The reason it is nondeterministic is because tasks are not always scheduled to the same nodes -- so I don't think you can make this deterministic. If you assume no failures and tasks take a while to run (so they run slower than the scheduler can schedule them), then I think you can make it

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical node)? From: Feynman Liang [mailto:fli...@databricks.com] Sent: Friday, September 18, 2015 4:06 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: One element per node

One element per node

2015-09-18 Thread Ulanov, Alexander
Dear Spark developers, Is it possible (and how to do it if possible) to pick one element per physical node from an RDD? Let's say the first element of any partition on that node. The result would be an RDD[element], where the count of elements equals the number of nodes that have partitions of the

Re: One element per node

2015-09-18 Thread Feynman Liang
rdd.mapPartitions(x => new Iterator(x.head)) On Fri, Sep 18, 2015 at 3:57 PM, Ulanov, Alexander wrote: > Dear Spark developers, > > > > Is it possible (and how to do it if possible) to pick one element per > physical node from an RDD? Let’s say the first element of
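As quoted, `new Iterator(x.head)` does not compile (`Iterator` is a trait with no such constructor), and `x.head` would throw on an empty partition. A compiling sketch of the same idea follows; note it yields one element per *partition*, which matches "per node" only under the assumptions discussed elsewhere in this thread:

```scala
// First element of each non-empty partition, skipping empty partitions.
def firstPerPartition[T](it: Iterator[T]): Iterator[T] =
  if (it.hasNext) Iterator(it.next()) else Iterator.empty

// On a real RDD:  rdd.mapPartitions(firstPerPartition)
// Local simulation with three "partitions", one of them empty:
val parts = Seq(Iterator(1, 2, 3), Iterator.empty[Int], Iterator(7))
assert(parts.flatMap(firstPerPartition) == Seq(1, 7))
```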

Re: RDD API patterns

2015-09-18 Thread sim
Thanks everyone for the comments! I waited for more replies to come before I responded as I was interested in the community's opinion. The thread I'm noticing in this thread (pun intended) is that most responses focus on the nested RDD issue. I think we all agree that it is problematic for many

Re: RDD API patterns

2015-09-18 Thread sim
@debasish83, yes, there are many ways to optimize and work around the limitation of no nested RDDs. The point of this thread is to discuss the API patterns of Spark in order to make the platform more accessible to lots of developers solving interesting problems quickly. We can get API consistency

Re: RDD API patterns

2015-09-18 Thread sim
Aniket, yes, I've done the separate file trick. :) Still, I think we can solve this problem without nested RDDs. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-API-patterns-tp14116p14192.html Sent from the Apache Spark Developers List mailing

Re: RDD API patterns

2015-09-18 Thread sim
Robin, my point exactly. When an API is valuable, let's expose it in a way that it may be used easily for all data Spark touches. It should not require much development work to implement the sampling logic to work for an Iterable as opposed to an RDD.

Re: Reply: bug in Worker.scala, ExecutorRunner is not serializable

2015-09-18 Thread Shixiong Zhu
I'm wondering if we should create a tag trait (e.g., LocalMessage) for messages like this and add the comment in the trait. Looks better than adding inline comments for all these messages. Best Regards, Shixiong Zhu 2015-09-18 15:10 GMT+08:00 Reynold Xin : > Maybe we should
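The tag-trait idea can be sketched like this (all names below are illustrative, not Spark's actual classes; `ExecutorRunner`-style payloads are stand-ins):

```scala
/** Marker trait documenting messages that must stay within the local JVM
  * and are therefore deliberately not Serializable. */
sealed trait LocalMessage

// Hypothetical local-only message carrying a non-serializable-style payload.
case class LaunchExecutorRunner(appId: String) extends LocalMessage

// The trait can also be used to route or describe such messages in one place:
def describe(m: LocalMessage): String = m match {
  case LaunchExecutorRunner(id) => s"local-only message for app $id"
}

assert(describe(LaunchExecutorRunner("app-1")) == "local-only message for app app-1")
```

The benefit over inline comments is that the contract ("never send this over the wire") lives on the type itself, so every message gets it by extending the trait.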