Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Frankly speaking, I think reduceByKey with a Partitioner has the same problem and should not be exposed to public users either, because it is a little hard to fully understand how the partitioner behaves without looking at the actual code. And if there exists a basic contract of a
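The basic contract being alluded to is presumably that getPartition must be a pure, deterministic function of the key, returning an id in [0, numPartitions), so equal keys always land in the same partition. A minimal sketch of a contract-honoring partitioner:

    import org.apache.spark.Partitioner

    // Equal keys always map to the same non-negative partition id,
    // which is what shuffle operators rely on to co-locate keys.
    class ModPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = {
        val mod = key.hashCode % parts
        if (mod < 0) mod + parts else mod
      }
    }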

DAG in Pipeline

2016-06-08 Thread Pranay Tonpay
Hi, a Pipeline as of now seems to be a series of transformers and estimators run in serial fashion. Is it possible to create a DAG sort of thing - e.g., two transformers running in parallel to cleanse data (a custom-built Transformer) in some way, and then their outputs (two outputs) used for
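For context, a minimal sketch of the linear shape Pipeline supports today, using standard Tokenizer/HashingTF/LogisticRegression stages purely for illustration; a fan-out/fan-in DAG would currently have to be composed outside the Pipeline or folded into a single custom Transformer:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stages run strictly one after another, each stage consuming the
    // DataFrame produced by the previous one.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression()
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))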

Re: rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result. Why does reduceByKey with a Partitioner exist, then? On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 wrote: > Hi Alexander, > > I think it does not guarantee to be right if an arbitrary Partitioner is > passed in.
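A sketch of the kind of contract-violating partitioner presumably used here (hypothetical code): because getPartition ignores the key, copies of the same key can land in different partitions and get reduced separately, so duplicate keys with partial sums survive in the output.

    import scala.util.Random
    import org.apache.spark.Partitioner

    // Violates the Partitioner contract: the id is not a function of the key.
    class RandomPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = Random.nextInt(parts)
    }

    // rdd.reduceByKey(new RandomPartitioner(4), _ + _)  // incorrect results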

Re: rdd.distinct with Partitioner

2016-06-08 Thread Mridul Muralidharan
The example violates the basic contract of a Partitioner. It does make sense to take Partitioner as a param to distinct - though it is fairly trivial to simulate that in user code as well ... Regards Mridul On Wednesday, June 8, 2016, 汪洋 wrote: > Hi Alexander, > > I
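A sketch of that user-code simulation, assuming a contract-honoring Partitioner p (the helper name is made up); it mirrors the pair-and-reduce trick distinct already uses internally:

    import scala.reflect.ClassTag
    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Pair each element with a dummy value, reduce by key under p so
    // duplicates collapse, then drop the dummy value.
    def distinctWith[T: ClassTag](rdd: RDD[T], p: Partitioner): RDD[T] =
      rdd.map(x => (x, null)).reduceByKey(p, (a, _) => a).map(_._1)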

Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Hi Alexander, I think the result is not guaranteed to be correct if an arbitrary Partitioner is passed in. I have created a notebook; you can check it out.

rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
Most of the RDD methods that shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature. Should I open a PR for that? /** * Return a new RDD containing the distinct elements in this RDD. */ def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] =
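For concreteness, a hypothetical completion of the overload being proposed, mirroring the existing distinct(numPartitions: Int) body:

    /**
     * Return a new RDD containing the distinct elements in this RDD,
     * shuffling with the given partitioner.
     */
    def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] =
      map(x => (x, null)).reduceByKey(partitioner, (x, _) => x).map(_._1)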

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Yes you can :) On Wed, Jun 8, 2016 at 6:00 PM, Alexander Pivovarov wrote: > Can I just enable spark.kryo.registrationRequired and look at error > messages to get unregistered classes? > > On Wed, Jun 8, 2016 at 5:52 PM, Reynold Xin wrote: > >> Due to
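A minimal sketch of that workflow: with registration required, Kryo fails fast on any unregistered class, and the exception message names the class still missing.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Throw on any class that is not explicitly registered; the error
      // message identifies the class to add.
      .set("spark.kryo.registrationRequired", "true")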

Re: Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
Can I just enable spark.kryo.registrationRequired and look at error messages to get unregistered classes? On Wed, Jun 8, 2016 at 5:52 PM, Reynold Xin wrote: > Due to type erasure they have no difference, although watch out for Scala > tuple serialization. > > > On

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Due to type erasure there is no difference between them, although watch out for Scala tuple serialization. On Wednesday, June 8, 2016, Ted Yu wrote: > I think the second group (3 classOf's) should be used. > > Cheers > > On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov
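A quick way to see the erasure point (MyClass below is a stand-in for the class from the question): at runtime every Tuple2 is the same class, so the pair entries in the two candidate lists are interchangeable.

    class MyClass // stand-in

    // Both classOf expressions erase to scala.Tuple2.
    val a = classOf[(Any, Any)]
    val b = classOf[(String, (Long, MyClass))]
    assert(a == b)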

Re: Kryo registration for Tuples?

2016-06-08 Thread Ted Yu
I think the second group (3 classOf's) should be used. Cheers On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov wrote: > if my RDD is RDD[(String, (Long, MyClass))] > > Do I need to register > > classOf[MyClass] > classOf[(Any, Any)] > > or > > classOf[MyClass] >

Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
if my RDD is RDD[(String, (Long, MyClass))], do I need to register classOf[MyClass] and classOf[(Any, Any)], or classOf[MyClass], classOf[(Long, MyClass)], and classOf[(String, (Long, MyClass))]?
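Putting the thread together, a minimal registration sketch (MyClass is hypothetical): given erasure, registering the class itself plus a single Tuple2 entry covers both nesting levels of the pair type.

    import org.apache.spark.SparkConf

    class MyClass // hypothetical

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyClass], classOf[(Any, Any)]))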

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Andres Perez
We were able to reproduce it with a minimal example. I've opened a jira issue: https://issues.apache.org/jira/browse/SPARK-15825 On Wed, Jun 8, 2016 at 12:43 PM, Koert Kuipers wrote: > great! > > we weren't able to reproduce it because the unit tests use a > broadcast-join

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Koert Kuipers
Great! We weren't able to reproduce it because the unit tests use a broadcast join while on the cluster it uses a sort-merge join, so the issue is in the sort-merge join. We are now able to reproduce it in tests using spark.sql.autoBroadcastJoinThreshold=-1. It produces weird-looking garbled results in
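A sketch of that test setup, assuming a Spark 2.0-preview SparkSession named spark: a threshold of -1 disables broadcast joins, so equi-joins fall back to sort-merge join as on the cluster.

    // Force sort-merge join in tests to match cluster behavior.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")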

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Pete Robbins
I just raised https://issues.apache.org/jira/browse/SPARK-15822 for a similar-looking issue. Analyzing the core dump from the segv with Memory Analyzer, it looks very much like a UTF8String is badly corrupted. Cheers, On Fri, 27 May 2016 at 21:00 Koert Kuipers wrote: > hello

Spark 2.0.0 preview docs uploaded

2016-06-08 Thread Sean Owen
OK, this is done: http://spark.apache.org/documentation.html http://spark.apache.org/docs/2.0.0-preview/ http://spark.apache.org/docs/preview/ On Tue, Jun 7, 2016 at 4:59 PM, Shivaram Venkataraman wrote: > As far as I know the process is just to copy docs/_site from