Yep, otherwise this will become an N^2 problem: Scala versions × Hadoop
distributions × ...
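The combinatorial worry can be made concrete with a quick back-of-the-envelope sketch. The profile names below are illustrative, pulled from versions mentioned elsewhere in this thread (Hadoop 1, Hadoop 2.x, CDH4, Hive, Scala 2.10/2.11), not the actual release matrix:

```scala
// Illustration of the release-matrix blowup: every supported Scala version
// crossed with every Hadoop profile, with and without Hive, is a separate
// binary to build, test, and host.
object ReleaseMatrix {
  val scalaVersions  = Seq("2.10", "2.11")
  val hadoopProfiles = Seq("hadoop1", "hadoop2.3", "hadoop2.4",
                           "hadoop2.6", "cdh4", "mapr3", "mapr4")
  val hiveOptions    = Seq(true, false)

  // Cross product of all the build dimensions.
  val builds: Seq[String] = for {
    s    <- scalaVersions
    h    <- hadoopProfiles
    hive <- hiveOptions
  } yield s"spark-$s-$h${if (hive) "-hive" else ""}"

  def main(args: Array[String]): Unit =
    println(s"${builds.size} distinct binary distributions") // prints 28
}
```

Each new dimension (another Scala version, another Hadoop release) multiplies rather than adds to the count, which is the argument for keeping a minimal official set.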
Maybe one option is to have a minimal basic set (which I know is what we
are discussing) and move the rest to spark-packages.org. There the vendors
can add the latest downloads - for example when 1.4 is
Yeah, interesting question of what is the better default for the
single set of artifacts published to Maven. I think there's an
argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
and cons discussed more at
https://issues.apache.org/jira/browse/SPARK-5134
We probably want to revisit the way we do binaries in general for
1.4+. IMO, something worth forking a separate thread for.
I've been hesitating to add new binaries because people
(understandably) complain if you ever stop packaging older ones, but
on the other hand the ASF has complained that we
+1
Tested it on Mac OS X.
One small issue I noticed is that the Scala 2.11 build is using Hadoop 1
without Hive, which is kind of weird because people are more likely to want
Hadoop 2 with Hive. So it would be good to publish a build for that
configuration instead. We can do it if we do a new
Can you paste the complete code?
Thanks
Best Regards
On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
Hi,
I've implemented class MyClass in MLlib that does some operation on
LabeledPoint. MyClass extends Serializable, so I can map this operation on
data of
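For reference, the pattern being described is roughly the following hypothetical sketch. Spark is not on the classpath here, so LabeledPoint is stubbed with a minimal case class, and `transform`/`scale` are made-up names standing in for whatever MyClass actually does:

```scala
// Minimal stub of MLlib's LabeledPoint for a self-contained example.
case class LabeledPoint(label: Double, features: Array[Double])

// A Serializable class whose method can be mapped over LabeledPoint records.
// Extending Serializable is what lets Spark ship the instance inside a
// closure to the executors.
class MyClass(val scale: Double) extends Serializable {
  // Example operation: rescale the feature vector.
  def transform(p: LabeledPoint): LabeledPoint =
    p.copy(features = p.features.map(_ * scale))
}

object MyClassDemo {
  def main(args: Array[String]): Unit = {
    val data = Seq(LabeledPoint(1.0, Array(2.0, 4.0)))
    val op   = new MyClass(0.5)
    // With Spark this would be rdd.map(op.transform) on an RDD[LabeledPoint].
    val out = data.map(op.transform)
    println(out.head.features.mkString(",")) // prints 1.0,2.0
  }
}
```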
Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
Maven artifacts.
Patrick I see you just commented on SPARK-5134 and will follow up
there. Sounds like this may accidentally not be a problem.
On binary
I think that yes, longer term we want to have encryption of all
communicated data. However Jeff, can you open a JIRA to discuss the
design before opening a pull request (it's fine to link to a WIP
branch if you'd like)? I'd like to better understand the performance
and operational complexity of
Yeah, my concern is that people should get Apache Spark from *Apache*, not from
a vendor. It helps everyone use the latest features no matter where they are.
In the Hadoop distro case, Hadoop made all this effort to have standard APIs
(e.g. YARN), so it should be easy. But it is a problem if
I'm interested in seeing this data transfer occurring over encrypted
communication channels as well. Many customers require that all network
transfer occur encrypted to prevent the soft underbelly that's often
found inside a corporate network.
On Fri, Mar 6, 2015 at 4:20 PM, turp1twin
I have already written most of the code, just finishing up the unit tests
right now...
Jeff
On Sun, Mar 8, 2015 at 5:39 PM, Andrew Ash and...@andrewash.com wrote:
I'm interested in seeing this data transfer occurring over encrypted
communication channels as well. Many customers require that
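For context, the knobs that authentication and wire encryption ended up behind in the 1.x line look roughly like this in spark-defaults.conf. The property names are an assumption based on 1.4-era documentation (SASL encryption for the block transfer service, SSL for the HTTP endpoints); check the security docs for your version before relying on them:

```
# spark-defaults.conf sketch -- treat names as assumptions, not gospel
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
# SSL for the file server / HTTP broadcast endpoints:
spark.ssl.enabled                        true
```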
I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - I think the goal needs to be to allow
people to download Apache Spark and use it with CDH, HDP, MapR,
whatever... This is the whole reason why HDFS and YARN have stable
API's, so that other
Our goal is to let people use the latest Apache release even if vendors fall
behind or don't want to package everything, so that's why we put out releases
for vendors' versions. It's fairly low overhead.
Matei
On Mar 8, 2015, at 5:56 PM, Sean Owen so...@cloudera.com wrote:
Ah. I
Yeah it's not much overhead, but here's an example of where it causes
a little issue.
I like that reasoning. However, the released builds don't track the
later versions of Hadoop that vendors would be distributing -- there's
no Hadoop 2.6 build for example. CDH4 is here, but not the
far-more-used