Re: problems with build of the latest master

2015-07-15 Thread Steve Loughran
On 14 Jul 2015, at 12:22, Ted Yu yuzhih...@gmail.com wrote: Looking at Jenkins, master branch compiles. Can you try the following command? mvn -Phive -Phadoop-2.6 -DskipTests clean package What version of Java are you using? Ted, Giles has stuck in

Expression.resolved mismatched with the correct values in Catalyst?

2015-07-15 Thread Takeshi Yamamuro
Hi, devs I found that the case of 'Expression.resolved != (Expression.childrenResolved && checkInputDataTypes().isSuccess)' occurs in the output of Analyzer. That is, some tests in o.a.s.sql.* fail if the code below is added in CheckAnalysis:
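
A minimal sketch of the kind of check being described, assuming it is added to CheckAnalysis and uses only public Expression methods; the traversal and error reporting are illustrative, not the actual patch:

    // `plan` is the LogicalPlan emitted by the Analyzer.
    // Hypothetical invariant check: after analysis, `resolved` should agree
    // with childrenResolved && checkInputDataTypes().isSuccess.
    plan.foreachUp { node =>
      node.expressions.foreach { e =>
        val expected = e.childrenResolved && e.checkInputDataTypes().isSuccess
        if (e.resolved != expected) {
          sys.error(s"resolved invariant violated for expression: $e")
        }
      }
    }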

Re: Should spark-ec2 get its own repo?

2015-07-15 Thread Sean Owen
The code can continue to be a good reference implementation, no matter where it lives. In fact, it can be a better, more complete one, and easier to update. I agree that ec2/ needs to retain some kind of pointer to the new location. Yes, maybe a script as well that does the checkout as you say. We

Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
Hi all, I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as: 1. Whether the record is good or bad (e.g., Either) 2. Originating file and lines Part of my motivation is to prevent errors with individual records from stopping the entire
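
A minimal sketch of the tagging pattern being described, using Either for good/bad records plus an origin tag; Record and Provenance are hypothetical names, not from the thread:

    // Hypothetical wrapper: a parse failure becomes a Left instead of an
    // exception, and every record remembers the file and line it came from.
    case class Provenance(file: String, line: Long)
    case class Record[T](value: Either[String, T], origin: Provenance)

    def parse(raw: String, origin: Provenance): Record[Int] = {
      val value =
        try Right(raw.trim.toInt)
        catch { case _: NumberFormatException => Left(raw) }
      Record(value, origin)
    }

One bad line then no longer stops the whole job; bad records flow through as Lefts and can be filtered out and inspected downstream.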

Re: problems with build of latest the master

2015-07-15 Thread Sean Owen
You shouldn't get dependencies you need from Spark, right? You declare direct dependencies. Are we talking about re-scoping or excluding this dep from Hadoop transitively? On Wed, Jul 15, 2015 at 7:33 PM, Gil Vernik g...@il.ibm.com wrote: Right, it's not currently a dependency in Spark. If we

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
I'm considering a few approaches -- one of which is to provide new functions like mapLeft, mapRight, filterLeft, etc. But this all falls short with DataFrames. RDDs can easily be extended from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add special columns? On Wed, Jul 15, 2015

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin
Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal. On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling rnowl...@gmail.com wrote: I'm considering a few approaches -- one of which is to provide new functions like mapLeft, mapRight, filterLeft, etc. But this all falls short
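
A minimal sketch of the column-based variant Reynold suggests, using plain withColumn; the column names and literal values are illustrative:

    import org.apache.spark.sql.functions.lit

    // Carry validity and provenance as ordinary columns instead of a wrapper.
    val tagged = dataFrame
      .withColumn("is_valid", lit(true))
      .withColumn("source_file", lit("part-00000"))

    // Downstream stages filter on the flag rather than failing outright.
    val good = tagged.filter(tagged("is_valid"))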

Re: problems with build of latest the master

2015-07-15 Thread Gil Vernik
Right, it's not currently a dependency in Spark. Since we already mention it, is it possible to make it part of the current dependencies, but only for Hadoop profiles 2.4 and up? This would save a lot of headache for those who use Spark + OpenStack Swift and need every time to manually edit pom.xml to add

Re: Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Reynold Xin
Hi Bob, Thanks for the email. You can select Spark as the project when you file a JIRA ticket at https://issues.apache.org/jira/browse/SPARK. As for "select 1 from $table where 0=1" -- if the database's optimizer doesn't do constant folding and short-circuit execution, could the query end up

Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Bob Beauchemin
tableExists in spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses non-standard SQL (specifically, the LIMIT keyword) to determine whether a table exists in a JDBC data source. This will cause an exception in many/most JDBC databases that don't support the LIMIT keyword.
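
A minimal sketch of the trade-off being reported, with a hypothetical tableExists; the actual JdbcUtils code may differ:

    import java.sql.Connection

    def tableExists(conn: Connection, table: String): Boolean = {
      // Non-standard probe: LIMIT throws on databases that don't support it.
      //   s"SELECT 1 FROM $table LIMIT 1"
      // More portable probe: relies on constant folding so no rows are read.
      val probe = s"SELECT 1 FROM $table WHERE 1=0"
      try { conn.prepareStatement(probe).executeQuery(); true }
      catch { case _: java.sql.SQLException => false }
    }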

Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2015-07-15 Thread daniel.mescheder
Hey everyone, Consider the following use of spark.sql.shuffle.partitions: case class Data(A: String = f"${(math.random*1e8).toLong}%09.0f", B: String = f"${(math.random*1e8).toLong}%09.0f") val dataFrame = (1 to 1000).map(_ => Data()).toDF dataFrame.registerTempTable("data") sqlContext.setConf(
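
The snippet is cut off at setConf; a minimal reconstruction of where it is presumably headed, with an assumed partition count of "10":

    // Hypothetical continuation: shrink the shuffle partition count before
    // running an aggregation over the registered table.
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")
    sqlContext.sql("SELECT A, COUNT(B) FROM data GROUP BY A").collect()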

Re: problems with build of the latest master

2015-07-15 Thread Sean Owen
Why does Spark need to depend on it? I'm missing that bit. If an OpenStack artifact is needed for OpenStack, shouldn't OpenStack add it? Otherwise everybody gets it in their build. On Wed, Jul 15, 2015 at 7:52 PM, Gil Vernik g...@il.ibm.com wrote: I mean currently users that wish to use Spark

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Patrick Wendell
One related note here is that we have a Java version of this that is an abstract class - in the doc it says that it exists more or less to allow for binary compatibility (it says it's for Java users, but really Scala could use this also):

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Patrick Wendell
Actually the Java one is a concrete class. On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell pwend...@gmail.com wrote: One related note here is that we have a Java version of this that is an abstract class - in the doc it says that it exists more or less to allow for binary compatibility (it

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Marcelo Vanzin
Or, alternatively, the bus could catch that error and ignore / log it, instead of stopping the context... On Wed, Jul 15, 2015 at 12:20 PM, Marcelo Vanzin van...@cloudera.com wrote: Hmm, the Java listener was added in 1.3, so I think it will work for my needs. Might be worth it to make it

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Reynold Xin
It's bad to expose a trait - even though we want to mix in stuff. We should really audit all of these and expose only abstract classes for anything beyond an extremely simple interface. That itself, however, would break binary compatibility. On Wed, Jul 15, 2015 at 12:15 PM, Patrick Wendell
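
A minimal sketch of the binary-compatibility point, using a simplified listener rather than the real SparkListener:

    // Trait: adding a method later breaks implementations compiled against
    // the old version, since they don't define the new member.
    trait Listener {
      def onStart(): Unit
      // def onNewEvent(): Unit  // adding this later breaks old binaries
    }

    // Class with no-op defaults: new methods can be added safely, because
    // existing subclasses inherit the default implementation.
    abstract class ListenerBase {
      def onStart(): Unit = {}
      def onNewEvent(): Unit = {}  // safe to add in a later release
    }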

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Marcelo Vanzin
Hmm, the Java listener was added in 1.3, so I think it will work for my needs. Might be worth it to make it clear in the SparkListener documentation that people should avoid using it directly. Or follow Reynold's suggestion. On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell pwend...@gmail.com

Re: problems with build of the latest master

2015-07-15 Thread Gil Vernik
I mean currently users that wish to use Spark and configure Spark to use OpenStack Swift need to manually edit the pom.xml of Spark (main, core, yarn) to add hadoop-openstack to it, and then compile Spark. My question is why not include this dependency in Spark for Hadoop profiles 2.4 and

Re: Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Bob Beauchemin
Granted the 1=0 thing is ugly and assumes constant-folding support or reads way too much data. Submitted JIRA SPARK-9078 (thanks for pointers) and expounded on possible solutions a little bit more there. Cheers, and thanks, Bob

Re: Are These Issues Suitable for our Senior Project?

2015-07-15 Thread Joseph Bradley
Per recent comments on SPARK-6442, I'd recommend not working on that one for now. Instead, even if the tasks are not that interesting to you, you should try some small tasks at first to get used to contributing. I am quite sure we'll want to solve SPARK-3703 by May 2016; that's pretty far in the

Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
I haven't gotten a response on user@ yet for these questions, but these are probably better questions for dev@ anyway, aren't they? Could somebody on dev@ please respond? Thanks, Jonathan From: Jonathan Kelly jonat...@amazon.com Date: Wednesday, July 15, 2015 at 12:18
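
A minimal sketch of the conflict the subject line describes, with illustrative values; the override behavior is as reported in the thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      // Reportedly wins if also set in spark-defaults.conf, pinning a
      // fixed executor count and defeating dynamic allocation.
      .set("spark.executor.instances", "2")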

Announcing Spark 1.4.1!

2015-07-15 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.4.1 maintenance release. We recommend all users on the 1.4 branch upgrade to this release, which contains several important bug fixes. Download Spark 1.4.1 - http://spark.apache.org/downloads.html Release notes -

RE: BlockMatrix multiplication

2015-07-15 Thread Ulanov, Alexander
Hi Burak, I’ve modified my code as you suggested; however, it still leads to shuffling. Could you suggest what’s wrong with my code, or provide example code for block matrix multiplication that preserves data locality and does not cause shuffling? Modified code: import
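
For reference, a minimal sketch of BlockMatrix multiplication with the MLlib API; the block layout here is illustrative, and (as the reply below explains) multiply still shuffles via cogroup:

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    // A 4x4 matrix stored as two dense 2x2 diagonal blocks.
    val blocks = sc.parallelize(Seq(
      ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
      ((1, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
    val a = new BlockMatrix(blocks, 2, 2)

    // multiply cogroups the two block RDDs, so a shuffle is expected here.
    val c = a.multiply(a)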

Re: BlockMatrix multiplication

2015-07-15 Thread Burak Yavuz
Hi Alexander, I just noticed the error in my logic. There will always be a shuffle due to the `cogroup`. `join` also uses cogroup, therefore a shuffle is inevitable. However, the reduceByKey will not cause a shuffle. I forgot about how cogroup will try to match things, even if they don't exist.

Re: RestSubmissionClient Basic Auth

2015-07-15 Thread Joel Zambrano
Thanks Akhil! For the one where I change the REST client, how likely would it be that a change like that goes through? Would it be rejected as an uncommon scenario? I really don't want to have this as a separate fork of the branch. Thanks, Joel From: Akhil Das

Re: problems with build of the latest master

2015-07-15 Thread Ted Yu
I attached a patch for HADOOP-12235. BTW, openstack was not mentioned in the first email from Gil. My email and Gil's second email were sent around the same moment. Cheers On Wed, Jul 15, 2015 at 2:06 AM, Steve Loughran ste...@hortonworks.com wrote: On 14 Jul 2015, at 12:22, Ted Yu

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
The map-side combine is not that necessary here, given that it cannot reduce the size of the data for shuffling much (the key still needs to be serialized for each value), though it can reduce the number of key-value pairs and potentially reduce the number of operations later (repartition and groupby). On
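
A minimal sketch of the Scala-side analogue: groupByKey-style aggregation with map-side combine explicitly disabled, since merging values into buffers saves little shuffle volume; the buffer types and partition count are illustrative:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // val pairs: RDD[(String, Int)] = ...
    val grouped = pairs.combineByKey(
      (v: Int) => List(v),                          // createCombiner
      (buf: List[Int], v: Int) => v :: buf,         // mergeValue
      (b1: List[Int], b2: List[Int]) => b1 ::: b2,  // mergeCombiners
      new HashPartitioner(4),
      mapSideCombine = false)  // buffers don't shrink the shuffled data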

Re: problems with build of the latest master

2015-07-15 Thread Josh Rosen
We may be able to fix this from the Spark side by adding appropriate exclusions in our Hadoop dependencies, right? If possible, I think that we should do this. On Wed, Jul 15, 2015 at 7:10 AM, Ted Yu yuzhih...@gmail.com wrote: I attached a patch for HADOOP-12235. BTW, openstack was not