Regarding KryoSerialization in Spark

2015-04-30 Thread twinkle sachdeva
Hi, As per the code, KryoSerialization uses the writeClassAndObject method, which internally calls the writeClass method, which writes the class of the object during serialization. The tuning page of the Spark documentation says that registering the class will avoid that. Am I missing

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Hi, in a test on SparkSQL 1.3.0, multiple threads run select on the same SQLContext instance, but the exception below is thrown, so it looks like SQLContext is NOT thread-safe? I think this is not the desired behavior. == java.lang.RuntimeException: [1.1] failure: ``insert'' expected but

Re: Is SQLContext thread-safe?

2015-04-30 Thread Wangfei (X)
actually this is a sql parse exception, are you sure your sql is right? Sent from my iPhone. On April 30, 2015, at 18:50, Haopu Wang hw...@qilinsoft.com wrote: Hi, in a test on SparkSQL 1.3.0, multiple threads are doing select on a same SQLContext instance, but below exception is thrown, so it looks like

Re: withColumn is very slow with datasets with large number of columns

2015-04-30 Thread alexandre Clement
I have reported the issue on JIRA: https://issues.apache.org/jira/browse/SPARK-7276 On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement a.p.clem...@gmail.com wrote: Hi all, I'm experiencing a serious performance problem when using withColumn on a dataset with a large number of columns. It is

withColumn is very slow with datasets with large number of columns

2015-04-30 Thread alexandre Clement
Hi all, I'm experiencing a serious performance problem when using withColumn on a dataset with a large number of columns. It is very slow: on a dataset with 100 columns it takes a few seconds. The code snippet demonstrates the problem. val custs = Seq( Row(1, "Bob", 21, 80.5), Row(2, "Bobby", 21,

Drop column/s in DataFrame

2015-04-30 Thread rakeshchalasani
Hi All: Is there any plan to add drop column/s functionality in the DataFrame? One can use the select function to do so, but I find that tedious when only one or two columns in a large DataFrame are to be dropped. Pandas has this functionality, which I find handy when constructing feature vectors
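
The select-based workaround mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `DropColumnsSketch` and `remaining` are hypothetical names, and the DataFrame `df` in the trailing comment is assumed rather than defined. The helper only computes which column names survive; the comment shows how that list could be passed to select.

```scala
object DropColumnsSketch {
  /** Names of the columns that remain after dropping `toDrop`, in their original order. */
  def remaining(columns: Seq[String], toDrop: Set[String]): Seq[String] =
    columns.filterNot(toDrop.contains)
}

// With a Spark DataFrame `df` (assumed here, not defined), the result could feed select:
//   val kept = DropColumnsSketch.remaining(df.columns.toSeq, Set("age"))
//   val trimmed = df.select(kept.map(name => df(name)): _*)
```

Listing the columns to drop, rather than the columns to keep, is what makes this less tedious on a wide DataFrame.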

Re: Drop column/s in DataFrame

2015-04-30 Thread Reynold Xin
I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280 Would you like to give it a shot? On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com wrote: Hi All: Is there any plan to add drop column/s functionality in the DataFrame? One can use the select function to

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote: What's your schema for the offset table, and what's the definition of writeOffset? The schema is the same as the one in your post: topic | partition | offset. The writeOffset is nearly identical: def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {

Re: Is SQLContext thread-safe?

2015-04-30 Thread Michael Armbrust
Unfortunately, I think the SQLParser is not threadsafe. I would recommend using HiveQL. On Thu, Apr 30, 2015 at 4:07 AM, Wangfei (X) wangf...@huawei.com wrote: actually this is a sql parse exception, are you sure your sql is right? Sent from my iPhone. On April 30, 2015, at 18:50, Haopu Wang

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
What's your schema for the offset table, and what's the definition of writeOffset ? What key are you reducing on? Maybe I'm misreading the code, but it looks like the per-partition offset is part of the key. If that's true then you could just do your reduction on each partition, rather than

Re: [discuss] DataFrame function namespacing

2015-04-30 Thread Ted Yu
IMHO I would go with choice #1 Cheers On Wed, Apr 29, 2015 at 10:03 PM, Reynold Xin r...@databricks.com wrote: We definitely still have the name collision problem in SQL. On Wed, Apr 29, 2015 at 10:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Do we still have to keep the

practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
We're a group of experienced backend developers who are fairly new to Spark Streaming (and Scala) and very interested in using the new (in 1.3) DirectKafkaInputDStream impl as part of the metrics reporting service we're building. Our flow involves reading in metric events, lightly modifying some

Re: Drop column/s in DataFrame

2015-04-30 Thread Rakesh Chalasani
Sure, I will try sending a PR soon. On Thu, Apr 30, 2015 at 1:42 PM Reynold Xin r...@databricks.com wrote: I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280 Would you like to give it a shot? On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com wrote:

Re: Regarding KryoSerialization in Spark

2015-04-30 Thread Sandy Ryza
Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple bytes, instead of a full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, As per the code, KryoSerialization used
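
The size difference described above can be illustrated with a small stand-alone sketch. It is NOT Kryo's actual wire format: the registry, the 2-byte ID, and the names `ClassTagSizeSketch` and `classTagBytes` are assumptions made for illustration. It only shows why writing a small ID for a registered class is cheaper than writing the full class name. In Spark itself, registration is typically done through `SparkConf.registerKryoClasses`.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Illustration only: this mimics the idea behind class registration, not Kryo's real encoding.
object ClassTagSizeSketch {
  // Hypothetical registry mapping registered classes to small integer IDs.
  val registry: Map[Class[_], Int] = Map(classOf[String] -> 10)

  /** Number of bytes used to identify `cls` at the start of a serialized record. */
  def classTagBytes(cls: Class[_]): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    registry.get(cls) match {
      case Some(id) => out.writeShort(id)        // registered: a fixed 2-byte ID
      case None     => out.writeUTF(cls.getName) // unregistered: 2-byte length plus the full name
    }
    out.flush()
    buffer.size()
  }
}
```

In this sketch a registered class costs 2 bytes per record, while an unregistered class costs its name length plus 2, and that overhead is paid again for every serialized object.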

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns

Re: [discuss] ending support for Java 6?

2015-04-30 Thread shane knapp
something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
In fact, you're using the 2-arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 Kafka partitions? So they're definitely no longer 1:1. On Thu, Apr 30, 2015 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote: This is what I'm suggesting,
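
The difference between reducing within each partition and collapsing everything into one partition can be sketched with plain Scala collections. This is a simplified model, not Spark code: partitions are represented as nested `Seq`s, and `PerPartitionReduceSketch`, `reducePerPartition`, and `reduceToOne` are hypothetical names. In Spark, the per-partition variant would typically live inside `mapPartitions`.

```scala
object PerPartitionReduceSketch {
  /** Reduce values by key inside each partition; the number of partitions is unchanged. */
  def reducePerPartition[K, V](partitions: Seq[Seq[(K, V)]])(f: (V, V) => V): Seq[Map[K, V]] =
    partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
    }

  /** Reduce across all partitions into a single result, like a 1-partition reduceByKey. */
  def reduceToOne[K, V](partitions: Seq[Seq[(K, V)]])(f: (V, V) => V): Map[K, V] =
    partitions.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}
```

With the per-partition form, each output still corresponds to exactly one input partition, which is the 1:1 relationship that per-partition offset tracking relies on.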

Re: Pickling error when attempting to add a method in pyspark

2015-04-30 Thread Stephen Boesch
Bumping this. Does anyone have some familiarity with the py4j interface in pyspark? thanks 2015-04-27 22:09 GMT-07:00 Stephen Boesch java...@gmail.com: My intention is to add pyspark support for certain mllib spark methods. I have been unable to resolve pickling errors of the form

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Patrick Wendell
I'd also support this. In general, I think it's good that we try to have Spark support different versions of things (Hadoop, Hive, etc). But at some point you need to weigh the costs of doing so against the number of users affected. In the case of Java 6, we are seeing increasing cost from this.

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Punyashloka Biswal
I'm in favor of ending support for Java 6. We should also articulate a policy on how long we want to support current and future versions of Java after Oracle declares them EOL (Java 7 will be in that bucket in a matter of days). Punya On Thu, Apr 30, 2015 at 1:18 PM shane knapp

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sree V
Hi Team, Should we take this opportunity to lay out and evangelize a pattern for EOL of dependencies? I propose we follow the official EOL of java, python, scala. And add, say, 6-12-24 months depending on the popularity. Java 6 official EOL: Feb 2013. Add 6-12 months: Aug 2013 - Feb 2014. official
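
The date arithmetic in this proposal can be checked with a short sketch. `EolWindowSketch` and `supportWindow` are hypothetical names, and the grace periods are the ones quoted above; nothing here reflects an adopted policy.

```scala
import java.time.YearMonth

object EolWindowSketch {
  /** Earliest and latest month for ending support: the official EOL plus a grace period. */
  def supportWindow(officialEol: YearMonth, minMonths: Int, maxMonths: Int): (YearMonth, YearMonth) =
    (officialEol.plusMonths(minMonths.toLong), officialEol.plusMonths(maxMonths.toLong))
}
```

For Java 6, `EolWindowSketch.supportWindow(YearMonth.of(2013, 2), 6, 12)` gives August 2013 through February 2014, matching the range in the message.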

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote: In fact, you're using the 2-arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 Kafka partitions? So they're definitely no longer 1:1. True. I added the second arg because we were seeing multiple threads

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sean Owen
I'm firmly in favor of this. It would also fix https://issues.apache.org/jira/browse/SPARK-7009 and avoid any more of the long-standing 64K file limit thing that's still a problem for PySpark. As a point of reference, CDH5 has never supported Java 6, and it was released over a year ago. On Thu,

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Marcelo Vanzin
As for the idea, I'm +1. Spark is the only reason I still have jdk6 around - exactly because I don't want to cause the issue that started this discussion (inadvertently using JDK7 APIs). And as has been pointed out, even J7 is about to go EOL real soon. Even Hadoop is moving away (I think 2.7

[discuss] ending support for Java 6?

2015-04-30 Thread Reynold Xin
This has been discussed a few times in the past, but now that Oracle ended support for Java 6 over a year ago, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Koert Kuipers
nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months. On Thu, Apr 30, 2015 at 3:57 PM,

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
+1 on ending support for Java 6. BTW from https://www.java.com/en/download/faq/java_7.xml : After April 2015, Oracle will no longer post updates of Java SE 7 to its public download sites. On Thu, Apr 30, 2015 at 1:34 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: I'm in favor of ending

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
(On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that’s for another thread.) (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM

Re: Regarding KryoSerialization in Spark

2015-04-30 Thread twinkle sachdeva
Thanks for the info. On Fri, May 1, 2015 at 12:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Twinkle, Registering the class makes it so that writeClass only writes out a couple bytes, instead of a full String of the class name. -Sandy On Thu, Apr 30, 2015 at 4:13 AM, twinkle

Uninitialized session in HiveContext?

2015-04-30 Thread Marcelo Vanzin
Hey all, We ran into some test failures in our internal branch (which builds against Hive 1.1), and I narrowed it down to the fix below. I'm not super familiar with the Hive integration code, but does this look like a bug for other versions of Hive too? This caused an error where some internal

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Vinod Kumar Vavilapalli
FYI, after enough consideration, we the Hadoop community dropped support for JDK 6 starting release Apache Hadoop 2.7.x. Thanks +Vinod On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now Oracle has ended support for

Re: Uninitialized session in HiveContext?

2015-04-30 Thread Marcelo Vanzin
Hi Michael, It would be great to see changes to make hive integration less painful, and I can test them in our environment once you have a patch. But I guess my question is a little more geared towards the current code; doesn't the issue I ran into affect 1.4 and potentially earlier versions

Re: Mima test failure in the master branch?

2015-04-30 Thread zhazhan
Any PR open for this?

Re: Mima test failure in the master branch?

2015-04-30 Thread Ted Yu
Looks like this has been taken care of: commit beeafcfd6ee1e460c4d564cd1515d8781989b422 Author: Patrick Wendell patr...@databricks.com Date: Thu Apr 30 20:33:36 2015 -0700 Revert [SPARK-5213] [SQL] Pluggable SQL Parser Support On Thu, Apr 30, 2015 at 7:58 PM, zhazhan

Mima test failure in the master branch?

2015-04-30 Thread zhazhan
[info] spark-sql: found 1 potential binary incompatibilities (filtered 129) [error] * method sqlParser()org.apache.spark.sql.SparkSQLParser in class org.apache.spark.sql.SQLContext does not have a correspondent in new version [error] filter with: ProblemFilters.excludeMissingMethodProblem --

Re: Mima test failure in the master branch?

2015-04-30 Thread Patrick Wendell
I reverted the patch that I think was causing this: SPARK-5213 Thanks On Thu, Apr 30, 2015 at 7:59 PM, zhazhan zzh...@hortonworks.com wrote: Any PR open for this? -- View this message in context:

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ted Yu
But it is hard to know how long customers stay with their most recent download. Cheers On Thu, Apr 30, 2015 at 2:26 PM, Sree V sree_at_ch...@yahoo.com.invalid wrote: If there is any possibility of getting the download counts, then we can use it as EOS criteria as well. Say, if download counts

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Ram Sriharsha
+1 for end of support for Java 6 On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: FYI, after enough consideration, we the Hadoop community dropped support for JDK 6 starting release Apache Hadoop 2.7.x. Thanks +Vinod On Apr 30, 2015, at

Re: Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-30 Thread Yang Lei
I finally isolated the issue to be related to the ActorSystem I reuse from SparkEnv.get.actorSystem. This ActorSystem will contain the configuration defined in my application jar's reference.conf in both local cluster case, and in the case I use it directly in an extension to BaseRelation's

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Sree V
If there is any possibility of getting the download counts, then we can use it as EOS criteria as well. Say, if download counts are lower than 30% (or another number) of the lifetime highest, then it qualifies for EOS. Thanking you. With Regards Sree On Thursday, April 30, 2015 2:22 PM, Sree

Re: Uninitialized session in HiveContext?

2015-04-30 Thread Michael Armbrust
Hey Marcelo, Thanks for the heads up! I'm currently in the process of refactoring all of this (to separate the metadata connection from the execution side) and as part of this I'm making the initialization of the session not lazy. It would be great to hear if this also works for your internal

Custom PersistenceEngine and LeaderAgent implementation in Java

2015-04-30 Thread Niranda Perera
Hi, this follows up on the feature in [1]. I'm trying to implement a custom persistence engine and a leader agent in the Java environment. vis-a-vis scala, when I implement the PersistenceEngine trait in java, I would have to implement methods such as readPersistedData,

Re: Custom PersistenceEngine and LeaderAgent implementation in Java

2015-04-30 Thread Reynold Xin
We should change the trait to abstract class, and then your problem will go away. Do you want to submit a pull request? On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, this follows the following feature in this feature [1] I'm trying to implement a
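
A minimal sketch of the trait-to-abstract-class suggestion follows. All names are hypothetical and this is not Spark's actual PersistenceEngine. The point it illustrates is that, with the Scala versions in use at the time, concrete methods on a trait were not inherited by a Java implementor, whereas concrete methods on an abstract class are, so a Java subclass only has to supply the abstract ones.

```scala
import scala.collection.mutable

// Hypothetical names throughout; a sketch of the idea, not Spark's real API.
abstract class PersistenceEngineSketch {
  // Abstract: every implementation, in Scala or Java, must provide these.
  def persist(name: String, value: String): Unit
  def unpersist(name: String): Unit
  def read(prefix: String): Seq[String]

  // Concrete: inherited as-is by subclasses, including ones written in Java.
  final def addApplication(id: String, description: String): Unit =
    persist("app_" + id, description)

  final def readApplications(): Seq[String] = read("app_")
}

// A trivial in-memory implementation that only supplies the abstract methods.
class InMemoryEngine extends PersistenceEngineSketch {
  private val store = mutable.LinkedHashMap.empty[String, String]

  def persist(name: String, value: String): Unit = store.update(name, value)
  def unpersist(name: String): Unit = { store.remove(name); () }
  def read(prefix: String): Seq[String] =
    store.collect { case (k, v) if k.startsWith(prefix) => v }.toSeq
}
```

A Java class extending `PersistenceEngineSketch` would implement `persist`, `unpersist`, and `read`, and get `addApplication` and `readApplications` for free.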