Re: Broadcast big dataset

2016-09-28 Thread WangJianfei
First, thank you very much! My executor memory is also 4G, but my Spark version is 1.5. Could the Spark version be the problem? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127p19143.html Sent from the Apache Spark

[VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.1 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Michael Gummelt
+1 I know this is cancelled, but FYI, RC3 passes mesos/spark integration tests On Wed, Sep 28, 2016 at 2:52 AM, Sean Owen wrote: > (Process-wise there's no problem with that. The vote is open for at > least 3 days and ends when the RM says it ends. So it's valid anyway > as

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Jakob Odersky
I agree with Sean's answer, you can check out the relevant serializer here https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen wrote: > My guess is that Kryo specially handles

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months. With QA taking about a month, that's very reasonable. My main ask (especially for MLlib) is for contributors and committers to take extra care not to delay on updating the Programming Guide for new APIs. Documentation debt often collects and has to be paid off during QA, and a

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Burak, you can configure what happens with corrupt records for the datasource using the parse mode. The parse will still fail, so we can't get any data out of it, but we do leave the JSON in another column for you to inspect. In the case of this function, we'll just return null if it's unparsable.
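
For reference, a minimal sketch of the datasource-side behaviour being described (the input path is illustrative and a SparkSession `spark` is assumed in scope):

    val events = spark.read
      .option("mode", "PERMISSIVE")   // keep rows that fail to parse
      .json("events.json")            // path is illustrative
    // Unparsable rows surface with nulls in the data columns and the raw text
    // in the _corrupt_record column (the default corrupt-record column name).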

Re: Broadcast big dataset

2016-09-28 Thread Andrew Duffy
Have you tried upping executor memory? There's a separate spark conf for that: spark.executor.memory In general driver configurations don't automatically apply to executors. On Wed, Sep 28, 2016 at 7:03 AM -0700, "WangJianfei" wrote: Hi Devs In
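
A minimal sketch of setting it explicitly (the 8g value is illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.driver.memory only sizes the driver JVM; executors need their own
    // setting, here or via --executor-memory / --conf spark.executor.memory on spark-submit.
    val conf = new SparkConf()
      .setAppName("broadcast-example")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)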

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question? When you talk about ‘user specified schema’ do you mean for the user to supply an additional schema, or that you’re using the schema that’s described by the JSON string? (or both? [either/or] ) Thx On Sep 28, 2016, at 12:52 PM, Michael Armbrust

Re: Spark SQL JSON Column Support

2016-09-28 Thread Burak Yavuz
I would really love something like this! It would be great if it didn't throw away corrupt_records like the Data Source does. On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande wrote: > We are currently pulling out the JSON columns, passing them through > read.json, and then
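
A rough sketch of the workaround Nathan describes (the `df` DataFrame and `payload` column are made-up names, and a SparkSession `spark` is assumed):

    import spark.implicits._

    // Pull the JSON strings out of one column and run them through the JSON reader.
    val jsonStrings = df.select($"payload").as[String].rdd
    val parsed = spark.read.json(jsonStrings)   // schema is inferred from the strings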

Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function, from_json
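
A sketch of what the proposed function looks like in use, assuming it lands roughly as described in the PR (the Kafka-style DataFrame `kafkaDf` and the schema are illustrative):

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._

    // User-supplied schema for the JSON carried in the `value` column.
    val schema = new StructType()
      .add("id", LongType)
      .add("event", StringType)

    val parsed = kafkaDf.select(
      $"key",
      from_json($"value".cast("string"), schema).alias("json"))
    // json.id and json.event are now regular struct fields;
    // rows whose value is not valid JSON come back as null.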

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
I guess I'm claiming the artifacts wouldn't even be different in the first place, because the Hadoop APIs that are used are all the same across these versions. That would be the thing that makes you need multiple versions of the artifact under multiple classifiers. On Wed, Sep 28, 2016 at 1:16

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
Ok, don't you think it could be published with just different classifiers: hadoop-2.6, hadoop-2.4, hadoop-2.2 being the current default? So for now, I should just override Spark 2.0.0's dependencies with the ones defined in the pom profile On Thu, Sep 22, 2016 11:17 AM, Sean Owen
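
A sketch of that kind of override in a pom.xml (artifact IDs and version numbers here are illustrative, not taken from the thread):

    <!-- Exclude the Hadoop client that Spark pulls in transitively... -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.0.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- ...and depend on the Hadoop 2.6 client explicitly. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>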

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Tom Graves
+1 to 4 months. Tom On Tuesday, September 27, 2016 2:07 PM, Reynold Xin wrote: We are 2 months past releasing Spark 2.0.0, an important milestone for the project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence we had for the 1.x line, and we

Broadcast big dataset

2016-09-28 Thread WangJianfei
Hi Devs, In my application I just broadcast a dataset (about 500M) to the executors (100+) and I got a Java heap error. Jmartad-7219.hadoop.jd.local:53591 (size: 4.0 MB, free: 3.3 GB) 16/09/28 15:56:48 INFO BlockManagerInfo: Added broadcast_9_piece19 in memory on

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Hi All, Any updates on this? On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try increasing the parallelism by repartitioning, and you may also increase spark.default.parallelism. You can also try decreasing num-executor cores. Basically, this happens when the executor

Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Marcin Tustin
I've solved this in the past by using a thread pool which runs clean up code on thread creation, to clear out stale values. On Wednesday, September 28, 2016, Grant Digby wrote: > Hi, > > We've received the following error a handful of times and once it's > occurred > all
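
A rough sketch of that approach (not Marcin's actual code; assumes a SparkContext `sc` is in scope):

    import java.util.concurrent.{Executors, ThreadFactory}

    // Wrap each new pool thread so the first thing it does is drop the
    // execution id it inherited from the thread that created it.
    val cleaningFactory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.sql.execution.id", null)  // null clears the key
          r.run()
        }
      })
    }
    val pool = Executors.newFixedThreadPool(4, cleaningFactory)
    // Submit query-issuing work to this pool instead of raw threads.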

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Sean Owen
My guess is that Kryo specially handles Maps generically or relies on some mechanism that does, and it happens to iterate over all key/values as part of that and of course there aren't actually any key/values in the map. The Java serialization is a much more literal (expensive) field-by-field

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Sean Owen
(Process-wise there's no problem with that. The vote is open for at least 3 days and ends when the RM says it ends. So it's valid anyway as the vote is still open.) On Tue, Sep 27, 2016 at 8:37 PM, Reynold Xin wrote: > So technically the vote has passed, but IMHO it does not

IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Grant Digby
Hi, We've received the following error a handful of times and once it's occurred all subsequent queries fail with the same exception until we bounce the instance: IllegalArgumentException: spark.sql.execution.id is already set at

java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone, I suspect there is no point in submitting a JIRA to fix this (not a Spark issue?) but I would like to know if this problem is documented anywhere. Somehow Kryo is losing the default value during serialization: scala> import org.apache.spark.{SparkContext, SparkConf} import
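
A condensed sketch of the behaviour (an assumed reconstruction, not the full REPL transcript from the original message):

    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoSerializer

    val ser = new KryoSerializer(new SparkConf()).newInstance()
    val m = Map(1 -> "one").withDefaultValue("n/a")
    m(2)                                                    // "n/a" before serialization

    // After a Kryo round trip the rebuilt map no longer carries its default,
    // so the same lookup throws java.util.NoSuchElementException.
    val roundTripped = ser.deserialize[Map[Int, String]](ser.serialize(m))
    roundTripped(2)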

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Thanks Sushrut for the reply. Currently I have not defined the spark.default.parallelism property. Can you let me know how much I should set it to? Regards, Aditya Calangutkar On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try increasing the parallelism by repartitioning and
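
A sketch of the knobs being discussed (the values are illustrative; a common starting point for parallelism is 2-3 tasks per CPU core in the cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("executor-lost-tuning")
      .set("spark.default.parallelism", "200")   // default partition count for shuffles
      .set("spark.executor.cores", "2")          // fewer cores per executor => less memory pressure per executor
    val sc = new SparkContext(conf)

    // Repartitioning an existing RDD spreads the same data over more, smaller tasks:
    // val repartitioned = rdd.repartition(200)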

Spark Executor Lost issue

2016-09-28 Thread Aditya
I have a Spark job which runs fine for small data, but when the data increases it gives an executor lost error. My executor and driver memory are set at their highest point. I have also tried increasing --conf spark.yarn.executor.memoryOverhead=600 but am still not able to fix the problem. Is there any other