Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread Anthony Thomas
Interesting, thanks! That probably also explains why there seems to be a ton of shuffle for this operation. So what's the best option for truly scalable matrix multiplication on Spark then - implementing from scratch using the coordinate matrix ((i,j), k) format?
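
A minimal sketch of that coordinate-format approach, using plain RDD joins over ((i, j), value) entries; the function and its inputs are illustrative, not from the thread:

    import org.apache.spark.rdd.RDD

    // C = A * B for sparse coordinate matrices: join A's entries keyed by
    // column index with B's entries keyed by row index, multiply, then sum
    // the partial products for each output cell (i, j).
    def multiplyCoordinate(
        a: RDD[((Long, Long), Double)],
        b: RDD[((Long, Long), Double)]): RDD[((Long, Long), Double)] = {
      val aByCol = a.map { case ((i, k), v) => (k, (i, v)) }
      val bByRow = b.map { case ((k, j), v) => (k, (j, v)) }
      aByCol.join(bByRow)
        .map { case (_, ((i, av), (j, bv))) => ((i, j), av * bv) }
        .reduceByKey(_ + _)
    }

The join still shuffles, but each task only ever holds individual entries rather than whole rows or columns of blocks.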

Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread John Compitello
No problem. It was a big headache for my team as well. One of us already reimplemented it from scratch, as seen in this pending PR for our project: https://github.com/hail-is/hail/pull/1895 Hopefully you find that useful. We'll try to PR that into Spark at some point. Best, John

Create dataset from data frame with missing columns

2017-06-14 Thread tokeman24
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d
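
One possible workaround (a sketch, assuming Spark 2.x with a SparkSession named spark in scope): add the missing column as a typed null before converting, so the encoder can map it to None:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType
    import spark.implicits._

    case class CC(f1: String, f2: Option[String] = None)

    val df = Seq(("v1"), ("v2")).toDF("f1")
    // Supply the absent column explicitly; a typed null becomes None.
    val ds = df.withColumn("f2", lit(null).cast(StringType)).as[CC]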

Assign Custom receiver to a scheduler pool

2017-06-14 Thread Rabin Banerjee

[Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-14 Thread RD
Hi Spark folks, Is there any plan to support the richer UDF API that Hive supports for Spark UDFs ? Hive supports the GenericUDF API which has, among others methods like initialize(), configure() (called once on the cluster) etc, which a lot of our users use. We have now a lot of UDFs in Hive
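
For context, the Hive API in question looks roughly like this (a hedged sketch; the class name, package, and behavior are hypothetical, but the initialize()/evaluate() lifecycle is the GenericUDF contract):

    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

    class UpperGenericUDF extends GenericUDF {
      // Called once before any rows are processed: validate argument
      // types here and declare the return type.
      override def initialize(args: Array[ObjectInspector]): ObjectInspector =
        PrimitiveObjectInspectorFactory.javaStringObjectInspector

      // Called per row with lazily evaluated arguments.
      override def evaluate(args: Array[DeferredObject]): AnyRef = {
        val v = args(0).get()
        if (v == null) null else v.toString.toUpperCase
      }

      override def getDisplayString(children: Array[String]): String =
        s"upper_udf(${children.mkString(", ")})"
    }

With Hive support enabled, such a UDF can already be invoked from Spark SQL by registering it as a Hive function, e.g. spark.sql("CREATE TEMPORARY FUNCTION upper_udf AS 'com.example.UpperGenericUDF'"), though Spark's native UDF API has no equivalent of initialize()/configure().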

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-14 Thread Michael Armbrust
This is a good question. I really like using Kafka as a centralized source for streaming data in an organization and, with Spark 2.2, we have full support for reading and writing data to/from Kafka in both streaming and batch
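
For reference, reading Kafka as a streaming DataFrame in Spark 2.2 (a sketch; the broker address and topic are placeholders):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Kafka delivers key/value as binary; cast before processing.
    val parsed = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Swapping readStream for read gives the batch view of the same topic.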

Exception when accessing Spark Web UI in yarn-client mode

2017-06-14 Thread satishjohn
When I click on the ApplicationMaster URL I see this exception:
500013 [qtp1921129349-1619] WARN o.s.j.server.AbstractHttpConnection - header full: java.lang.RuntimeException: Header>6144
500014 [qtp1921129349-1619] WARN o.s.j.server.AbstractHttpConnection -

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-14 Thread bo yang
Hi Nihed, Interesting to see envelope. The idea is the same there! Thanks for sharing :) Best, Bo

On Wed, Jun 14, 2017 at 12:22 AM, nihed mbarek wrote:
> Hi
> I already saw a project with the same idea.
> https://github.com/cloudera-labs/envelope
> Regards,

Configurable Task level time outs and task failures

2017-06-14 Thread AnilKumar B
Hi, In some of the data science use cases like predictions, we are using Spark. Most of the time we faced data skewness issues, and we have distributed the data using Murmur hashing or round-robin assignment, which fixed the skewness across the partitions/tasks. But still, some of the tasks
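
A sketch of the salting approach described above, assuming a DataFrame df with a skewed key column (numBuckets is a tuning knob, not a value from the post):

    import org.apache.spark.sql.functions.{col, rand}

    val numBuckets = 32
    // A random salt spreads the rows of a hot key across many partitions,
    // which per-key hashing alone cannot do.
    val salted = df.withColumn("salt", (rand() * numBuckets).cast("int"))
    val repartitioned = salted.repartition(col("key"), col("salt"))

On the timeout question: Spark has no per-task timeout, but speculative execution (spark.speculation=true) can re-launch straggler tasks on other executors.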

Re: Java access to internal representation of DataTypes.DateType

2017-06-14 Thread Anton Kravchenko
I switched to java.sql.Date and converted milliseconds to days:

while (it.hasNext()) {
    Row irow = it.next();
    long t_long = irow.<java.sql.Date>getAs("time_col").getTime() / (60 * 60 * 1000) / 24;
    int t_int = toIntExact(t_long);
}

Though if there is a more efficient way to do
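
The conversion can also stay inside Spark SQL instead of iterating rows (a sketch, shown in Scala; the column name is taken from the snippet above):

    import org.apache.spark.sql.functions.{col, datediff, lit}

    // Days since the Unix epoch, computed without pulling rows to the driver.
    val withDays = df.withColumn("days", datediff(col("time_col"), lit("1970-01-01")))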

CFP FOR SPARK SUMMIT EUROPE CLOSES FRIDAY

2017-06-14 Thread Scott walent
Share your skills and expertise with the Apache Spark community by speaking at the upcoming Spark Summit Europe 2017 conference. The deadline to submit session proposals is quickly approaching – submit your ideas by this Friday, June 16 to be considered. spark-summit.org/eu-2017-cfp Taking place

Re: UDF percentile_approx

2017-06-14 Thread Takeshi Yamamuro
You can use the function w/o hive and you can try:

scala> Seq(1.0, 8.0).toDF("a").selectExpr("percentile_approx(a, 0.5)").show
+--------------------------------------------+
|percentile_approx(a, CAST(0.5 AS DOUBLE), 1)|
+--------------------------------------------+
|

Re: Java access to internal representation of DataTypes.DateType

2017-06-14 Thread Kazuaki Ishizaki
Does this code help you? https://github.com/apache/spark/blob/master/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java#L156-L194 Kazuaki Ishizaki

Re: UDF percentile_approx

2017-06-14 Thread Riccardo Ferrari
Hi Andres, I can't find the reference; last time I searched for that, I found that 'percentile_approx' is only available via the Hive context. You should register a temp table and use it from there. Best,

Re: Spark Streaming Design Suggestion

2017-06-14 Thread satish lalam
Agree with Jörn. Dynamically creating/deleting topics is nontrivial to manage. With the limited knowledge about your scenario, it appears that you are using topics as some kind of message-type enum. If that is the case, you might be better off with one (or just a few) topics and have a
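
A sketch of the "one topic plus a type field" layout (shown with structured streaming and JSON payloads purely for illustration; topic and broker names are placeholders):

    import org.apache.spark.sql.functions.{col, get_json_object}

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "all-events") // one topic instead of hundreds
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Route on a message-type attribute rather than on the topic name.
    val orders = stream.filter(get_json_object(col("json"), "$.type") === "order")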

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-14 Thread nihed mbarek
Hi
I already saw a project with the same idea.
https://github.com/cloudera-labs/envelope
Regards,

On Wed, 14 Jun 2017 at 04:32, bo yang wrote:
> Thanks Benjamin and Ayan for the feedback! You kind of represent two groups
> of people who need such script tool or not.

Re: Read Local File

2017-06-14 Thread satish lalam
I guess you have already made sure that the paths for your file are exactly the same on each of your nodes. I'd also check the perms on your path. I believe the sample code you pasted is only for testing - and you are already aware that a distributed count on a local file has no benefits. Once I ran
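
For context, the pattern under discussion (a sketch; the path is a placeholder): a file:// URL is resolved on each executor, so the file must exist at the same path on every node:

    // Fails with "file does not exist" on any executor that lacks the file.
    val lines = spark.sparkContext.textFile("file:///data/input.txt")
    println(lines.count())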

having trouble using structured streaming with file sink (parquet)

2017-06-14 Thread AssafMendelson
Hi all, I have recently started assessing structured streaming and ran into a little snag from the beginning. Basically I wanted to read some data, do some basic aggregation and write the result to file:

import org.apache.spark.sql.functions.avg
import
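
For reference, the usual shape of an aggregation the file sink will accept (a sketch; input is assumed to be a streaming DataFrame with an event-time column): the file sink only supports append mode, and append over an aggregate requires a watermark so windows can be finalized:

    import org.apache.spark.sql.functions.{avg, col, window}

    val agg = input
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"))
      .agg(avg(col("value")))

    val query = agg.writeStream
      .format("parquet")
      .option("path", "/tmp/stream-out")
      .option("checkpointLocation", "/tmp/stream-ckpt")
      .outputMode("append")
      .start()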

Re: Spark Streaming Design Suggestion

2017-06-14 Thread Shashi Vishwakarma
I agree with Jörn and Satish. I think I should start grouping similar kinds of messages into a single topic, with some kind of id attached that can be pulled from the Spark streaming application. I can try reducing the number of topics significantly, but still, at the end, I can expect 50+ topics in

[MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread Anthony Thomas
I've been experimenting with MLlib's BlockMatrix for distributed matrix multiplication but consistently run into problems with executors being killed due to memory constraints. The linked gist has a short example of
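
A minimal reproduction of this kind of setup (a sketch; dimensions and block sizes are illustrative, not from the gist):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    val n = 20000L
    val entries = spark.sparkContext.parallelize(0L until n).map { i =>
      MatrixEntry(i, i % 1000, scala.util.Random.nextDouble())
    }
    val a = new CoordinateMatrix(entries, n, 1000).toBlockMatrix(1024, 1024).cache()
    val b = a.transpose.cache()
    val c = a.multiply(b) // the step that pressures executor memory
    c.blocks.count()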

Create dataset from dataframe with missing columns

2017-06-14 Thread Tokayer, Jason M.
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d

Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread John Compitello
Hey Anthony, You're the first person besides myself I've seen mention this. BlockMatrix multiply is not the best method. As far as my team and I can tell, the memory problem stems from the fact that when Spark tries to compute block (i, j) of the product, it tries to manifest all of row i from
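
A back-of-envelope illustration of why that hurts (illustrative numbers only, not from the thread):

    val side = 100000L   // square matrix side
    val block = 1024L    // rows/cols per block
    val blocksPerRow = (side + block - 1) / block  // ~98 blocks
    val bytesPerBlock = block * block * 8          // dense doubles, ~8 MB
    val bytesHeld = blocksPerRow * bytesPerBlock   // ~820 MB in one task

Under those assumptions, a single output block can pull close to a gigabyte of dense blocks into one task, which readily exceeds typical executor memory once several tasks run concurrently.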

Re: UDF percentile_approx

2017-06-14 Thread Andrés Ivaldi
Hello, Riccardo. I was able to make it run; the problem is that HiveContext doesn't exist any more in Spark 2.0.2, as far as I can see. But the enableHiveSupport method exists to add the Hive functionality to SparkSession. To enable this, the spark-hive_2.11 dependency is needed. In the Spark API
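
For reference, the SparkSession wiring described above looks roughly like this (a sketch; it assumes the spark-hive_2.11 artifact is on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("percentile-example")
      .enableHiveSupport() // replaces the old HiveContext in Spark 2.x
      .getOrCreate()

    import spark.implicits._
    Seq(1.0, 8.0).toDF("a").selectExpr("percentile_approx(a, 0.5)").show()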

Re: Read Local File

2017-06-14 Thread Dirceu Semighini Filho
Hello Satish, thanks for your answer.
*I guess you have already made sure that the paths for your file are exactly the same on each of your nodes*
Yes
*I'd also check the perms on your path*
The files are all with the same permissions
*I believe the sample code you pasted is only for testing - and