Re: type-safe join in the new DataSet API?

2016-11-26 Thread Koert Kuipers
although this is correct, KeyValueGroupedDataset.cogroup requires one to implement their own join logic with Iterator functions. it's fun to do that, and i appreciate the flexibility it gives, but i would not consider it a good solution for someone who just wants to do a typed join On Thu, Nov
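
A minimal sketch of what such a hand-rolled typed join looks like with cogroup on Spark 2.x; the Person/Address case classes and the inner-join semantics are assumptions for illustration, not from the thread:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes, purely for illustration.
case class Person(id: Long, name: String)
case class Address(personId: Long, city: String)

val spark = SparkSession.builder.appName("typed-join-sketch").getOrCreate()
import spark.implicits._

val people    = Seq(Person(1, "Ann"), Person(2, "Bob")).toDS()
val addresses = Seq(Address(1, "Oslo")).toDS()

// cogroup hands you the raw iterators per key; the join semantics are
// yours to implement -- here, an inner join emitting (name, city) pairs.
val joined = people.groupByKey(_.id)
  .cogroup(addresses.groupByKey(_.personId)) { (_, ps, as) =>
    val right = as.toSeq // materialize: an Iterator can only be traversed once
    for (p <- ps; a <- right) yield (p.name, a.city)
  }
```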

Re: Third party library

2016-11-26 Thread kant kodali
I would say instead of LD_LIBRARY_PATH you might want to use java.library.path in the following way: java -Djava.library.path=/path/to/my/library, or pass java.library.path along with spark-submit. On Sat, Nov 26, 2016 at 6:44 PM, Gmail wrote: > Maybe you've already checked
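
For the spark-submit route, the JVM option has to reach both the driver and the executors; a sketch using Spark's extraJavaOptions settings (the application class, jar name, and library path are placeholders, not from the thread):

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Djava.library.path=/path/to/my/library" \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=/path/to/my/library" \
  --class MyApp my-app.jar
```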

Re: Third party library

2016-11-26 Thread Gmail
Maybe you've already checked these out. Some basic questions that come to my mind are: 1) is this library "foolib" or "foo-C-library" available on the worker node? 2) if yes, is it accessible by the user/program (rwx)? Thanks, Vasu. > On Nov 26, 2016, at 5:08 PM, kant kodali

Re: Third party library

2016-11-26 Thread kant kodali
If it is working for a standalone program, I would think you can apply the same settings across all the Spark worker and client machines and give that a try. Let's start with that. On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha wrote: > Just subscribed to Spark User. So,

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Anton Okolnychyi
Hi guys, I also experienced a situation where Spark 1.6.2 ignored my hint to do a broadcast join (i.e. broadcast(df)) with a small dataset. However, this happened in only 1 of 3 cases. Setting the "spark.sql.autoBroadcastJoinThreshold" property did not have any impact either. All 3 cases work
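
For reference, a sketch of how that threshold is typically raised on the 1.6.x line (the 100 MB value is an arbitrary example):

```scala
// Spark 1.6.x: set via SQLContext before the join is planned.
// The value is in bytes; -1 disables automatic broadcast entirely.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100 * 1024 * 1024).toString)
```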

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Benyi Wang
I think your dataframes are converted from RDDs. Are those RDDs computed, or read from files directly? I guess that might affect how Spark computes the execution plan. Try this: save the data frame that will be broadcast to HDFS, and read it back into a dataframe. Then do the join and check the
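
A sketch of the suggested experiment, assuming Parquet on HDFS, a Spark 1.6 SQLContext, and the hypothetical dataframes a and b from the original question:

```scala
import org.apache.spark.sql.functions.broadcast

// Persist the small side so Spark can read accurate size statistics,
// then read it back and retry the broadcast join.
b.write.mode("overwrite").parquet("hdfs:///tmp/broadcast_side")
val b2 = sqlContext.read.parquet("hdfs:///tmp/broadcast_side")

val c = a.join(broadcast(b2), "id")
c.explain() // look for BroadcastHashJoin instead of SortMergeJoin
```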

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Swapnil Shinde
I am using Spark 1.6.3, and below is the real plan (a, b, c above were just for illustration purposes): == Physical Plan == Project [ltt#3800 AS ltt#3814,CASE WHEN isnull(etv_demo_id#3813) THEN mr_demo_id#3801 ELSE etv_demo_id#3813 AS etv_demo_id#3815] +- SortMergeOuterJoin

Re: Third party library

2016-11-26 Thread vineet chadha
Just subscribed to Spark User, so forwarding the message again. On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha wrote: > Thanks, Kant. Can you give me a sample program which allows me to call JNI > from an executor task? I have JNI working in a standalone program in >
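
A sketch of the usual pattern for calling JNI from executor tasks: load the native library lazily in a singleton object so each executor JVM loads it exactly once. MySimpleApp and fooMethod follow the names used in this thread; the library name and everything else are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Native wrapper; the shared library must be resolvable via
// java.library.path (or LD_LIBRARY_PATH) on every worker node.
class MySimpleApp {
  @native def fooMethod(foo: String): String
}

// Loading inside an object guarantees System.loadLibrary runs once per
// executor JVM, on the executor itself -- not on the driver.
object NativeHolder {
  System.loadLibrary("foo") // hypothetical library name, i.e. libfoo.so
  val app = new MySimpleApp
}

object JniFromExecutors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jni-demo"))
    val out = sc.parallelize(Seq("a", "b", "c"), 3)
      .map(s => NativeHolder.app.fooMethod(s)) // runs on the executors
      .collect()
    out.foreach(println)
    sc.stop()
  }
}
```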

Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameReader.html#json(org.apache.spark.rdd.RDD) You can pass an RDD to spark.read.json. // Spark here is SparkSession. Also, it works completely fine with a smaller dataset in a table, but with 1B records it takes forever and more
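
A sketch of that API, assuming a SparkSession named spark as in the message; the sample records are made up:

```scala
import org.apache.spark.rdd.RDD

val rdd: RDD[String] = spark.sparkContext.parallelize(Seq(
  """{"id": 1, "name": "a"}""",
  """{"id": 2, "name": "b"}"""
))

// DataFrameReader.json(RDD[String]): each element must be one JSON record.
val df = spark.read.json(rdd)
df.show()
```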

Re: Third party library

2016-11-26 Thread kant kodali
Yes, this is a Java JNI question, nothing to do with Spark really. java.lang.UnsatisfiedLinkError typically means the way you set up LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other cases but not this one. On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Benyi Wang
Could you post the result of explain, i.e. `c.explain`? If it is a broadcast join, you will see it in the explain output. On Sat, Nov 26, 2016 at 10:51 AM, Swapnil Shinde wrote: > Hello > I am trying a broadcast join on dataframes but it is still doing > SortMergeJoin. I even try

Re: Third party library

2016-11-26 Thread Reynold Xin
That's just standard JNI and has nothing to do with Spark, does it? On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha wrote: > Thanks Reynold for the quick reply. > > I have tried the following: > > class MySimpleApp { > // ---Native methods > @native def fooMethod (foo:

Re: Dataframe broadcast join hint not working

2016-11-26 Thread Selvam Raman
Hi, which version of Spark are you using? Anything less than 10 MB is automatically converted to a broadcast join in Spark. Thanks, Selvam R On Sat, Nov 26, 2016 at 6:51 PM, Swapnil Shinde wrote: > Hello > I am trying a broadcast join on dataframes but it is still doing >

Dataframe broadcast join hint not working

2016-11-26 Thread Swapnil Shinde
Hello, I am trying a broadcast join on dataframes but it is still doing SortMergeJoin. I even tried setting spark.sql.autoBroadcastJoinThreshold higher, but still no luck. Related piece of code: val c = a.join(broadcast(b), "id") On a side note, if I do SizeEstimator.estimate(b) and it
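
For context, the broadcast hint itself comes from org.apache.spark.sql.functions; a minimal sketch with the hypothetical dataframes a and b from the question:

```scala
import org.apache.spark.sql.functions.broadcast

// Mark b explicitly as the broadcast side; the hint only takes effect
// when Spark can actually plan a BroadcastHashJoin for this join type.
val c = a.join(broadcast(b), "id")
c.explain() // verify: BroadcastHashJoin vs. SortMergeJoin in the plan
```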

Re: Apache Spark or Spark-Cassandra-Connector doesnt look like it is reading multiple partitions in parallel.

2016-11-26 Thread Anastasios Zouzias
Hi there, spark.read.json usually takes a filesystem path (usually HDFS) pointing to a file that contains one JSON record per line. See also http://spark.apache.org/docs/latest/sql-programming-guide.html Hence, in your case val df4 = spark.read.json(rdd) // This line takes forever seems wrong. I
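
The path-based form described here expects line-delimited JSON; a small sketch, with the file path and contents as assumptions:

```scala
// people.json on HDFS, one complete JSON object per line:
//   {"name": "Ann", "age": 32}
//   {"name": "Bob", "age": 41}
val df = spark.read.json("hdfs:///data/people.json") // hypothetical path
df.printSchema()
```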

Re: UDF for gradient ascent

2016-11-26 Thread Meeraj Kunnumpurath
One thing I noticed inside the UDF is that the original column names from the data frame have disappeared and the columns are called col1, col2, etc. Regards, Meeraj On Sat, Nov 26, 2016 at 7:31 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote: > Hello, > > I have a dataset of features on

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-26 Thread Koert Kuipers
i agree that off-heap memory usage is unpredictable. when we used rdds the memory was mostly on heap and total usage predictable, and we almost never had yarn killing executors. now with dataframes the memory usage is both on and off heap, and we have no way of limiting the off-heap memory usage
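
One commonly used knob for this situation on YARN is the executor memory overhead, which reserves container headroom for off-heap allocations; a sketch, with the sizes as placeholder values:

```scala
import org.apache.spark.SparkConf

// 4 GB heap plus 2 GB of off-heap headroom per executor; YARN kills the
// container only when heap + overhead is exceeded, not heap alone.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "2048") // MB, on the 1.x/2.0 line
```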

UDF for gradient ascent

2016-11-26 Thread Meeraj Kunnumpurath
Hello, I have a dataset of features on which I want to compute the likelihood value for implementing gradient ascent to estimate coefficients. I have written a UDF that computes the probability function on each feature, as shown below. def getLikelihood(cfs : List[(String, Double)], df:
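
The getLikelihood definition is cut off above; a generic sketch of the idea, assuming a logistic model where a UDF computes the per-row log-likelihood from a coefficient list (the "label" column and the overall schema are assumptions, not from the thread):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, udf}

// Log-likelihood of one observation under a logistic model:
//   p = sigmoid(w . x),  ll = y*log(p) + (1-y)*log(1-p)
def getLikelihood(cfs: List[(String, Double)], df: DataFrame): DataFrame = {
  val featureCols = cfs.map(_._1)
  val weights     = cfs.map(_._2)
  val ll = udf { (y: Double, xs: Seq[Double]) =>
    val z = weights.zip(xs).map { case (w, x) => w * x }.sum
    val p = 1.0 / (1.0 + math.exp(-z))
    y * math.log(p) + (1 - y) * math.log(1 - p)
  }
  df.withColumn("logLikelihood",
    ll(df("label"), array(featureCols.map(df(_)): _*)))
}
```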

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel. Here is my code using