Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Jacek Laskowski
Hi, If it's Python I can't help. I'm with Scala. Jacek On 14 Aug 2016 9:27 p.m., "Evan Zamir" wrote: > Thanks, but I should have been more clear that I'm trying to do this in PySpark, not Scala. Using an example I found on SO, I was able to implement a Pipeline step

Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Evan Zamir
Thanks, but I should have been more clear that I'm trying to do this in PySpark, not Scala. Using an example I found on SO, I was able to implement a Pipeline step in Python, but it seems it is more difficult (perhaps currently impossible) to make it persist to disk (I tried implementing _to_java

spark ml : auc on extreme distributed data

2016-08-14 Thread Zhiliang Zhu
Hi All,  Here I have a lot of data with around 1,000,000 rows; 97% of them are in the negative class and 3% in the positive class. I applied the Random Forest algorithm to build the model and predict the testing data. For the data preparation, I firstly randomly split all the data as training data
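For reference, a minimal sketch of the evaluation side in Spark ML, assuming a DataFrame named data with "features" and "label" columns (all names here are illustrative, not from the original post); on a 97/3 split, areaUnderPR is often more telling than areaUnderROC:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Assumed input: DataFrame `data` with "features" (Vector) and "label" (0.0/1.0) columns.
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(training)

val predictions = model.transform(test)

// areaUnderROC can look optimistic on heavily imbalanced data; check areaUnderPR as well.
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)

val auPR = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderPR")
  .evaluate(predictions)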

Re: Using spark package XGBoost

2016-08-14 Thread Brandon White
The XGBoost integration with Spark is currently only supported for RDDs; there is a ticket for DataFrame support and folks claim to be working on it. On Aug 14, 2016 8:15 PM, "Jacek Laskowski" wrote: > Hi, > > I've never worked with the library and speaking about sbt setup only. > > It

Re: Using spark package XGBoost

2016-08-14 Thread Jacek Laskowski
Hi, I've never worked with the library and am speaking about the sbt setup only. It appears that the project didn't release 2.11-compatible jars (only 2.10) [1] so you need to build the project yourself and uber-jar it (using the sbt-assembly plugin). [1]
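A minimal build.sbt sketch of the setup Jacek describes; the exact library coordinates are deliberately left out, since the 2.11 jar of the package would first be built and published locally (everything below is an assumption, not the package's documented setup):

// build.sbt -- a sketch: Spark itself stays "provided"; the locally built 2.11 jar of the
// XGBoost package goes into lib/ (unmanaged) or is publishLocal'ed and added as a dependency.
name := "xgboost-app"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
)

// project/plugins.sbt (plugin version illustrative):
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
// Then `sbt assembly` produces the uber-jar to pass to spark-submit.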

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-14 Thread Jacek Laskowski
Hi Jestin, You can find the docs of the latest and greatest Spark at http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/. The jars are at the ASF SNAPSHOT repo at http://repository.apache.org/snapshots/. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/

Re: Simulate serialization when running local

2016-08-14 Thread Jacek Laskowski
Hi Ashic, Yes, there is one - local-cluster[N, cores, memory] - that you can use for simulating a Spark cluster of [N, cores, memory] locally. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478 Pozdrawiam, Jacek Laskowski
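A minimal sketch of using that master URL (the worker count, cores, and memory values below are arbitrary; memory is per worker in MB, and local-cluster is mainly used by Spark's own test suites, so it expects a Spark distribution on the machine, e.g. spark-shell --master local-cluster[2,1,1024]):

import org.apache.spark.sql.SparkSession

// local-cluster[N, coresPerWorker, memoryPerWorkerMB] simulates a standalone cluster locally,
// so closures and task results actually get serialized, unlike plain local[*].
val spark = SparkSession.builder()
  .master("local-cluster[2, 1, 1024]")
  .appName("serialization-check")
  .getOrCreate()

spark.sparkContext.parallelize(1 to 100).map(_ * 2).count()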

Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Jacek Laskowski
Hi, It should just work if you followed the Transformer interface [1]. When you have the transformers, creating a Pipeline is a matter of setting them as additional stages (using Pipeline.setStages [2]). [1]
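A minimal Scala sketch of such a custom step (the class name and the "text" column are illustrative, not from the thread):

import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lower
import org.apache.spark.sql.types.StructType

// Illustrative custom step: lower-cases a hard-coded "text" column.
class LowerCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("lowercase"))
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text", lower(ds("text")))
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): LowerCaseTransformer = defaultCopy(extra)
}

// Wiring it into a Pipeline is just setStages:
val pipeline = new Pipeline().setStages(Array(new LowerCaseTransformer() /*, other stages */))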

Re:

2016-08-14 Thread Jestin Ma
Hi Michael, Mich, and Jacek, thank you for providing good suggestions. I found some ways of getting rid of skew, such as the approaches you have suggested (filtering, broadcasting, joining, unioning), as well as salting my 0-value IDs. Thank you for the help! On Sun, Aug 14, 2016 at 11:33 AM,
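For reference, a minimal sketch of the salting idea Jestin mentions, assuming the df1/df2 setup from this thread (the salt count and column names are illustrative; a full outer join would additionally need de-duplication of unmatched df2 rows, since they are replicated once per salt value):

import org.apache.spark.sql.functions._

// Spread the heavily skewed df1.id = 0 rows over `numSalts` buckets, and replicate the
// unique-keyed df2 so every (id, salt) combination still finds its match.
val numSalts = 16

val salted1 = df1.withColumn("salt", (rand() * numSalts).cast("int"))
val salted2 = df2.withColumn("salt", explode(array((0 until numSalts).map(i => lit(i)): _*)))

// Correct for inner / left joins from df1's side; see the caveat above for full outer joins.
val joined = salted1.join(salted2, Seq("id", "salt"), "left_outer").drop("salt")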

Re: call a mysql stored procedure from spark

2016-08-14 Thread ayan guha
More than technical feasibility, I would ask why invoke a stored procedure for every row? If not, jdbcRdd is a moot point. In case the stored procedure should be invoked from the driver, it can be easily done. Or at most for each partition, at each executor. On 15 Aug 2016 03:06, "Mich Talebzadeh"

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Ted Yu
Looks like the proposed fix was reverted: Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB" This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf. Maybe this was fixed in some other JIRA? On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Thank you very much sir. I forgot to mention that two of these Oracle tables are range partitioned. In that case, what would be the optimum number of partitions, if you can share? Warmest On Sunday, 14 August 2016, 21:37, Mich Talebzadeh wrote: If you have

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
If you have primary keys on these tables then you can parallelise the process of reading data. You have to be careful not to set the number of partitions too high. Certainly there is a balance between the number of partitions supplied to JDBC and the load on the network and the source DB. Assuming
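A minimal sketch of such a parallel JDBC read (the URL, table, credentials, and bounds are placeholders); note that numPartitions is also the number of concurrent connections opened against the source database:

// Reads are split into numPartitions ranges of the numeric partition column.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")
  .option("dbtable", "SCOTT.LARGE_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ID")   // numeric primary key
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "10")     // 10 concurrent connections against the source DB
  .load()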

Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi, There are 4 tables ranging from 10 million to 100 million rows but they all have primary keys. The network is fine but our Oracle is RAC and we can only connect to a designated Oracle node (where we have a DQ account only). We have a limited time window of a few hours to get the required data

Re: Role-based S3 access outside of EMR

2016-08-14 Thread Steve Loughran
On 29 Jul 2016, at 00:07, Everett Anderson wrote: Hey, Just wrapping this up -- I ended up following the instructions to build a custom Spark release with Hadoop 2.7.2,

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
How big are your tables, and is there any issue with the network between your Spark nodes and your Oracle DB that adds to the problem? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Change nullable property in Dataset schema

2016-08-14 Thread Jacek Laskowski
On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki wrote: > import testImplicits._ > test("test") { > val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2), Array(3, 3)), 1).toDS You should just use Seq(...).toDS > val ds2 = ds1.map(e => e) Why are you

parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi Gurus, I have a few large tables in an RDBMS (ours is Oracle). We want to access these tables through Spark JDBC. What is the quickest way of getting data into a Spark DataFrame, say with multiple connections from Spark? Thanking you

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
All of them should be "provided". Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Aug 14, 2016 at 12:26 PM, Mich Talebzadeh
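A sketch of the resulting build.sbt with every Spark dependency marked "provided" (name, versions, and dependencies taken from the other messages in this thread):

// build.sbt -- spark-submit supplies the Spark jars at runtime, so they stay "provided".
name := "scala"
version := "1.0"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"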

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
LOL well the issue here was the dependencies scripted in that shell script, which was modified to add "provided" to it. The script itself still works; just the content of one of its functions had to be edited: function create_sbt_file { SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/${FILE_NAME}.sbt [

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Hi Mich, Yeah, you don't have to worry about it...and that's why you're asking these questions ;-) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sun, Aug

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
The magic does all that (including compiling and submitting with the jar file). It is flexible as it does all this for any Scala program. It creates sub-directories, compiles, submits etc. so I don't have to worry about it. HTH Dr Mich Talebzadeh LinkedIn *

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Hi, You should have all the deps marked "provided" since they're provided by the Spark infrastructure after you spark-submit the uber-jar for the app. What's the "magic" in local.ksh? Why don't you sbt assembly and do spark-submit with the uber-jar? Pozdrawiam, Jacek Laskowski

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Thanks Jacek, I thought there was some dependency issue. This did the trick libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" *libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"* I

Re:

2016-08-14 Thread Michael Armbrust
You can force a broadcast, but with tables that large it's probably not a good idea. However, filtering and then broadcasting one of the joins is likely to get you the benefits of broadcasting (no shuffle on the larger table that would colocate all the skewed tuples onto a single overloaded executor)

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Go to spark-shell and do :imports. You'll see all the imports and you could copy and paste them in your app. (but there are not many honestly and that won't help you much) HiveContext lives in spark-hive. You don't need spark-sql and spark-hive since the latter uses the former as a dependency

Re:

2016-08-14 Thread Jacek Laskowski
Hi Michael, As I understand broadcast joins, Jestin could also use the broadcast function on a dataset to make it broadcast. Jestin could force the broadcast without the trick, hoping it's going to kick off the broadcast. Correct? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering
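For reference, a minimal sketch of that broadcast function (org.apache.spark.sql.functions.broadcast), assuming df2 is the side small enough to ship to every executor:

import org.apache.spark.sql.functions.broadcast

// Marks df2 with a broadcast hint so the join can avoid shuffling df1.
val joined = df1.join(broadcast(df2), Seq("id"))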

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Jacek Laskowski
Thanks Michael for a prompt response! All you said makes sense (glad to have received it from the most trusted source!) spark.read.format("michael").option("header", true).write("notes.adoc") :-) Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0

Re: call a mysql stored procedure from spark

2016-08-14 Thread Mich Talebzadeh
Hi, The link deals with JDBC and states: [image: Inline images 1] So it is only SQL. It lacks functionality for stored procedures that return a result set. This is on an Oracle table: scala> var _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12" _ORACLEserver: String =

Re:

2016-08-14 Thread Michael Armbrust
Have you tried doing the join in two parts (id == 0 and id != 0) and then doing a union of the results? It is possible that with this technique, that the join which only contains skewed data would be filtered enough to allow broadcasting of one side. On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma
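A minimal sketch of that two-part join, assuming the df1/df2 from this thread and shown with an inner join for brevity (the same split applies to the outer join):

import org.apache.spark.sql.functions.broadcast

// Split on the skewed key: df1.id is heavily skewed on 0, df2.id is not.
val skewed    = df1.filter(df1("id") === 0)
val nonSkewed = df1.filter(df1("id") =!= 0)

// The skewed slice of df2 may now be small enough to broadcast.
val skewedJoined    = skewed.join(broadcast(df2.filter(df2("id") === 0)), Seq("id"))
val nonSkewedJoined = nonSkewed.join(df2, Seq("id"))

val result = skewedJoined.union(nonSkewedJoined)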

Re: call a mysql stored procedure from spark

2016-08-14 Thread Michael Armbrust
As described here, you can use the DataSource API to connect to an external database using JDBC. While the dbtable option is usually just a table name, it can also be any valid SQL command that returns a
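A minimal sketch of passing a query instead of a table name to dbtable (the URL, credentials, and query are placeholders):

// The derived table is pushed down and executed on the database side.
val counts = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "(SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category) AS t")
  .option("user", "user")
  .option("password", "password")
  .load()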

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Michael Armbrust
Anytime you see JaninoRuntimeException you are seeing a bug in our code generation. If you can come up with a small example that causes the problem it would be very helpful if you could open a JIRA. On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar wrote: > I see a similar

Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Michael Armbrust
There are two type systems in play here: Spark SQL's and Scala's. From the Scala side, this is type-safe. After calling as[String] the Dataset will only return Strings. It is impossible to ever get a class cast exception unless you do your own incorrect casting after the fact. Underneath the
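For reference, the expression from the subject line behaves like this (a sketch assuming a spark-shell session where spark.implicits._ is in scope):

import spark.implicits._

// Spark SQL inserts a cast from int to string under the covers; from the Scala side
// the result is a type-safe Dataset[String].
val ds = (0 to 9).toDF("num").as[String]
ds.collect()   // Array("0", "1", ..., "9")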

Re: Does Spark SQL support indexes?

2016-08-14 Thread Michael Armbrust
Using df.write.partitionBy is similar to a coarse-grained, clustered index in a traditional database. You can't use it on temporary tables, but it will let you efficiently select small parts of a much larger table. On Sat, Aug 13, 2016 at 11:13 PM, Jörn Franke wrote: >
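A minimal sketch of that pattern (paths and column names are placeholders):

import org.apache.spark.sql.functions.col

// Partition on a column you frequently filter by; each value becomes its own directory.
df.write.partitionBy("event_date").parquet("/data/events")

// Filtering on the partition column prunes the read down to the matching directories.
val oneDay = spark.read.parquet("/data/events").filter(col("event_date") === "2016-08-14")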

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
The issue is that on the Spark shell this works OK: Spark context Web UI available at http://50.140.197.217:5 Spark context available as 'sc' (master = local, app id = local-1471191662017). Spark session available as 'spark'. Welcome to [Spark version ASCII banner]

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
HiveContext is gone. SparkSession now combines the functionality of SQLContext and HiveContext (if Hive support is available). On Sun, Aug 14, 2016 at 12:12 PM, Mich Talebzadeh wrote: > Thanks Koert, > > I did that before as well. Anyway this is dependencies > >

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Thanks Koert, I did that before as well. Anyway these are the dependencies: libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" and the error: [info]

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
You cannot mix Spark 1 and Spark 2 jars. Change this: libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" to: libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh wrote: > Hi, > > In Spark

Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Hi, In Spark 2 I am using sbt or mvn to compile my Scala program. This used to compile and run perfectly with Spark 1.6.1 but now it is throwing an error. I believe the problem is here. I have: name := "scala" version := "1.0" scalaVersion := "2.11.7" libraryDependencies += "org.apache.spark" %%

Re: Flattening XML in a DataFrame

2016-08-14 Thread Sreekanth Jella
Hi Hyukjin Kwon, Thank you for the reply. There are several types of XML documents with different schemas which need to be parsed, and the tag names are not known in advance. All we know is the XSD for the given XML. Is it possible to get the same results even when we do not know the XML tags, like

Re:

2016-08-14 Thread Mich Talebzadeh
Hi Jestin, You already have the skewed column in the join condition, correct? This is basically what you are doing, assuming rs is your result set below: val rs = df1.join(df2, df1("id") === df2("id"), "fullouter") What is the percentage of df1.id = 0? Can you register both tables as temporary and

Re:

2016-08-14 Thread Jestin Ma
Hi Mich, do you mean using the skewed column as a join condition? I tried repartition(skewed column, unique column) but had no success, possibly because the join was still hash-partitioning on just the skewed column after I called repartition. On Sun, Aug 14, 2016 at 1:49 AM, Mich Talebzadeh

Re: Using spark package XGBoost

2016-08-14 Thread janardhan shetty
Any leads on how to achieve this? On Aug 12, 2016 6:33 PM, "janardhan shetty" wrote: > I tried using the *sparkxgboost* package in the build.sbt file but it failed. > Spark 2.0 > Scala 2.11.8 > > Error: > [warn] http://dl.bintray.com/spark-packages/maven/ >

Re: mesos or kubernetes ?

2016-08-14 Thread Gurvinder Singh
On 08/13/2016 08:24 PM, guyoh wrote: > My company is trying to decide whether to use Kubernetes or Mesos. Since we are planning to use Spark in the near future, I was wondering what is the best choice for us. > Thanks, > Guy Both Kubernetes and Mesos enable you to share your

Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread Mich Talebzadeh
There are two distinct parts here: optimisation + execution. Spark does not have a Cost Based Optimizer (CBO) yet but that does not matter for now. When we do such an operation, say an outer join between the (s) and (t) DFs below, we see scala> val rs = s.join(t,s("time_id")===t("time_id"),
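A minimal sketch of checking this, using the s/t join from Mich's snippet (s and t stand for the two DataFrames): explain() prints the plans Catalyst produced, and the RDD returned by .rdd is built from that same physical plan.

// Assumes s and t are the two DataFrames from the example above.
val rs = s.join(t, s("time_id") === t("time_id"), "outer")
rs.explain(true)   // parsed, analyzed, optimized logical plans + physical plan
val rdd = rs.rdd   // the RDD lineage starts at the optimized physical plan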

Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread ayan guha
I do not think so. From what I understand, Spark will still use Catalyst to plan the join. A DF always has an RDD underneath, but that does not mean any action will force a less optimal path. On Sun, Aug 14, 2016 at 3:04 PM, mayur bhole wrote: > HI All, > > Lets say, we have > > val df =

Re:

2016-08-14 Thread Mich Talebzadeh
Can you make the join more selective by using the skewed column ID plus another column that has valid unique values (repartitioning according to a column I know contains unique values)? HTH Dr Mich Talebzadeh LinkedIn *

[no subject]

2016-08-14 Thread Jestin Ma
Hi, I'm currently trying to perform an outer join between two DataFrames/Sets, one ~150 GB and the other ~50 GB, on a column, id. df1.id is skewed in that there are many 0's, the rest being unique IDs. df2.id is not skewed. If I filter df1.id != 0, then the join works well. If I don't, then the

Re: Does Spark SQL support indexes?

2016-08-14 Thread Jörn Franke
Use a format that has built-in indexes, such as Parquet or ORC. Do not forget to sort the data on the columns that you filter on. > On 14 Aug 2016, at 05:03, Taotao.Li wrote: > > > hi, guys, does Spark SQL support indexes? if so, how can I create an index > on my
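A minimal sketch of that advice (the column name and path are placeholders): sorting on the filter column before writing keeps Parquet's per-row-group min/max statistics tight, so predicate pushdown can skip most of the file at read time.

// Sort within partitions on the column that queries filter on, then write columnar files.
df.sortWithinPartitions("customer_id")
  .write
  .parquet("/data/customers_sorted")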