Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Jacek Laskowski
Hi,

If it's Python I can't help. I'm with Scala.

Jacek

On 14 Aug 2016 9:27 p.m., "Evan Zamir"  wrote:

> Thanks, but I should have been more clear that I'm trying to do this in
> PySpark, not Scala. Using an example I found on SO, I was able to implement
> a Pipeline step in Python, but it seems it is more difficult (perhaps
> currently impossible) to make it persist to disk (I tried implementing
> _to_java method to no avail). Any ideas about that?
>
> On Sun, Aug 14, 2016 at 6:02 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> It should just work if you followed the Transformer interface [1].
>> When you have the transformers, creating a Pipeline is a matter of
>> setting them as additional stages (using Pipeline.setStages [2]).
>>
>> [1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
>> [2] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L107
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Aug 12, 2016 at 9:19 AM, evanzamir  wrote:
>> > I'm building an LDA Pipeline, currently with 4 steps, Tokenizer,
>> > StopWordsRemover, CountVectorizer, and LDA. I would like to add more
>> steps,
>> > for example, stemming and lemmatization, and also 1-gram and 2-grams
>> (which
>> > I believe is not supported by the default NGram class). Is there a way
>> to
>> > add these steps? In sklearn, you can create classes with fit() and
>> > transform() methods, and that should be enough. Is that true in Spark
>> ML as
>> > well (or something similar)?
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-custom-steps-to-Pipeline-models-tp27522.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>


Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Evan Zamir
Thanks, but I should have been more clear that I'm trying to do this in
PySpark, not Scala. Using an example I found on SO, I was able to implement
a Pipeline step in Python, but it seems it is more difficult (perhaps
currently impossible) to make it persist to disk (I tried implementing a
_to_java method, to no avail). Any ideas about that?

On Sun, Aug 14, 2016 at 6:02 PM Jacek Laskowski  wrote:

> Hi,
>
> It should just work if you followed the Transformer interface [1].
> When you have the transformers, creating a Pipeline is a matter of
> setting them as additional stages (using Pipeline.setStages [2]).
>
> [1]
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
> [2]
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L107
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Aug 12, 2016 at 9:19 AM, evanzamir  wrote:
> > I'm building an LDA Pipeline, currently with 4 steps, Tokenizer,
> > StopWordsRemover, CountVectorizer, and LDA. I would like to add more
> steps,
> > for example, stemming and lemmatization, and also 1-gram and 2-grams
> (which
> > I believe is not supported by the default NGram class). Is there a way to
> > add these steps? In sklearn, you can create classes with fit() and
> > transform() methods, and that should be enough. Is that true in Spark ML
> as
> > well (or something similar)?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-custom-steps-to-Pipeline-models-tp27522.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>


spark ml : auc on extreme distributed data

2016-08-14 Thread Zhiliang Zhu
Hi All, 
Here I have a dataset with around 1,000,000 rows; 97% of them are negative 
class and 3% of them are positive class. I applied the Random Forest algorithm to 
build the model and predict the testing data.
For the data preparation:
i. first randomly split all the data into training data and testing data at a 0.7 : 0.3 ratio
ii. leave the testing data unchanged, so its negative-to-positive class ratio stays at 97:3
iii. rebalance the training data to a 50:50 negative-to-positive class ratio by sampling within each class
iv. build the RF model on the training data and predict the testing data
By tuning the algorithm parameters and the feature engineering (PCA etc.), the 
AUC on the testing data is always above 0.8, or much higher...
This leaves me somewhat confused... It seems that the model, or the AUC, depends 
a lot on the original data distribution. In effect, I would like to know: for this 
data distribution, what would the AUC be for a random guess? And what would the 
AUC be for an arbitrary data distribution?
Thanks in advance~~
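
For concreteness, a minimal spark.ml sketch of the evaluation step described above (column names and the class-balanced training set are placeholders, not my real pipeline):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// 0.7 / 0.3 split; the test set is left untouched so it keeps the 97:3 ratio.
val Array(training, testing) = df.randomSplit(Array(0.7, 0.3), seed = 42L)

// `balancedTraining` stands in for the 50:50 resampled version of `training`.
val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(balancedTraining)

val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(model.transform(testing))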

Re: Using spark package XGBoost

2016-08-14 Thread Brandon White
The XGBoost integration with Spark currently only supports RDDs; there is a
ticket for DataFrame support and folks claim to be working on it.

On Aug 14, 2016 8:15 PM, "Jacek Laskowski"  wrote:

> Hi,
>
> I've never worked with the library and speaking about sbt setup only.
>
> It appears that the project didn't release 2.11-compatible jars (only
> 2.10) [1] so you need to build the project yourself and uber-jar it
> (using sbt-assembly plugin).
>
> [1] https://spark-packages.org/package/rotationsymmetry/sparkxgboost
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 14, 2016 at 7:13 AM, janardhan shetty
>  wrote:
> > Any leads on how to achieve this?
> >
> > On Aug 12, 2016 6:33 PM, "janardhan shetty" 
> wrote:
> >>
> >> I tried using  sparkxgboost package in build.sbt file but it failed.
> >> Spark 2.0
> >> Scala 2.11.8
> >>
> >> Error:
> >>  [warn]
> >> http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
> >>[warn] ::
> >>[warn] ::  FAILED DOWNLOADS::
> >>[warn] :: ^ see resolution messages for details  ^ ::
> >>[warn] ::
> >>[warn] ::
> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
> >>[warn] ::
> >> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)
> >>
> >> build.sbt:
> >>
> >> scalaVersion := "2.11.8"
> >>
> >> libraryDependencies ++= {
> >>   val sparkVersion = "2.0.0-preview"
> >>   Seq(
> >> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
> >> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
> >> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
> >> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
> >>   )
> >> }
> >>
> >> resolvers += "Spark Packages Repo" at
> >> "http://dl.bintray.com/spark-packages/maven;
> >>
> >> libraryDependencies += "rotationsymmetry" % "sparkxgboost" %
> >> "0.2.1-s_2.10"
> >>
> >> assemblyMergeStrategy in assembly := {
> >>   case PathList("META-INF", "MANIFEST.MF")   =>
> >> MergeStrategy.discard
> >>   case PathList("javax", "servlet", xs @ _*) =>
> >> MergeStrategy.first
> >>   case PathList(ps @ _*) if ps.last endsWith ".html" =>
> >> MergeStrategy.first
> >>   case "application.conf"=>
> >> MergeStrategy.concat
> >>   case "unwanted.txt"=>
> >> MergeStrategy.discard
> >>
> >>   case x => val oldStrategy = (assemblyMergeStrategy in assembly).value
> >> oldStrategy(x)
> >>
> >> }
> >>
> >>
> >>
> >>
> >> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty <
> janardhan...@gmail.com>
> >> wrote:
> >>>
> >>> Is there a dataframe version of XGBoost in spark-ml ?.
> >>> Has anyone used sparkxgboost package ?
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Using spark package XGBoost

2016-08-14 Thread Jacek Laskowski
Hi,

I've never worked with the library and speaking about sbt setup only.

It appears that the project didn't release 2.11-compatible jars (only
2.10) [1] so you need to build the project yourself and uber-jar it
(using sbt-assembly plugin).

[1] https://spark-packages.org/package/rotationsymmetry/sparkxgboost
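
One possible route, sketched only (the sbt-assembly version below is an assumption): clone the sparkxgboost sources, cross-build them for Scala 2.11 and bundle the result with sbt-assembly, e.g.

// project/plugins.sbt in the cloned sparkxgboost checkout
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// then build against Scala 2.11 and pass the resulting jar to spark-submit:
//   sbt ++2.11.8 assembly
//   spark-submit --jars <path-to-assembly-jar> ...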

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 7:13 AM, janardhan shetty
 wrote:
> Any leads on how to achieve this?
>
> On Aug 12, 2016 6:33 PM, "janardhan shetty"  wrote:
>>
>> I tried using  sparkxgboost package in build.sbt file but it failed.
>> Spark 2.0
>> Scala 2.11.8
>>
>> Error:
>>  [warn]
>> http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
>>[warn] ::
>>[warn] ::  FAILED DOWNLOADS::
>>[warn] :: ^ see resolution messages for details  ^ ::
>>[warn] ::
>>[warn] ::
>> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
>>[warn] ::
>> rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)
>>
>> build.sbt:
>>
>> scalaVersion := "2.11.8"
>>
>> libraryDependencies ++= {
>>   val sparkVersion = "2.0.0-preview"
>>   Seq(
>> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
>> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
>>   )
>> }
>>
>> resolvers += "Spark Packages Repo" at
>> "http://dl.bintray.com/spark-packages/maven;
>>
>> libraryDependencies += "rotationsymmetry" % "sparkxgboost" %
>> "0.2.1-s_2.10"
>>
>> assemblyMergeStrategy in assembly := {
>>   case PathList("META-INF", "MANIFEST.MF")   =>
>> MergeStrategy.discard
>>   case PathList("javax", "servlet", xs @ _*) =>
>> MergeStrategy.first
>>   case PathList(ps @ _*) if ps.last endsWith ".html" =>
>> MergeStrategy.first
>>   case "application.conf"=>
>> MergeStrategy.concat
>>   case "unwanted.txt"=>
>> MergeStrategy.discard
>>
>>   case x => val oldStrategy = (assemblyMergeStrategy in assembly).value
>> oldStrategy(x)
>>
>> }
>>
>>
>>
>>
>> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty 
>> wrote:
>>>
>>> Is there a dataframe version of XGBoost in spark-ml ?.
>>> Has anyone used sparkxgboost package ?
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-14 Thread Jacek Laskowski
Hi Jestin,

You can find the docs of the latest and greatest Spark at
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/.

The jars are at the ASF SNAPSHOT repo at
http://repository.apache.org/snapshots/.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma  wrote:
> If we want to use versions of Spark beyond the official 2.0.0 release,
> specifically on Maven + Java, what steps should we take to upgrade? I can't
> find the newer versions on Maven central.
>
> Thank you!
> Jestin

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Simulate serialization when running local

2016-08-14 Thread Jacek Laskowski
Hi Ashic,

Yes, there is one - local-cluster[N, cores, memory] - that you can use
for simulating a Spark cluster of [N, cores, memory] locally.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478
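
A rough sketch of using it (Spark 2.0 SparkSession shown; the worker count, cores and memory below are arbitrary):

import org.apache.spark.sql.SparkSession

// 2 workers x 1 core x 1024 MB each, running as separate executor JVMs,
// so tasks and results must actually serialize as on a real cluster.
val spark = SparkSession.builder()
  .master("local-cluster[2,1,1024]")
  .appName("serialization-check")
  .getOrCreate()

spark.sparkContext.parallelize(1 to 10).map(_ * 2).collect()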

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Aug 10, 2016 at 10:24 AM, Ashic Mahtab  wrote:
> Hi,
> Is there a way to simulate "networked" spark when running local (i.e.
> master=local[4])? Ideally, some setting that'll ensure any "Task not
> serializable" errors are caught during local testing? I seem to vaguely
> remember something, but am having trouble pinpointing it.
>
> Cheers,
> Ashic.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to add custom steps to Pipeline models?

2016-08-14 Thread Jacek Laskowski
Hi,

It should just work if you followed the Transformer interface [1].
When you have the transformers, creating a Pipeline is a matter of
setting them as additional stages (using Pipeline.setStages [2]).

[1] 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala
[2] 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L107
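
A minimal sketch of such a custom Transformer wired into a Pipeline (made-up column names, not the original poster's code):

import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lower
import org.apache.spark.sql.types.StructType

// A trivial stage that lower-cases the "text" column in place.
class LowerCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("lowerCaser"))
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text", lower(ds("text")))
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): LowerCaser = defaultCopy(extra)
}

val pipeline = new Pipeline().setStages(Array(
  new LowerCaser(),
  new Tokenizer().setInputCol("text").setOutputCol("tokens"),
  new CountVectorizer().setInputCol("tokens").setOutputCol("features")
))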

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Aug 12, 2016 at 9:19 AM, evanzamir  wrote:
> I'm building an LDA Pipeline, currently with 4 steps, Tokenizer,
> StopWordsRemover, CountVectorizer, and LDA. I would like to add more steps,
> for example, stemming and lemmatization, and also 1-gram and 2-grams (which
> I believe is not supported by the default NGram class). Is there a way to
> add these steps? In sklearn, you can create classes with fit() and
> transform() methods, and that should be enough. Is that true in Spark ML as
> well (or something similar)?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-custom-steps-to-Pipeline-models-tp27522.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:

2016-08-14 Thread Jestin Ma
Hi Michael, Mich, and Jacek, thank you for providing good suggestions. I
found some ways of getting rid of skew, such as the approaches you have
suggested (filtering, broadcasting, joining, unioning), as well as salting
my 0-value IDs.

Thank you for the help!
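
For reference, a rough sketch of the split/broadcast/union technique from the quoted thread below (df1, df2, the hot key and the join type are placeholders):

import org.apache.spark.sql.functions.{broadcast, col}

// Join the well-distributed keys normally...
val rest       = df1.filter(col("id") =!= 0)
val joinedRest = rest.join(df2, Seq("id"), "left_outer")

// ...and join the single hot key against a filtered, broadcastable slice of df2.
val hot       = df1.filter(col("id") === 0)
val joinedHot = hot.join(broadcast(df2.filter(col("id") === 0)), Seq("id"), "left_outer")

val joined = joinedRest.union(joinedHot)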


On Sun, Aug 14, 2016 at 11:33 AM, Michael Armbrust 
wrote:

> You can force a broadcast, but with tables that large its probably not a
> good idea.  However, filtering and then broadcasting one of the joins is
> likely to get you the benefits of broadcasting (no shuffle on the larger
> table that will colocate all the skewed tuples to a single overloaded
> executor) without attempting to broadcast something thats too large.
>
> On Sun, Aug 14, 2016 at 11:02 AM, Jacek Laskowski  wrote:
>
>> Hi Michael,
>>
>> As I understand broadcast joins, Jestin could also use broadcast
>> function on a dataset to make it broadcast. Jestin could force the
>> brodcast without the trick hoping it's gonna kick off brodcast.
>> Correct?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Aug 14, 2016 at 9:51 AM, Michael Armbrust
>>  wrote:
>> > Have you tried doing the join in two parts (id == 0 and id != 0) and
>> then
>> > doing a union of the results?  It is possible that with this technique,
>> that
>> > the join which only contains skewed data would be filtered enough to
>> allow
>> > broadcasting of one side.
>> >
>> > On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
>> > wrote:
>> >>
>> >> Hi, I'm currently trying to perform an outer join between two
>> >> DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.
>> >>
>> >> df1.id is skewed in that there are many 0's, the rest being unique
>> IDs.
>> >>
>> >> df2.id is not skewed. If I filter df1.id != 0, then the join works
>> well.
>> >> If I don't, then the join does not complete for a very, very long time.
>> >>
>> >> I have diagnosed this problem due to the hashpartitioning on IDs,
>> >> resulting in one partition containing many values due to data skew. One
>> >> executor ends up reading most of the shuffle data, and writing all of
>> the
>> >> shuffle data, as shown below.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Shown above is the task in question assigned to one executor.
>> >>
>> >>
>> >>
>> >> This screenshot comes from one of the executors, showing one single
>> thread
>> >> spilling sort data since the executor cannot hold 90%+ of the ~200 GB
>> result
>> >> in memory.
>> >>
>> >> Moreover, looking at the event timeline, I find that the executor on
>> that
>> >> task spends about 20% time reading shuffle data, 70% computation, and
>> 10%
>> >> writing output data.
>> >>
>> >> I have tried the following:
>> >>
>> >> "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
>> >> - This doesn't seem to have an effect since now I have
>> hundreds/thousands
>> >> of keys with tens of thousands of occurrences.
>> >> - Should I increase N? Is there a way to just do random.mod(N) instead
>> of
>> >> monotonically_increasing_id()?
>> >>
>> >> Repartitioning according to column I know contains unique values
>> >>
>> >> - This is overridden by Spark's sort-based shuffle manager which hash
>> >> repartitions on the skewed column
>> >>
>> >> - Is it possible to change this? Or will the join column need to be
>> hashed
>> >> and partitioned on for joins to work
>> >>
>> >> Broadcasting does not work for my large tables
>> >>
>> >> Increasing/decreasing spark.sql.shuffle.partitions does not remedy the
>> >> skewed data problem as 0-product values are still being hashed to the
>> same
>> >> partition.
>> >>
>> >>
>> >> --
>> >>
>> >> What I am considering currently is doing the join at the RDD level,
>> but is
>> >> there any level of control which can solve my skewed data problem?
>> Other
>> >> than that, see the bolded question.
>> >>
>> >> I would appreciate any suggestions/tips/experience with this. Thank
>> you!
>> >>
>> >
>>
>
>


Re: call a mysql stored procedure from spark

2016-08-14 Thread ayan guha
Setting aside technical feasibility, I would ask why you want to invoke a stored
procedure for every row. If you don't, JdbcRDD is a moot point.

If the stored procedure only needs to be invoked from the driver, that can easily
be done; at most, it could be invoked once per partition, on each executor.
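
For the driver-side case, a bare-JDBC sketch (connection details are placeholders based on the thread below; the Oracle JDBC driver jar must be on the classpath):

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:oracle:thin:@rhes564:1521:mydb12", "scratchpad", "xxx")
try {
  // CallableStatement is the standard JDBC way to invoke a stored procedure.
  val stmt = conn.prepareCall("{call weights_sp}")
  stmt.execute()
  stmt.close()
} finally {
  conn.close()
}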
On 15 Aug 2016 03:06, "Mich Talebzadeh"  wrote:

> Hi,
>
> The link deals with JDBC and states:
>
> [image: Inline images 1]
>
> So it is only SQL. It lacks functionality for stored procedures that
> return a result set.
>
> This is on an Oracle table
>
> scala>  var _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12"
> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
> scala>  var _username = "scratchpad"
> _username: String = scratchpad
> scala> var _password = "xxx"
> _password: String = oracle
>
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | *"dbtable" -> "exec weights_sp",*
>  | "user" -> _username,
>  | "password" -> _password)).load
> java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
>
>
> and that stored procedure exists in Oracle
>
> scratch...@mydb12.mich.LOCAL> desc weights_sp
> PROCEDURE weights_sp
>
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 August 2016 at 17:42, Michael Armbrust 
> wrote:
>
>> As described here
>> ,
>> you can use the DataSource API to connect to an external database using
>> JDBC.  While the dbtable option is usually just a table name, it can
>> also be any valid SQL command that returns a table when enclosed in
>> (parentheses).  I'm not certain, but I'd expect you could use this feature
>> to invoke a stored procedure and return the results as a DataFrame.
>>
>> On Sat, Aug 13, 2016 at 10:40 AM, sujeet jog 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to call a stored procedure using spark ?
>>>
>>>
>>> thanks,
>>> Sujeet
>>>
>>
>>
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Ted Yu
Looks like the proposed fix was reverted:

Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
method grows beyond 64 KB"

This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.

Maybe this was fixed in some other JIRA ?

On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar  wrote:

> I see a similar issue being resolved recently: https://issues.
> apache.org/jira/browse/SPARK-15285
>
> On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
>
>> Hello folks,
>>
>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>> cryptic error messages:
>>
>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/
>>> spark/sql/catalyst/InternalRow;)I" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
>>> grows beyond 64 KB
>>>
>>
>> Unfortunately I'm not clear on how to even isolate the source of this
>> problem. I didn't have this problem in Spark 1.6.1.
>>
>> Any clues?
>>
>
>
>
> --
> -Dhruve Ashar
>
>


Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Thank you very much sir.
I forgot to mention that two of these Oracle tables are range partitioned. In 
that case, what would be the optimum number of partitions, if you can share?
Warmest 

On Sunday, 14 August 2016, 21:37, Mich Talebzadeh 
 wrote:
 

 If you have primary keys on these tables then you can parallelise the process 
reading data.
You have to be careful not to set the number of partitions too many. Certainly 
there is a balance between the number of partitions supplied to JDBC and the 
load on the network and the source DB.
Assuming that your underlying table has primary key ID, then this will create 
20 parallel processes to Oracle DB
 val d = HiveContext.read.format("jdbc").options(
 Map("url" -> _ORACLEserver,
 "dbtable" -> "(SELECT , , FROM )",
 "partitionColumn" -> "ID",
 "lowerBound" -> "1",
 "upperBound" -> "maxID",
 "numPartitions" -> "20",
 "user" -> _username,
 "password" -> _password)).load
assuming your upper bound on ID is maxID

This will open multiple connections to RDBMS, each getting a subset of data 
that you want.
You need to test it to ensure that you get the numPartitions optimum and you 
don't overload any component.
HTH

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
On 14 August 2016 at 21:15, Ashok Kumar  wrote:

Hi,
There are 4 tables ranging from 10 million to 100 million rows but they all 
have primary keys.
The network is fine but our Oracle is RAC and we can only connect to a 
designated Oracle node (where we have a DQ account only).
We have a limited time window of a few hours to get the required data out.
Thanks 

On Sunday, 14 August 2016, 21:07, Mich Talebzadeh 
 wrote:
 

 How big are your tables and is there any issue with the network between your 
Spark nodes and your Oracle DB that adds to issues?
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
On 14 August 2016 at 20:50, Ashok Kumar  wrote:

Hi Gurus,
I have few large tables in rdbms (ours is Oracle). We want to access these 
tables through Spark JDBC
What is the quickest way of getting data into Spark Dataframe say multiple 
connections from Spark
thanking you





   



  

Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
If you have primary keys on these tables then you can parallelise the
process of reading the data.

You have to be careful not to set the number of partitions too high. There is
certainly a balance between the number of partitions supplied to JDBC and the
load on the network and the source DB.

Assuming that your underlying table has primary key ID, then this will
create 20 parallel processes to Oracle DB

 val d = HiveContext.read.format("jdbc").options(
 Map("url" -> _ORACLEserver,
 "dbtable" -> "(SELECT , , FROM )",
 "partitionColumn" -> "ID",
 "lowerBound" -> "1",
 "upperBound" -> "maxID",
 "numPartitions" -> "20",
 "user" -> _username,
 "password" -> _password)).load

assuming your upper bound on ID is maxID


This will open multiple connections to RDBMS, each getting a subset of data
that you want.

You need to test it to ensure that you get the numPartitions optimum and
you don't overload any component.
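
If you do not know the ID bounds up front, something along these lines (an untested sketch; the table name is a placeholder) can fetch them first and feed them in:

val bounds = HiveContext.read.format("jdbc").options(
 Map("url" -> _ORACLEserver,
 "dbtable" -> "(SELECT MIN(ID) AS minid, MAX(ID) AS maxid FROM mytable)",
 "user" -> _username,
 "password" -> _password)).load.collect()(0)

val d = HiveContext.read.format("jdbc").options(
 Map("url" -> _ORACLEserver,
 "dbtable" -> "mytable",
 "partitionColumn" -> "ID",
 "lowerBound" -> bounds.get(0).toString,
 "upperBound" -> bounds.get(1).toString,
 "numPartitions" -> "20",
 "user" -> _username,
 "password" -> _password)).load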

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 21:15, Ashok Kumar 
wrote:

> Hi,
>
> There are 4 tables ranging from 10 million to 100 million rows but they
> all have primary keys.
>
> The network is fine but our Oracle is RAC and we can only connect to a
> designated Oracle node (where we have a DQ account only).
>
> We have a limited time window of few hours to get the required data out.
>
> Thanks
>
>
> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> How big are your tables and is there any issue with the network between
> your Spark nodes and your Oracle DB that adds to issues?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
> On 14 August 2016 at 20:50, Ashok Kumar 
> wrote:
>
> Hi Gurus,
>
> I have few large tables in rdbms (ours is Oracle). We want to access these
> tables through Spark JDBC
>
> What is the quickest way of getting data into Spark Dataframe say multiple
> connections from Spark
>
> thanking you
>
>
>
>
>
>


Re: parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi,
There are 4 tables ranging from 10 million to 100 million rows but they all 
have primary keys.
The network is fine but our Oracle is RAC and we can only connect to a 
designated Oracle node (where we have a DQ account only).
We have a limited time window of a few hours to get the required data out.
Thanks 

On Sunday, 14 August 2016, 21:07, Mich Talebzadeh 
 wrote:
 

 How big are your tables and is there any issue with the network between your 
Spark nodes and your Oracle DB that adds to issues?
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction.
On 14 August 2016 at 20:50, Ashok Kumar  wrote:

Hi Gurus,
I have few large tables in rdbms (ours is Oracle). We want to access these 
tables through Spark JDBC
What is the quickest way of getting data into Spark Dataframe say multiple 
connections from Spark
thanking you





  

Re: Role-based S3 access outside of EMR

2016-08-14 Thread Steve Loughran

On 29 Jul 2016, at 00:07, Everett Anderson 
> wrote:

Hey,

Just wrapping this up --

I ended up following the 
instructions to build 
a custom Spark release with Hadoop 2.7.2, stealing from Steve's SPARK-7481 PR a 
bit, in order to get Spark 1.6.2 + Hadoop 2.7.2 + the hadoop-aws library (which 
pulls in the proper AWS Java SDK dependency).

Now that there's an official Spark 2.0 + Hadoop 2.7.x release, this is probably 
no longer necessary, but I haven't tried it, yet.


you still need to get the hadoop-aws and compatible JARs into your lib 
dir; the SPARK-7481 patch does that and gets the hadoop-aws JAR into the 
spark-assembly JAR, something which isn't directly relevant for Spark 2.

The PR is still tagged as WiP pending the release of Hadoop 2.7.3, which will 
swallow classload exceptions when enumerating filesystem clients declared in 
JARs ... without that the presence of hadoop-aws or hadoop-azure on the 
classpath *without the matching amazon or azure JARs* will cause startup to 
fail.


With the custom release, s3a paths work fine with EC2 role credentials without 
doing anything special. The only thing I had to do was to add this extra --conf 
flag to spark-submit in order to write to encrypted S3 buckets --

--conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
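
For reference, the same setting can also be applied programmatically (a sketch with a Spark 1.6-style SparkContext; the property is the one passed via --conf above):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-sse"))
// spark.hadoop.* --conf values end up in this same Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")

// s3a:// filesystems created after this point pick up the SSE setting.
val rdd = sc.textFile("s3a://some-encrypted-bucket/some/path/")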


I'd really like to know what performance difference you are seeing working with 
server-side encryption and different file formats; can you do any tests using 
encrypted and unencrypted copies of the same datasets and see how the times 
come out?


Full instructions for building on Mac are here:

1) Download the Spark 1.6.2 source from https://spark.apache.org/downloads.html

2) Install R

brew tap homebrew/science
brew install r

3) Set JAVA_HOME and the MAVEN_OPTS as in the instructions

4) Modify the root pom.xml to add a hadoop-2.7 profile (mostly stolen from 
Spark 2.0)


<profile>
  <id>hadoop-2.7</id>
  <properties>
    <hadoop.version>2.7.2</hadoop.version>
    <jets3t.version>0.9.3</jets3t.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <curator.version>2.6.0</curator.version>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
        <scope>${hadoop.deps.scope}</scope>
        <exclusions>
          <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
          </exclusion>
          <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
          </exclusion>
        </exclusions>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>


5) Modify core/pom.xml to include the corresponding hadoop-aws and AWS SDK libs


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
  </exclusions>
</dependency>


6) Build with

./make-distribution.sh --name custom-hadoop-2.7-2-aws-s3a --tgz -Psparkr 
-Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn





Re: parallel processing with JDBC

2016-08-14 Thread Mich Talebzadeh
How big are your tables and is there any issue with the network between
your Spark nodes and your Oracle DB that adds to issues?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 20:50, Ashok Kumar 
wrote:

> Hi Gurus,
>
> I have few large tables in rdbms (ours is Oracle). We want to access these
> tables through Spark JDBC
>
> What is the quickest way of getting data into Spark Dataframe say multiple
> connections from Spark
>
> thanking you
>
>
>


Re: Change nullable property in Dataset schema

2016-08-14 Thread Jacek Laskowski
On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki  wrote:

>   import testImplicits._
>   test("test") {
> val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
> Array(3, 3)), 1).toDS

You could just do Seq(...).toDS here (no need for sparkContext.parallelize).

> val ds2 = ds1.map(e => e)

Why do you map with e => e? It's the identity function and does nothing.

>   .as(RowEncoder(new StructType()
>  .add("value", ArrayType(IntegerType, false), nullable = false)))

I didn't know that was possible, but it looks like toDF is where you could
replace the schema too (in a less involved way).

I learnt quite a lot from just a single email. Thanks!
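
For completeness, a minimal sketch of that route (assuming a SparkSession named spark with its implicits imported; not the original test code):

import org.apache.spark.sql.types.{ArrayType, IntegerType, StructField, StructType}
import spark.implicits._

val ds1 = Seq(Array(1, 1), Array(2, 2), Array(3, 3)).toDS()

// Rebuild against an explicit schema to tighten nullability; Spark trusts
// this schema, so the data must already satisfy it.
val strictSchema = StructType(Seq(
  StructField("value", ArrayType(IntegerType, containsNull = false), nullable = false)))

val df2 = spark.createDataFrame(ds1.toDF("value").rdd, strictSchema)
df2.printSchema()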

Pozdrawiam,
Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



parallel processing with JDBC

2016-08-14 Thread Ashok Kumar
Hi Gurus,
I have a few large tables in an RDBMS (ours is Oracle). We want to access these 
tables through Spark JDBC.
What is the quickest way of getting the data into a Spark DataFrame, say using 
multiple connections from Spark?
thanking you



Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
All of them should be "provided".
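
A minimal build.sbt sketch of that advice (versions taken from this thread):

scalaVersion := "2.11.7"

// All Spark modules marked "provided": on the compile classpath, but left out
// of the assembled jar because spark-submit supplies them at run time.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"
)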

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 12:26 PM, Mich Talebzadeh
 wrote:
> LOL
>
> well the issue here was the dependencies scripted in that shell script which
> was modified to add "provided" to it.
>
> The script itself still works just the content of one of functions had to be
> edited
>
> function create_sbt_file {
> SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/${FILE_NAME}.sbt
> [ -f ${SBT_FILE} ] && rm -f ${SBT_FILE}
> cat >> $SBT_FILE << !
> name := "scala"
> version := "1.0"
> scalaVersion := "2.11.7"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> "provided"
> .
> .
> !
> }
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 14 August 2016 at 20:17, Jacek Laskowski  wrote:
>>
>> Hi Mich,
>>
>> Yeah, you don't have to worry about it...and that's why you're asking
>> these questions ;-)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Aug 14, 2016 at 12:06 PM, Mich Talebzadeh
>>  wrote:
>> > The magic does all that (including compiling and submitting with the jar
>> > file). It is flexible, as it does all this for any Scala program: it
>> > creates sub-directories, compiles, submits, etc., so I don't have to
>> > worry about it.
>> >
>> > HTH
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> > arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The
>> > author will in no case be liable for any monetary damages arising from
>> > such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 14 August 2016 at 20:01, Jacek Laskowski  wrote:
>> >>
>> >> Hi,
>> >>
>> >> You should have all the deps being "provided" since they're provided
>> >> by spark infra after you spark-submit the uber-jar for the app.
>> >>
>> >> What's the "magic" in local.ksh? Why don't you sbt assembly and do
>> >> spark-submit with the uber-jar?
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> 
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >>
>> >> On Sun, Aug 14, 2016 at 11:52 AM, Mich Talebzadeh
>> >>  wrote:
>> >> > Thanks Jacek,
>> >> >
>> >> > I thought there was some dependency issue. This did the trick
>> >> >
>> >> > libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>> >> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>> >> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
>> >> > "provided"
>> >> >
>> >> > I use a shell script that builds the jar file depending on type (sbt,
>> >> > mvn,
>> >> > assembly)  and submits it via spark-submit ..
>> >> >
>> >> > ./local.ksh -A ETL_scratchpad_dummy -T sbt
>> >> >
>> >> > As I understand "provided" means that the dependencies will be
>> >> > provided
>> >> > at
>> >> > run-time (spark-submit) through the jar files but they are not needed
>> >> > at
>> >> > compile time.
>> >> >
>> >> > Having said that am I correct that error message below
>> >> >
>> >> > [error] bad symbolic reference. A signature in HiveContext.class
>> >> > refers
>> >> > to
>> >> > type Logging
>> >> > [error] in package org.apache.spark which is not available.
>> >> > [error] It may be completely missing from the current classpath, or
>> >> > the
>> >> > version on
>> >> > [error] the classpath might be incompatible with the version used
>> >> > when
>> >> > compiling HiveContext.class.
>> >> > [error] one error found
>> >> > [error] (compile:compileIncremental) Compilation failed
>> >> >
>> >> > meant that some form of libraries incompatibility was 

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
LOL

well the issue here was the dependencies scripted in that shell
script which was modified to add "provided" to it.

The script itself still works just the content of one of functions had to
be edited

function create_sbt_file {
SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/${FILE_NAME}.sbt
[ -f ${SBT_FILE} ] && rm -f ${SBT_FILE}
cat >> $SBT_FILE << !
name := "scala"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
"provided"
.
.
!
}


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 20:17, Jacek Laskowski  wrote:

> Hi Mich,
>
> Yeah, you don't have to worry about it...and that's why you're asking
> these questions ;-)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 14, 2016 at 12:06 PM, Mich Talebzadeh
>  wrote:
> > The magic does all that (including compiling and submitting with the jar
> > file). It is flexible, as it does all this for any Scala program: it creates
> > sub-directories, compiles, submits, etc., so I don't have to worry about it.
> >
> > HTH
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 14 August 2016 at 20:01, Jacek Laskowski  wrote:
> >>
> >> Hi,
> >>
> >> You should have all the deps being "provided" since they're provided
> >> by spark infra after you spark-submit the uber-jar for the app.
> >>
> >> What's the "magic" in local.ksh? Why don't you sbt assembly and do
> >> spark-submit with the uber-jar?
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Sun, Aug 14, 2016 at 11:52 AM, Mich Talebzadeh
> >>  wrote:
> >> > Thanks Jacek,
> >> >
> >> > I thought there was some dependency issue. This did the trick
> >> >
> >> > libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> >> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> >> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> >> > "provided"
> >> >
> >> > I use a shell script that builds the jar file depending on type (sbt,
> >> > mvn,
> >> > assembly)  and submits it via spark-submit ..
> >> >
> >> > ./local.ksh -A ETL_scratchpad_dummy -T sbt
> >> >
> >> > As I understand "provided" means that the dependencies will be
> provided
> >> > at
> >> > run-time (spark-submit) through the jar files but they are not needed
> at
> >> > compile time.
> >> >
> >> > Having said that am I correct that error message below
> >> >
> >> > [error] bad symbolic reference. A signature in HiveContext.class
> refers
> >> > to
> >> > type Logging
> >> > [error] in package org.apache.spark which is not available.
> >> > [error] It may be completely missing from the current classpath, or
> the
> >> > version on
> >> > [error] the classpath might be incompatible with the version used when
> >> > compiling HiveContext.class.
> >> > [error] one error found
> >> > [error] (compile:compileIncremental) Compilation failed
> >> >
> >> > meant that some form of libraries incompatibility was happening at
> >> > compile
> >> > time?
> >> >
> >> > Cheers
> >> >
> >> >
> >> > Dr Mich Talebzadeh
> >> >
> >> >
> >> >
> >> > LinkedIn
> >> >
> >> > https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >
> >> >
> >> >
> >> > http://talebzadehmich.wordpress.com
> >> >
> >> >
> >> > Disclaimer: Use it at your own risk. Any and all responsibility for
> any
> >> > loss, damage or destruction of data or any 

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Hi Mich,

Yeah, you don't have to worry about it...and that's why you're asking
these questions ;-)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 12:06 PM, Mich Talebzadeh
 wrote:
> The magic does all that (including compiling and submitting with the jar
> file). It is flexible, as it does all this for any Scala program: it creates
> sub-directories, compiles, submits, etc., so I don't have to worry about it.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 14 August 2016 at 20:01, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> You should have all the deps being "provided" since they're provided
>> by spark infra after you spark-submit the uber-jar for the app.
>>
>> What's the "magic" in local.ksh? Why don't you sbt assembly and do
>> spark-submit with the uber-jar?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Aug 14, 2016 at 11:52 AM, Mich Talebzadeh
>>  wrote:
>> > Thanks Jacek,
>> >
>> > I thought there was some dependency issue. This did the trick
>> >
>> > libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
>> > "provided"
>> >
>> > I use a shell script that builds the jar file depending on type (sbt,
>> > mvn,
>> > assembly)  and submits it via spark-submit ..
>> >
>> > ./local.ksh -A ETL_scratchpad_dummy -T sbt
>> >
>> > As I understand "provided" means that the dependencies will be provided
>> > at
>> > run-time (spark-submit) through the jar files but they are not needed at
>> > compile time.
>> >
>> > Having said that am I correct that error message below
>> >
>> > [error] bad symbolic reference. A signature in HiveContext.class refers
>> > to
>> > type Logging
>> > [error] in package org.apache.spark which is not available.
>> > [error] It may be completely missing from the current classpath, or the
>> > version on
>> > [error] the classpath might be incompatible with the version used when
>> > compiling HiveContext.class.
>> > [error] one error found
>> > [error] (compile:compileIncremental) Compilation failed
>> >
>> > meant that some form of libraries incompatibility was happening at
>> > compile
>> > time?
>> >
>> > Cheers
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> > arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The
>> > author will in no case be liable for any monetary damages arising from
>> > such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 14 August 2016 at 19:11, Jacek Laskowski  wrote:
>> >>
>> >> Go to spark-shell and do :imports. You'll see all the imports and you
>> >> could copy and paste them in your app. (but there are not many
>> >> honestly and that won't help you much)
>> >>
>> >> HiveContext lives in spark-hive. You don't need spark-sql and
>> >> spark-hive since the latter uses the former as a dependency (unless
>> >> you're using types that come from the other dependencies). You don't
>> >> need spark-core either. Make the dependencies simpler by:
>> >>
>> >> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>> >>
>> >> and mark it % Provided.
>> >>
>> >> The reason for provided is that you don't need that for uber-jar that
>> >> you're going to spark-submit.
>> >>
>> >> Don't forget to reload your session of sbt you're compiling in. Unsure
>> >> how you do it so quit your sbt session and do `sbt compile`.
>> >>
>> >> Ask away if you need more details.
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> 
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >>
>> >> On Sun, Aug 14, 2016 at 9:26 AM, Mich Talebzadeh

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
The magic does all that (including compiling and submitting with the jar
file). It is flexible, as it does all this for any Scala program: it creates
sub-directories, compiles, submits, etc., so I don't have to worry about it.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 20:01, Jacek Laskowski  wrote:

> Hi,
>
> You should have all the deps being "provided" since they're provided
> by spark infra after you spark-submit the uber-jar for the app.
>
> What's the "magic" in local.ksh? Why don't you sbt assembly and do
> spark-submit with the uber-jar?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 14, 2016 at 11:52 AM, Mich Talebzadeh
>  wrote:
> > Thanks Jacek,
> >
> > I thought there was some dependency issue. This did the trick
> >
> > libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> > libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> > libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> > "provided"
> >
> > I use a shell script that builds the jar file depending on type (sbt,
> mvn,
> > assembly)  and submits it via spark-submit ..
> >
> > ./local.ksh -A ETL_scratchpad_dummy -T sbt
> >
> > As I understand "provided" means that the dependencies will be provided
> at
> > run-time (spark-submit) through the jar files but they are not needed at
> > compile time.
> >
> > Having said that am I correct that error message below
> >
> > [error] bad symbolic reference. A signature in HiveContext.class refers
> to
> > type Logging
> > [error] in package org.apache.spark which is not available.
> > [error] It may be completely missing from the current classpath, or the
> > version on
> > [error] the classpath might be incompatible with the version used when
> > compiling HiveContext.class.
> > [error] one error found
> > [error] (compile:compileIncremental) Compilation failed
> >
> > meant that some form of libraries incompatibility was happening at
> compile
> > time?
> >
> > Cheers
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 14 August 2016 at 19:11, Jacek Laskowski  wrote:
> >>
> >> Go to spark-shell and do :imports. You'll see all the imports and you
> >> could copy and paste them in your app. (but there are not many
> >> honestly and that won't help you much)
> >>
> >> HiveContext lives in spark-hive. You don't need spark-sql and
> >> spark-hive since the latter uses the former as a dependency (unless
> >> you're using types that come from the other dependencies). You don't
> >> need spark-core either. Make the dependencies simpler by:
> >>
> >> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
> >>
> >> and mark it % Provided.
> >>
> >> The reason for provided is that you don't need that for uber-jar that
> >> you're going to spark-submit.
> >>
> >> Don't forget to reload your session of sbt you're compiling in. Unsure
> >> how you do it so quit your sbt session and do `sbt compile`.
> >>
> >> Ask away if you need more details.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Sun, Aug 14, 2016 at 9:26 AM, Mich Talebzadeh
> >>  wrote:
> >> > The issue is on Spark shell this works OK
> >> >
> >> > Spark context Web UI available at http://50.140.197.217:5
> >> > Spark context available as 'sc' (master = local, app id =
> >> > local-1471191662017).
> >> > Spark session available as 'spark'.
> >> > Welcome to
> >> >     __
> >> >  / __/__  ___ _/ /__
> >> > _\ \/ _ \/ _ `/ __/  '_/
> >> >/___/ .__/\_,_/_/ 

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Hi,

You should have all the deps being "provided" since they're provided
by spark infra after you spark-submit the uber-jar for the app.

What's the "magic" in local.ksh? Why don't you sbt assembly and do
spark-submit with the uber-jar?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 11:52 AM, Mich Talebzadeh
 wrote:
> Thanks Jacek,
>
> I thought there was some dependency issue. This did the trick
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> "provided"
>
> I use a shell script that builds the jar file depending on type (sbt, mvn,
> assembly)  and submits it via spark-submit ..
>
> ./local.ksh -A ETL_scratchpad_dummy -T sbt
>
> As I understand "provided" means that the dependencies will be provided at
> run-time (spark-submit) through the jar files but they are not needed at
> compile time.
>
> Having said that am I correct that error message below
>
> [error] bad symbolic reference. A signature in HiveContext.class refers to
> type Logging
> [error] in package org.apache.spark which is not available.
> [error] It may be completely missing from the current classpath, or the
> version on
> [error] the classpath might be incompatible with the version used when
> compiling HiveContext.class.
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
> meant that some form of libraries incompatibility was happening at compile
> time?
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 14 August 2016 at 19:11, Jacek Laskowski  wrote:
>>
>> Go to spark-shell and do :imports. You'll see all the imports and you
>> could copy and paste them in your app. (but there are not many
>> honestly and that won't help you much)
>>
>> HiveContext lives in spark-hive. You don't need spark-sql and
>> spark-hive since the latter uses the former as a dependency (unless
>> you're using types that come from the other dependencies). You don't
>> need spark-core either. Make the dependencies simpler by:
>>
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>>
>> and mark it % Provided.
>>
>> The reason for provided is that you don't need that for uber-jar that
>> you're going to spark-submit.
>>
>> Don't forget to reload your session of sbt you're compiling in. Unsure
>> how you do it so quit your sbt session and do `sbt compile`.
>>
>> Ask away if you need more details.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Aug 14, 2016 at 9:26 AM, Mich Talebzadeh
>>  wrote:
>> > The issue is on Spark shell this works OK
>> >
>> > Spark context Web UI available at http://50.140.197.217:5
>> > Spark context available as 'sc' (master = local, app id =
>> > local-1471191662017).
>> > Spark session available as 'spark'.
>> > Welcome to
>> >     __
>> >  / __/__  ___ _/ /__
>> > _\ \/ _ \/ _ `/ __/  '_/
>> >/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>> >   /_/
>> > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>> > 1.8.0_77)
>> > Type in expressions to have them evaluated.
>> > Type :help for more information.
>> > scala> import org.apache.spark.SparkContext
>> > scala> import org.apache.spark.SparkConf
>> > scala> import org.apache.spark.sql.Row
>> > scala> import org.apache.spark.sql.hive.HiveContext
>> > scala> import org.apache.spark.sql.types._
>> > scala> import org.apache.spark.sql.SparkSession
>> > scala> import org.apache.spark.sql.functions._
>> >
>> > The code itself
>> >
>> >
>> > scala>   val conf = new SparkConf().
>> >  |setAppName("ETL_scratchpad_dummy").
>> >  |set("spark.driver.allowMultipleContexts", "true").
>> >  |set("enableHiveSupport","true")
>> > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@33215ffb
>> >
>> > scala>   val sc = new SparkContext(conf)
>> > sc: org.apache.spark.SparkContext =
>> > org.apache.spark.SparkContext@3cbfdf5c
>> >
>> > scala>   val HiveContext = new 

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Thanks Jacek,

I thought there was some dependency issue. This did the trick

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
"provided"

I use a shell script that builds the jar file depending on type (sbt, mvn,
assembly)  and submits it via spark-submit ..

./local.ksh -A ETL_scratchpad_dummy -T sbt

As I understand "provided" means that the dependencies will be provided at
run-time (spark-submit) through the jar files but they are not needed at
compile time.

Having said that, am I correct that the error message below

[error] bad symbolic reference. A signature in HiveContext.class refers to
type Logging
[error] in package org.apache.spark which is not available.
[error] It may be completely missing from the current classpath, or the
version on
[error] the classpath might be incompatible with the version used when
compiling HiveContext.class.
[error] one error found
[error] (compile:compileIncremental) Compilation failed

meant that some form of library incompatibility was happening at compile
time?

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 19:11, Jacek Laskowski  wrote:

> Go to spark-shell and do :imports. You'll see all the imports and you
> could copy and paste them in your app. (but there are not many
> honestly and that won't help you much)
>
> HiveContext lives in spark-hive. You don't need spark-sql and
> spark-hive since the latter uses the former as a dependency (unless
> you're using types that come from the other dependencies). You don't
> need spark-core either. Make the dependencies simpler by:
>
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>
> and mark it % Provided.
>
> The reason for provided is that you don't need that for uber-jar that
> you're going to spark-submit.
>
> Don't forget to reload your session of sbt you're compiling in. Unsure
> how you do it so quit your sbt session and do `sbt compile`.
>
> Ask away if you need more details.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 14, 2016 at 9:26 AM, Mich Talebzadeh
>  wrote:
> > The issue is on Spark shell this works OK
> >
> > Spark context Web UI available at http://50.140.197.217:5
> > Spark context available as 'sc' (master = local, app id =
> > local-1471191662017).
> > Spark session available as 'spark'.
> > Welcome to
> >     __
> >  / __/__  ___ _/ /__
> > _\ \/ _ \/ _ `/ __/  '_/
> >/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
> >   /_/
> > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> > 1.8.0_77)
> > Type in expressions to have them evaluated.
> > Type :help for more information.
> > scala> import org.apache.spark.SparkContext
> > scala> import org.apache.spark.SparkConf
> > scala> import org.apache.spark.sql.Row
> > scala> import org.apache.spark.sql.hive.HiveContext
> > scala> import org.apache.spark.sql.types._
> > scala> import org.apache.spark.sql.SparkSession
> > scala> import org.apache.spark.sql.functions._
> >
> > The code itself
> >
> >
> > scala>   val conf = new SparkConf().
> >  |setAppName("ETL_scratchpad_dummy").
> >  |set("spark.driver.allowMultipleContexts", "true").
> >  |set("enableHiveSupport","true")
> > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@33215ffb
> >
> > scala>   val sc = new SparkContext(conf)
> > sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@
> 3cbfdf5c
> >
> > scala>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> > warning: there was one deprecation warning; re-run with -deprecation for
> > details
> > HiveContext: org.apache.spark.sql.hive.HiveContext =
> > org.apache.spark.sql.hive.HiveContext@2152fde5
> >
> > scala>   HiveContext.sql("use oraclehadoop")
> > res0: org.apache.spark.sql.DataFrame = []
> >
> > I think I am getting something missing here a dependency
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> 

Re:

2016-08-14 Thread Michael Armbrust
You can force a broadcast, but with tables that large it's probably not a
good idea.  However, filtering and then broadcasting one of the joins is
likely to get you the benefits of broadcasting (no shuffle on the larger
table that would colocate all the skewed tuples on a single overloaded
executor) without attempting to broadcast something that's too large.
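
A rough sketch of that two-part approach (df1/df2 and the left-outer join type
are illustrative, not the thread's actual code; untested):

import org.apache.spark.sql.functions.{broadcast, col}

// Part 1: the non-skewed keys go through a normal shuffle join.
val regular = df1.filter(col("id") =!= 0).join(df2, Seq("id"), "left_outer")

// Part 2: only the skewed key; after filtering, the right side should be small
// enough to broadcast, so the big table is not shuffled for this part.
val skewed = df1.filter(col("id") === 0)
  .join(broadcast(df2.filter(col("id") === 0)), Seq("id"), "left_outer")

val result = regular.union(skewed)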

On Sun, Aug 14, 2016 at 11:02 AM, Jacek Laskowski  wrote:

> Hi Michael,
>
> As I understand broadcast joins, Jestin could also use broadcast
> function on a dataset to make it broadcast. Jestin could force the
> brodcast without the trick hoping it's gonna kick off brodcast.
> Correct?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Aug 14, 2016 at 9:51 AM, Michael Armbrust
>  wrote:
> > Have you tried doing the join in two parts (id == 0 and id != 0) and then
> > doing a union of the results?  It is possible that with this technique,
> that
> > the join which only contains skewed data would be filtered enough to
> allow
> > broadcasting of one side.
> >
> > On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
> > wrote:
> >>
> >> Hi, I'm currently trying to perform an outer join between two
> >> DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.
> >>
> >> df1.id is skewed in that there are many 0's, the rest being unique IDs.
> >>
> >> df2.id is not skewed. If I filter df1.id != 0, then the join works
> well.
> >> If I don't, then the join does not complete for a very, very long time.
> >>
> >> I have diagnosed this problem due to the hashpartitioning on IDs,
> >> resulting in one partition containing many values due to data skew. One
> >> executor ends up reading most of the shuffle data, and writing all of
> the
> >> shuffle data, as shown below.
> >>
> >>
> >>
> >>
> >>
> >> Shown above is the task in question assigned to one executor.
> >>
> >>
> >>
> >> This screenshot comes from one of the executors, showing one single
> thread
> >> spilling sort data since the executor cannot hold 90%+ of the ~200 GB
> result
> >> in memory.
> >>
> >> Moreover, looking at the event timeline, I find that the executor on
> that
> >> task spends about 20% time reading shuffle data, 70% computation, and
> 10%
> >> writing output data.
> >>
> >> I have tried the following:
> >>
> >> "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
> >> - This doesn't seem to have an effect since now I have
> hundreds/thousands
> >> of keys with tens of thousands of occurrences.
> >> - Should I increase N? Is there a way to just do random.mod(N) instead
> of
> >> monotonically_increasing_id()?
> >>
> >> Repartitioning according to column I know contains unique values
> >>
> >> - This is overridden by Spark's sort-based shuffle manager which hash
> >> repartitions on the skewed column
> >>
> >> - Is it possible to change this? Or will the join column need to be
> hashed
> >> and partitioned on for joins to work
> >>
> >> Broadcasting does not work for my large tables
> >>
> >> Increasing/decreasing spark.sql.shuffle.partitions does not remedy the
> >> skewed data problem as 0-product values are still being hashed to the
> same
> >> partition.
> >>
> >>
> >> --
> >>
> >> What I am considering currently is doing the join at the RDD level, but
> is
> >> there any level of control which can solve my skewed data problem? Other
> >> than that, see the bolded question.
> >>
> >> I would appreciate any suggestions/tips/experience with this. Thank you!
> >>
> >
>


Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Jacek Laskowski
Go to spark-shell and do :imports. You'll see all the imports and you
could copy and paste them in your app. (but there are not many
honestly and that won't help you much)

HiveContext lives in spark-hive. You don't need both spark-sql and
spark-hive, since the latter pulls in the former as a dependency (unless
you're using types that come from the other dependencies). You don't
need spark-core either. Make the dependencies simpler by:

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"

and mark it % Provided.
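
For example, the combined line would look something like this (a sketch):

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"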

The reason for "provided" is that you don't need those classes inside the
uber-jar that you're going to spark-submit.

Don't forget to reload the sbt session you're compiling in. Unsure how you
do it, so quit your sbt session and run `sbt compile`.

Ask away if you need more details.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 9:26 AM, Mich Talebzadeh
 wrote:
> The issue is on Spark shell this works OK
>
> Spark context Web UI available at http://50.140.197.217:5
> Spark context available as 'sc' (master = local, app id =
> local-1471191662017).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_77)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.apache.spark.SparkContext
> scala> import org.apache.spark.SparkConf
> scala> import org.apache.spark.sql.Row
> scala> import org.apache.spark.sql.hive.HiveContext
> scala> import org.apache.spark.sql.types._
> scala> import org.apache.spark.sql.SparkSession
> scala> import org.apache.spark.sql.functions._
>
> The code itself
>
>
> scala>   val conf = new SparkConf().
>  |setAppName("ETL_scratchpad_dummy").
>  |set("spark.driver.allowMultipleContexts", "true").
>  |set("enableHiveSupport","true")
> conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@33215ffb
>
> scala>   val sc = new SparkContext(conf)
> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@3cbfdf5c
>
> scala>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> warning: there was one deprecation warning; re-run with -deprecation for
> details
> HiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@2152fde5
>
> scala>   HiveContext.sql("use oraclehadoop")
> res0: org.apache.spark.sql.DataFrame = []
>
> I think I am getting something missing here a dependency
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 14 August 2016 at 17:16, Koert Kuipers  wrote:
>>
>> HiveContext is gone
>>
>> SparkSession now combines functionality of SqlContext and HiveContext (if
>> hive support is available)
>>
>> On Sun, Aug 14, 2016 at 12:12 PM, Mich Talebzadeh
>>  wrote:
>>>
>>> Thanks Koert,
>>>
>>> I did that before as well. Anyway this is dependencies
>>>
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>>>
>>>
>>> and the error
>>>
>>>
>>> [info] Compiling 1 Scala source to
>>> /data6/hduser/scala/ETL_scratchpad_dummy/target/scala-2.10/classes...
>>> [error]
>>> /data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:4:
>>> object hive is not a member of package org.apache.spark.sql
>>> [error] import org.apache.spark.sql.hive.HiveContext
>>> [error] ^
>>> [error]
>>> /data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:20:
>>> object hive is not a member of package org.apache.spark.sql
>>> [error]   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's 

Re:

2016-08-14 Thread Jacek Laskowski
Hi Michael,

As I understand broadcast joins, Jestin could also use the broadcast
function on a Dataset to mark it for broadcasting. Jestin could force the
broadcast that way, without the two-part trick, hoping it kicks in.
Correct?
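
(For reference, a minimal sketch of that hint, assuming df2 is the side small
enough to ship to every executor:)

import org.apache.spark.sql.functions.broadcast
val joined = df1.join(broadcast(df2), Seq("id"))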

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 9:51 AM, Michael Armbrust
 wrote:
> Have you tried doing the join in two parts (id == 0 and id != 0) and then
> doing a union of the results?  It is possible that with this technique, that
> the join which only contains skewed data would be filtered enough to allow
> broadcasting of one side.
>
> On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
> wrote:
>>
>> Hi, I'm currently trying to perform an outer join between two
>> DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.
>>
>> df1.id is skewed in that there are many 0's, the rest being unique IDs.
>>
>> df2.id is not skewed. If I filter df1.id != 0, then the join works well.
>> If I don't, then the join does not complete for a very, very long time.
>>
>> I have diagnosed this problem due to the hashpartitioning on IDs,
>> resulting in one partition containing many values due to data skew. One
>> executor ends up reading most of the shuffle data, and writing all of the
>> shuffle data, as shown below.
>>
>>
>>
>>
>>
>> Shown above is the task in question assigned to one executor.
>>
>>
>>
>> This screenshot comes from one of the executors, showing one single thread
>> spilling sort data since the executor cannot hold 90%+ of the ~200 GB result
>> in memory.
>>
>> Moreover, looking at the event timeline, I find that the executor on that
>> task spends about 20% time reading shuffle data, 70% computation, and 10%
>> writing output data.
>>
>> I have tried the following:
>>
>> "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
>> - This doesn't seem to have an effect since now I have hundreds/thousands
>> of keys with tens of thousands of occurrences.
>> - Should I increase N? Is there a way to just do random.mod(N) instead of
>> monotonically_increasing_id()?
>>
>> Repartitioning according to column I know contains unique values
>>
>> - This is overridden by Spark's sort-based shuffle manager which hash
>> repartitions on the skewed column
>>
>> - Is it possible to change this? Or will the join column need to be hashed
>> and partitioned on for joins to work
>>
>> Broadcasting does not work for my large tables
>>
>> Increasing/decreasing spark.sql.shuffle.partitions does not remedy the
>> skewed data problem as 0-product values are still being hashed to the same
>> partition.
>>
>>
>> --
>>
>> What I am considering currently is doing the join at the RDD level, but is
>> there any level of control which can solve my skewed data problem? Other
>> than that, see the bolded question.
>>
>> I would appreciate any suggestions/tips/experience with this. Thank you!
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Jacek Laskowski
Thanks Michael for a prompt response! All you said make sense (glad to
have received it from the most trusted source!)

spark.read.format("michael").option("header", true).write("notes.adoc")

:-)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Sun, Aug 14, 2016 at 9:36 AM, Michael Armbrust
 wrote:
> There are two type systems in play here.  Spark SQL's and Scala's.
>
> From the Scala side, this is type-safe.  After calling as[String]the Dataset
> will only return Strings. It is impossible to ever get a class cast
> exception unless you do your own incorrect casting after the fact.
>
> Underneath the covers, calling as[String] will cause Spark SQL to implicitly
> insert an "upcast".  An upcast will automatically perform safe (lossless)
> casts (i.e., Int -> Long, Number -> String).  In the case where there is no
> safe conversion, we'll throw an AnalysisException and require you to
> explicitly do the conversion.  This upcasting happens when you specify a
> primitive type or when you specify a more complicated class that is mapping
> multiple columns to fields.
>
> On Sat, Aug 13, 2016 at 1:17 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Just ran into it and can't explain why it works. Please help me understand
>> it.
>>
>> Q1: Why can I `as[String]` with Ints? Is this type safe?
>>
>> scala> (0 to 9).toDF("num").as[String]
>> res12: org.apache.spark.sql.Dataset[String] = [num: int]
>>
>> Q2: Why can I map over strings even though there are really ints?
>>
>> scala> (0 to 9).toDF("num").as[String].map(_.toUpperCase)
>> res11: org.apache.spark.sql.Dataset[String] = [value: string]
>>
>> Why are the two lines possible?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: call a mysql stored procedure from spark

2016-08-14 Thread Mich Talebzadeh
Hi,

The link deals with JDBC and states:

[image: Inline images 1]

So it is only SQL. It lacks support for stored procedures that return a
result set.

This is on an Oracle table

scala>  var _ORACLEserver = "jdbc:oracle:thin:@rhes564:1521:mydb12"
_ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
scala>  var _username = "scratchpad"
_username: String = scratchpad
scala> var _password = "xxx"
_password: String = oracle

scala> val s = HiveContext.read.format("jdbc").options(
 | Map("url" -> _ORACLEserver,
 | *"dbtable" -> "exec weights_sp",*
 | "user" -> _username,
 | "password" -> _password)).load
java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist


and that stored procedure exists in Oracle

scratch...@mydb12.mich.LOCAL> desc weights_sp
PROCEDURE weights_sp


HTH






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 17:42, Michael Armbrust  wrote:

> As described here
> ,
> you can use the DataSource API to connect to an external database using
> JDBC.  While the dbtable option is usually just a table name, it can also
> be any valid SQL command that returns a table when enclosed in
> (parentheses).  I'm not certain, but I'd expect you could use this feature
> to invoke a stored procedure and return the results as a DataFrame.
>
> On Sat, Aug 13, 2016 at 10:40 AM, sujeet jog  wrote:
>
>> Hi,
>>
>> Is there a way to call a stored procedure using spark ?
>>
>>
>> thanks,
>> Sujeet
>>
>
>


Re:

2016-08-14 Thread Michael Armbrust
Have you tried doing the join in two parts (id == 0 and id != 0) and then
doing a union of the results?  It is possible that, with this technique, the
join which only contains the skewed data would be filtered enough to allow
broadcasting of one side.

On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
wrote:

> Hi, I'm currently trying to perform an outer join between two
> DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.
>
> df1.id is skewed in that there are many 0's, the rest being unique IDs.
>
> df2.id is not skewed. If I filter df1.id != 0, then the join works well.
> If I don't, then the join does not complete for a very, very long time.
>
> I have diagnosed this problem due to the hashpartitioning on IDs,
> resulting in one partition containing many values due to data skew. One
> executor ends up reading most of the shuffle data, and writing all of the
> shuffle data, as shown below.
>
>
>
>
>
> Shown above is the task in question assigned to one executor.
>
>
>
> This screenshot comes from one of the executors, showing one single thread
> spilling sort data since the executor cannot hold 90%+ of the ~200 GB
> result in memory.
>
> Moreover, looking at the event timeline, I find that the executor on that
> task spends about 20% time reading shuffle data, 70% computation, and 10%
> writing output data.
>
> I have tried the following:
>
>
>- "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
>- - This doesn't seem to have an effect since now I have
>hundreds/thousands of keys with tens of thousands of occurrences.
>- - Should I increase N? Is there a way to just do random.mod(N)
>instead of monotonically_increasing_id()?
>-
>- Repartitioning according to column I know contains unique values
>-
>- - This is overridden by Spark's sort-based shuffle manager which
>hash repartitions on the skewed column
>-
>- - Is it possible to change this? Or will the join column need to be
>hashed and partitioned on for joins to work
>-
>- Broadcasting does not work for my large tables
>-
>- Increasing/decreasing spark.sql.shuffle.partitions does not remedy
>the skewed data problem as 0-product values are still being hashed to the
>same partition.
>
>
> --
>
> What I am considering currently is doing the join at the RDD level, but is
> there any level of control which can solve my skewed data problem? Other
> than that, see the bolded question.
>
> I would appreciate any suggestions/tips/experience with this. Thank you!
>
>


Re: call a mysql stored procedure from spark

2016-08-14 Thread Michael Armbrust
As described here,
you can use the DataSource API to connect to an external database using
JDBC.  While the dbtable option is usually just a table name, it can also
be any valid SQL command that returns a table when enclosed in
(parentheses).  I'm not certain, but I'd expect you could use this feature
to invoke a stored procedure and return the results as a DataFrame.
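
A sketch of the parenthesized-query form (spark is the Spark 2 session; the
query, the alias and the connection settings are made up, and whether a stored
procedure call works inside the parentheses depends on the JDBC driver):

val df = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "(select * from some_table where id > 10) t")
  .option("user", user)
  .option("password", password)
  .load()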

On Sat, Aug 13, 2016 at 10:40 AM, sujeet jog  wrote:

> Hi,
>
> Is there a way to call a stored procedure using spark ?
>
>
> thanks,
> Sujeet
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Michael Armbrust
Anytime you see JaninoRuntimeException you are seeing a bug in our code
generation.  If you can come up with a small example that causes the
problem it would be very helpful if you could open a JIRA.

On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar  wrote:

> I see a similar issue being resolved recently: https://issues.
> apache.org/jira/browse/SPARK-15285
>
> On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
>
>> Hello folks,
>>
>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>> cryptic error messages:
>>
>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/
>>> spark/sql/catalyst/InternalRow;)I" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
>>> grows beyond 64 KB
>>>
>>
>> Unfortunately I'm not clear on how to even isolate the source of this
>> problem. I didn't have this problem in Spark 1.6.1.
>>
>> Any clues?
>>
>
>
>
> --
> -Dhruve Ashar
>
>


Re: [SQL] Why does (0 to 9).toDF("num").as[String] work?

2016-08-14 Thread Michael Armbrust
There are two type systems in play here.  Spark SQL's and Scala's.

From the Scala side, this is type-safe.  After calling as[String] the
Dataset will only return Strings. It is impossible to ever get a class cast
exception unless you do your own incorrect casting after the fact.

Underneath the covers, calling as[String] will cause Spark SQL to
implicitly insert an "upcast".  An upcast will automatically perform safe
(lossless) casts (i.e., Int -> Long, Number -> String).  In the case where
there is no safe conversion, we'll throw an AnalysisException and require
you to explicitly do the conversion.  This upcasting happens when you
specify a primitive type or when you specify a more complicated class that
is mapping multiple columns to fields.
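
Two lines illustrating both directions in spark-shell (a sketch; the exact
error message may differ):

(0 to 9).toDF("num").as[Long]     // Int -> Long is lossless, so the upcast is inserted silently
Seq(1.5, 2.5).toDF("num").as[Int] // Double -> Int may truncate, so expect an AnalysisException asking for an explicit cast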

On Sat, Aug 13, 2016 at 1:17 PM, Jacek Laskowski  wrote:

> Hi,
>
> Just ran into it and can't explain why it works. Please help me understand
> it.
>
> Q1: Why can I `as[String]` with Ints? Is this type safe?
>
> scala> (0 to 9).toDF("num").as[String]
> res12: org.apache.spark.sql.Dataset[String] = [num: int]
>
> Q2: Why can I map over strings even though there are really ints?
>
> scala> (0 to 9).toDF("num").as[String].map(_.toUpperCase)
> res11: org.apache.spark.sql.Dataset[String] = [value: string]
>
> Why are the two lines possible?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Does Spark SQL support indexes?

2016-08-14 Thread Michael Armbrust
Using df.write.partitionBy is similar to a coarse-grained, clustered index
in a traditional database.  You can't use it on temporary tables, but it
will let you efficiently select small parts of a much larger table.
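
For example (hypothetical columns and path, just to illustrate the layout):

df.write.partitionBy("year", "month").parquet("/tmp/events")
// A later read that filters on the partition columns only scans the matching directories:
spark.read.parquet("/tmp/events").where("year = 2016 AND month = 8")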

On Sat, Aug 13, 2016 at 11:13 PM, Jörn Franke  wrote:

> Use a format that has built-in indexes, such as Parquet or Orc. Do not
> forget to sort the data on the columns that your filter on.
>
> On 14 Aug 2016, at 05:03, Taotao.Li  wrote:
>
>
> hi, guys, does Spark SQL support indexes?  if so, how can I create an
> index on my temp table? if not, how can I handle some specific queries on a
> very large table? it would iterate all the table even though all I want is
> just a small piece of that table.
>
> great thanks,
>
>
> *___*
> Quant | Engineer | Boy
> *___*
> *blog*:http://litaotao.github.io
> 
> *github*: www.github.com/litaotao
>
>
>


Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
The issue is on Spark shell this works OK

Spark context Web UI available at http://50.140.197.217:5
Spark context available as 'sc' (master = local, app id =
local-1471191662017).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.SparkContext
scala> import org.apache.spark.SparkConf
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.hive.HiveContext
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.functions._

The code itself






scala>   val conf = new SparkConf().
     |setAppName("ETL_scratchpad_dummy").
     |set("spark.driver.allowMultipleContexts", "true").
     |set("enableHiveSupport","true")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@33215ffb

scala>   val sc = new SparkContext(conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@3cbfdf5c

scala>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
HiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2152fde5

scala>   HiveContext.sql("use oraclehadoop")
res0: org.apache.spark.sql.DataFrame = []

I think I am missing a dependency here.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 17:16, Koert Kuipers  wrote:

> HiveContext is gone
>
> SparkSession now combines functionality of SqlContext and HiveContext (if
> hive support is available)
>
> On Sun, Aug 14, 2016 at 12:12 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Koert,
>>
>> I did that before as well. Anyway this is dependencies
>>
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>>
>>
>> and the error
>>
>>
>> [info] Compiling 1 Scala source to /data6/hduser/scala/ETL_scratc
>> hpad_dummy/target/scala-2.10/classes...
>> [error] 
>> /data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:4:
>> object hive is not a member of package org.apache.spark.sql
>> [error] import org.apache.spark.sql.hive.HiveContext
>> [error] ^
>> [error] 
>> /data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:20:
>> object hive is not a member of package org.apache.spark.sql
>> [error]   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 August 2016 at 17:00, Koert Kuipers  wrote:
>>
>>> you cannot mix spark 1 and spark 2 jars
>>>
>>> change this
>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>>> to
>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>>>
>>> On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 In Spark 2 I am using sbt or mvn to compile my scala program. This used
 to compile and run perfectly with Spark 1.6.1 but now it is throwing error


 I believe the problem is here. I have

 name := "scala"
 version := "1.0"
 scalaVersion := "2.11.7"
 libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
 libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
 libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"


Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
HiveContext is gone

SparkSession now combines functionality of SqlContext and HiveContext (if
hive support is available)
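
A sketch of the Spark 2 equivalent of the code in this thread, using
SparkSession instead of HiveContext:

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("ETL_scratchpad_dummy")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("use oraclehadoop")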

On Sun, Aug 14, 2016 at 12:12 PM, Mich Talebzadeh  wrote:

> Thanks Koert,
>
> I did that before as well. Anyway this is dependencies
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>
>
> and the error
>
>
> [info] Compiling 1 Scala source to /data6/hduser/scala/ETL_
> scratchpad_dummy/target/scala-2.10/classes...
> [error] /data6/hduser/scala/ETL_scratchpad_dummy/src/main/
> scala/ETL_scratchpad_dummy.scala:4: object hive is not a member of
> package org.apache.spark.sql
> [error] import org.apache.spark.sql.hive.HiveContext
> [error] ^
> [error] /data6/hduser/scala/ETL_scratchpad_dummy/src/main/
> scala/ETL_scratchpad_dummy.scala:20: object hive is not a member of
> package org.apache.spark.sql
> [error]   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 August 2016 at 17:00, Koert Kuipers  wrote:
>
>> you cannot mix spark 1 and spark 2 jars
>>
>> change this
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>> to
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>>
>> On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In Spark 2 I am using sbt or mvn to compile my scala program. This used
>>> to compile and run perfectly with Spark 1.6.1 but now it is throwing error
>>>
>>>
>>> I believe the problem is here. I have
>>>
>>> name := "scala"
>>> version := "1.0"
>>> scalaVersion := "2.11.7"
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>>>
>>> However the error I am getting is
>>>
>>> [error] bad symbolic reference. A signature in HiveContext.class refers
>>> to type Logging
>>> [error] in package org.apache.spark which is not available.
>>> [error] It may be completely missing from the current classpath, or the
>>> version on
>>> [error] the classpath might be incompatible with the version used when
>>> compiling HiveContext.class.
>>> [error] one error found
>>> [error] (compile:compileIncremental) Compilation failed
>>>
>>>
>>> And this is the code
>>>
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.hive.HiveContext
>>> import org.apache.spark.sql.types._
>>> import org.apache.spark.sql.SparkSession
>>> import org.apache.spark.sql.functions._
>>> object ETL_scratchpad_dummy {
>>>   def main(args: Array[String]) {
>>>   val conf = new SparkConf().
>>>setAppName("ETL_scratchpad_dummy").
>>>set("spark.driver.allowMultipleContexts", "true").
>>>set("enableHiveSupport","true")
>>>   val sc = new SparkContext(conf)
>>>   //import sqlContext.implicits._
>>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>   HiveContext.sql("use oraclehadoop")
>>>
>>>
>>> Anyone has come across this
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>


Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Thanks Koert,

I did that before as well. Anyway this is dependencies

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"


and the error


[info] Compiling 1 Scala source to
/data6/hduser/scala/ETL_scratchpad_dummy/target/scala-2.10/classes...
[error]
/data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:4:
object hive is not a member of package org.apache.spark.sql
[error] import org.apache.spark.sql.hive.HiveContext
[error] ^
[error]
/data6/hduser/scala/ETL_scratchpad_dummy/src/main/scala/ETL_scratchpad_dummy.scala:20:
object hive is not a member of package org.apache.spark.sql
[error]   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 17:00, Koert Kuipers  wrote:

> you cannot mix spark 1 and spark 2 jars
>
> change this
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
> to
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
>
> On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> In Spark 2 I am using sbt or mvn to compile my scala program. This used
>> to compile and run perfectly with Spark 1.6.1 but now it is throwing error
>>
>>
>> I believe the problem is here. I have
>>
>> name := "scala"
>> version := "1.0"
>> scalaVersion := "2.11.7"
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>>
>> However the error I am getting is
>>
>> [error] bad symbolic reference. A signature in HiveContext.class refers
>> to type Logging
>> [error] in package org.apache.spark which is not available.
>> [error] It may be completely missing from the current classpath, or the
>> version on
>> [error] the classpath might be incompatible with the version used when
>> compiling HiveContext.class.
>> [error] one error found
>> [error] (compile:compileIncremental) Compilation failed
>>
>>
>> And this is the code
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkConf
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions._
>> object ETL_scratchpad_dummy {
>>   def main(args: Array[String]) {
>>   val conf = new SparkConf().
>>setAppName("ETL_scratchpad_dummy").
>>set("spark.driver.allowMultipleContexts", "true").
>>set("enableHiveSupport","true")
>>   val sc = new SparkContext(conf)
>>   //import sqlContext.implicits._
>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   HiveContext.sql("use oraclehadoop")
>>
>>
>> Anyone has come across this
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
you cannot mix spark 1 and spark 2 jars

change this
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
to
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"

On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh  wrote:

> Hi,
>
> In Spark 2 I am using sbt or mvn to compile my scala program. This used to
> compile and run perfectly with Spark 1.6.1 but now it is throwing error
>
>
> I believe the problem is here. I have
>
> name := "scala"
> version := "1.0"
> scalaVersion := "2.11.7"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
>
> However the error I am getting is
>
> [error] bad symbolic reference. A signature in HiveContext.class refers to
> type Logging
> [error] in package org.apache.spark which is not available.
> [error] It may be completely missing from the current classpath, or the
> version on
> [error] the classpath might be incompatible with the version used when
> compiling HiveContext.class.
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
>
>
> And this is the code
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().
>setAppName("ETL_scratchpad_dummy").
>set("spark.driver.allowMultipleContexts", "true").
>set("enableHiveSupport","true")
>   val sc = new SparkContext(conf)
>   //import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   HiveContext.sql("use oraclehadoop")
>
>
> Anyone has come across this
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Issue with compiling Scala with Spark 2

2016-08-14 Thread Mich Talebzadeh
Hi,

In Spark 2 I am using sbt or mvn to compile my scala program. This used to
compile and run perfectly with Spark 1.6.1 but now it is throwing error


I believe the problem is here. I have

name := "scala"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

However the error I am getting is

[error] bad symbolic reference. A signature in HiveContext.class refers to
type Logging
[error] in package org.apache.spark which is not available.
[error] It may be completely missing from the current classpath, or the
version on
[error] the classpath might be incompatible with the version used when
compiling HiveContext.class.
[error] one error found
[error] (compile:compileIncremental) Compilation failed


And this is the code

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object ETL_scratchpad_dummy {
  def main(args: Array[String]) {
  val conf = new SparkConf().
   setAppName("ETL_scratchpad_dummy").
   set("spark.driver.allowMultipleContexts", "true").
   set("enableHiveSupport","true")
  val sc = new SparkContext(conf)
  //import sqlContext.implicits._
  val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  HiveContext.sql("use oraclehadoop")


Anyone has come across this



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Flattening XML in a DataFrame

2016-08-14 Thread Sreekanth Jella
Hi Hyukjin Kwon,

Thank you for the reply.

There are several types of XML documents with different schemas that need to
be parsed, and the tag names are not known in advance. All we know is the XSD
for the given XML.

Is it possible to get the same results even when we do not know the XML tags
like manager.id and manager.name, or is it possible to read the tag names from
the XSD and use them?

Thanks, 
Sreekanth
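
One way to avoid hard-coding column names is to derive them from the inferred
schema itself. A rough, untested sketch (struct fields only; array fields such
as subordinates.clerk would still need an explode step):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Recursively collect dotted column paths, descending into nested structs.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.flatMap {
    case StructField(name, st: StructType, _, _) => flattenSchema(st, s"$prefix$name.")
    case StructField(name, _, _, _)              => Seq(s"$prefix$name")
  }

def flatten(df: DataFrame): DataFrame =
  df.select(flattenSchema(df.schema).map(c => col(c).as(c.replace('.', '_'))): _*)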

 

On Aug 12, 2016 9:58 PM, "Hyukjin Kwon"  > wrote:

Hi Sreekanth,

 

Assuming you are using Spark 1.x,

 

I believe this code below:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml")
  .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")
  .show()

would print the results below as you want:

+---++---+-+
| id|name|cid|cname|
+---++---+-+
|  1| foo|  1|  foo|
|  1| foo|  1|  foo|
+---++---+-+

​

 

I hope this is helpful.

 

Thanks!

 

 

 

 

2016-08-13 9:33 GMT+09:00 Sreekanth Jella  >:

Hi Folks,

 

I am trying to flatten a variety of XMLs using DataFrames. I'm using the spark-xml
package which is automatically inferring my schema and creating a DataFrame. 

 

I do not want to hard-code any column names in the DataFrame, as I have many
varieties of XML documents and each might have much deeper nesting of child nodes. I
simply want to flatten any type of XML and then write the output data to a Hive
table. Can you please give some expert advice on this?

 

Example XML and expected output is given below.

 

Sample XML:

<emp>
  <manager>
    <id>1</id>
    <name>foo</name>
    <subordinates>
      <clerk>
        <cid>1</cid>
        <cname>foo</cname>
      </clerk>
      <clerk>
        <cid>1</cid>
        <cname>foo</cname>
      </clerk>
    </subordinates>
  </manager>
</emp>

Expected output:

id, name, clerk.cid, clerk.cname

1, foo, 2, cname2

1, foo, 3, cname3

 

Thanks,

Sreekanth Jella

 

 



Re:

2016-08-14 Thread Mich Talebzadeh
Hi Jestin,

You already have the skewed column in the join condition, correct?

This is basically what you are doing, assuming rs is your result set below:

val rs = df1.join(df2,df1("id")===df2("id"), "fullouter")

What is the percentage of df1.id = 0?

Can you register both tables as temporary and use UNION ALL to join both
parts (df1.id !=0 UNION ALL df1.id ===0)?
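
A rough sketch of that idea with temporary views (illustrative only; shown with
an inner join, so the join type would need adjusting to match the real query):

df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

val rs = spark.sql("""
  SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.id != 0
  UNION ALL
  SELECT t1.*, t2.* FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.id = 0
""")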


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 15:14, Jestin Ma  wrote:

> Hi Mich, do you mean using the skewed column as a join condition? I tried
> repartition(skewed column, unique column) but had no success, possibly
> because the join was still hash-partitioning on just the skewed column
> after I called repartition.
>
> On Sun, Aug 14, 2016 at 1:49 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Can you make the join more selective by using the skewed column ID  +
>> another column that has valid unique vales( Repartitioning according to
>> column I know contains unique values)?
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 August 2016 at 07:17, Jestin Ma  wrote:
>>
>>> Attached are screenshots mentioned, apologies for that.
>>>
>>> On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
>>> wrote:
>>>
 Hi, I'm currently trying to perform an outer join between two
 DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.

 df1.id is skewed in that there are many 0's, the rest being unique IDs.

 df2.id is not skewed. If I filter df1.id != 0, then the join works
 well. If I don't, then the join does not complete for a very, very long
 time.

 I have diagnosed this problem due to the hashpartitioning on IDs,
 resulting in one partition containing many values due to data skew. One
 executor ends up reading most of the shuffle data, and writing all of the
 shuffle data, as shown below.





 Shown above is the task in question assigned to one executor.



 This screenshot comes from one of the executors, showing one single
 thread spilling sort data since the executor cannot hold 90%+ of the ~200
 GB result in memory.

 Moreover, looking at the event timeline, I find that the executor on
 that task spends about 20% time reading shuffle data, 70% computation, and
 10% writing output data.

 I have tried the following:


- "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
- - This doesn't seem to have an effect since now I have
hundreds/thousands of keys with tens of thousands of occurrences.
- - Should I increase N? Is there a way to just do random.mod(N)
instead of monotonically_increasing_id()?
-
- Repartitioning according to column I know contains unique values
-
- - This is overridden by Spark's sort-based shuffle manager which
hash repartitions on the skewed column
-
- - Is it possible to change this? Or will the join column need to
be hashed and partitioned on for joins to work
-
- Broadcasting does not work for my large tables
-
- Increasing/decreasing spark.sql.shuffle.partitions does not
remedy the skewed data problem as 0-product values are still being 
 hashed
to the same partition.


 --

 What I am considering currently is doing the join at the RDD level, but
 is there any level of control which can solve my skewed data problem? Other
 than that, see the bolded question.

 I would appreciate any suggestions/tips/experience with this. Thank you!


>>>
>>>
>>> -

Re:

2016-08-14 Thread Jestin Ma
Hi Mich, do you mean using the skewed column as a join condition? I tried
repartition(skewed column, unique column) but had no success, possibly
because the join was still hash-partitioning on just the skewed column
after I called repartition.
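
(For what it's worth, salting with a random value rather than
monotonically_increasing_id could look roughly like the sketch below; the names
and N are made up and this is untested:)

import org.apache.spark.sql.functions.{array, col, explode, lit, rand, when}

val N = 20  // number of salt buckets for the skewed key; arbitrary

// Salt only the skewed id = 0 rows on the big side...
val df1Salted = df1.withColumn("salt",
  when(col("id") === 0, (rand() * N).cast("int")).otherwise(lit(0)))

// ...and replicate the potentially matching rows on the other side across all salt values.
val df2Salted = df2.withColumn("salt",
  explode(when(col("id") === 0, array((0 until N).map(i => lit(i)): _*)).otherwise(array(lit(0)))))

val joined = df1Salted.join(df2Salted, Seq("id", "salt"))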

On Sun, Aug 14, 2016 at 1:49 AM, Mich Talebzadeh 
wrote:

> Can you make the join more selective by using the skewed column ID  +
> another column that has valid unique vales( Repartitioning according to
> column I know contains unique values)?
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 August 2016 at 07:17, Jestin Ma  wrote:
>
>> Attached are screenshots mentioned, apologies for that.
>>
>> On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
>> wrote:
>>
>>> Hi, I'm currently trying to perform an outer join between two
>>> DataFrames/Sets, one is ~150GB, one is about ~50 GB on a column, id.
>>>
>>> df1.id is skewed in that there are many 0's, the rest being unique IDs.
>>>
>>> df2.id is not skewed. If I filter df1.id != 0, then the join works
>>> well. If I don't, then the join does not complete for a very, very long
>>> time.
>>>
>>> I have diagnosed this problem due to the hashpartitioning on IDs,
>>> resulting in one partition containing many values due to data skew. One
>>> executor ends up reading most of the shuffle data, and writing all of the
>>> shuffle data, as shown below.
>>>
>>>
>>>
>>>
>>>
>>> Shown above is the task in question assigned to one executor.
>>>
>>>
>>>
>>> This screenshot comes from one of the executors, showing one single
>>> thread spilling sort data since the executor cannot hold 90%+ of the ~200
>>> GB result in memory.
>>>
>>> Moreover, looking at the event timeline, I find that the executor on
>>> that task spends about 20% time reading shuffle data, 70% computation, and
>>> 10% writing output data.
>>>
>>> I have tried the following:
>>>
>>>
>>>- "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
>>>- - This doesn't seem to have an effect since now I have
>>>hundreds/thousands of keys with tens of thousands of occurrences.
>>>- - Should I increase N? Is there a way to just do random.mod(N)
>>>instead of monotonically_increasing_id()?
>>>-
>>>- Repartitioning according to column I know contains unique values
>>>-
>>>- - This is overridden by Spark's sort-based shuffle manager which
>>>hash repartitions on the skewed column
>>>-
>>>- - Is it possible to change this? Or will the join column need to
>>>be hashed and partitioned on for joins to work
>>>-
>>>- Broadcasting does not work for my large tables
>>>-
>>>- Increasing/decreasing spark.sql.shuffle.partitions does not remedy
>>>the skewed data problem as 0-product values are still being hashed to the
>>>same partition.
>>>
>>>
>>> --
>>>
>>> What I am considering currently is doing the join at the RDD level, but
>>> is there any level of control which can solve my skewed data problem? Other
>>> than that, see the bolded question.
>>>
>>> I would appreciate any suggestions/tips/experience with this. Thank you!
>>>
>>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>


Re: Using spark package XGBoost

2016-08-14 Thread janardhan shetty
Any leads on how to achieve this?
On Aug 12, 2016 6:33 PM, "janardhan shetty"  wrote:

> I tried using the sparkxgboost package in my build.sbt file but it failed.
> Spark 2.0
> Scala 2.11.8
>
> Error:
>  [warn]   http://dl.bintray.com/spark-packages/maven/rotationsymmetry/sparkxgboost/0.2.1-s_2.10/sparkxgboost-0.2.1-s_2.10-javadoc.jar
>  [warn] ::
>  [warn] ::  FAILED DOWNLOADS ::
>  [warn] :: ^ see resolution messages for details  ^ ::
>  [warn] ::
>  [warn] :: rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(src)
>  [warn] :: rotationsymmetry#sparkxgboost;0.2.1-s_2.10!sparkxgboost.jar(doc)
>
> build.sbt:
>
> scalaVersion := "2.11.8"
>
> libraryDependencies ++= {
>   val sparkVersion = "2.0.0-preview"
>   Seq(
> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
>   )
> }
>
>
>
> resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
>
> libraryDependencies += "rotationsymmetry" % "sparkxgboost" % "0.2.1-s_2.10"
>
> assemblyMergeStrategy in assembly := {
>   case PathList("META-INF", "MANIFEST.MF")           => MergeStrategy.discard
>   case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
>   case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
>   case "application.conf"                            => MergeStrategy.concat
>   case "unwanted.txt"                                => MergeStrategy.discard
>   case x =>
>     val oldStrategy = (assemblyMergeStrategy in assembly).value
>     oldStrategy(x)
> }
>
>
>
>
> On Fri, Aug 12, 2016 at 3:35 PM, janardhan shetty 
> wrote:
>
>> Is there a dataframe version of XGBoost in spark-ml ?.
>> Has anyone used sparkxgboost package ?
>>
>
>


Re: mesos or kubernetes ?

2016-08-14 Thread Gurvinder Singh
On 08/13/2016 08:24 PM, guyoh wrote:
> My company is trying to decide whether to use Kubernetes or Mesos. Since we
> are planning to use Spark in the near future, I was wondering what the best
> choice for us would be.
> Thanks, 
> Guy
> 
Both Kubernetes and Mesos enable you to share your infrastructure with
other workloads. On K8s, Spark mainly runs in standalone mode; to get
failure recovery you can use the options described here:
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
I have tested filesystem-based recovery on K8s and it works fine, as K8s
restarts the master within a few seconds. On K8s you can also scale your
cluster dynamically, since K8s supports horizontal pod autoscaling
(http://blog.kubernetes.io/2016/03/using-Spark-and-Zeppelin-to-process-Big-Data-on-Kubernetes.html).
That lets you spin down your Spark cluster when it is not in use, run other
workloads in docker/rkt containers on the same resources, and scale the
Spark cluster back up when needed, given you have free resources. You don't
need static partitioning for your Spark cluster when running on K8s, because
Spark runs in containers and shares the same underlying resources with
everything else.
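
For reference, the filesystem recovery mode mentioned above only needs two
properties set on the standalone master. A rough sketch of spark-env.sh (the
recovery directory path is a placeholder; see the standalone HA page linked
above for the details):

  # spark-env.sh on the standalone master
  SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
    -Dspark.deploy.recoveryDirectory=/path/to/recovery/dir"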

Soonish (https://github.com/apache/spark/pull/13950) you will also be able to
use authentication based on GitHub, Google, Facebook, etc. for the Spark master
UI when deployed in standalone mode, e.g. on K8s, and expose only the master UI
as the entry point to all the other UIs (worker logs, application UI).

As usual, the choice is yours; I just wanted to give you the info from the K8s side.

- Gurvinder
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/mesos-or-kubernetes-tp27530.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread Mich Talebzadeh
There are two distinct parts here: optimisation and execution.

Spark does not have a Cost Based Optimizer (CBO) yet but that does not
matter for now.

When we do such an operation, say a full outer join between the (s) and (t)
DataFrames below, we see:

 scala> val rs = s.join(t,s("time_id")===t("time_id"), "fullouter")
rs: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID:
timestamp ... 3 more fields]

But the optimizer already knows the access path at the DataFrame level itself,
before the DF is converted to an RDD:

scala> val rs = s.join(t,s("time_id")===t("time_id"), "fullouter").explain

== Physical Plan ==
SortMergeJoin [time_id#5], [time_id#40], FullOuter
:- *Sort [time_id#5 ASC], false, 0
:  +- Exchange hashpartitioning(time_id#5, 200)
: +- HiveTableScan [AMOUNT_SOLD#9, TIME_ID#5, CHANNEL_ID#6L],
MetastoreRelation oraclehadoop, sales
+- *Sort [time_id#40 ASC], false, 0
   +- Exchange hashpartitioning(time_id#40, 200)
  +- HiveTableScan [TIME_ID#40, CALENDAR_MONTH_DESC#50],
MetastoreRelation oraclehadoop, times
rs: Unit = ()
scala> val rs = s.join(t,s("time_id")===t("time_id"), "fullouter").rdd
rs: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] =
MapPartitionsRDD[1310] at rdd at <console>:27
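
A quick way to see the same thing from the shell is to print the lineage of the
RDD we just got back; it should show that the RDD is generated from the
optimized physical plan above rather than from a hand-built join over the raw
tables (output omitted here):

scala> rs.toDebugString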


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 09:56, ayan guha  wrote:

> I do not think so. From what I understand, Spark will still use Catalyst to
> plan the join. A DF always has an RDD underneath, but that does not mean any
> action will force a less optimal path.
>
> On Sun, Aug 14, 2016 at 3:04 PM, mayur bhole 
> wrote:
>
>> Hi All,
>>
>> Let's say we have
>>
>> val df = bigTableA.join(bigTableB,bigTableA("A")===bigTableB("A"),"left")
>> val rddFromDF = df.rdd
>> println(rddFromDF.count)
>>
>> My understanding is that Spark will convert all the DataFrame operations
>> before "rddFromDF.count" into equivalent RDD operations, as we are not
>> performing any action on the DataFrame directly. In that case, Spark will
>> not be using the optimization engine. Is my assumption right? Please point
>> me to the right resources.
>>
>> [ Note: I have posted the same question on SO: http://stackoverflow.com/questions/38889812/how-spark-dataframe-optimization-engine-works-with-dag ]
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: How Spark sql query optimisation work if we are using .rdd action ?

2016-08-14 Thread ayan guha
I do not think so. From what I understand, Spark will still use Catalyst to
plan the join. A DF always has an RDD underneath, but that does not mean any
action will force a less optimal path.

On Sun, Aug 14, 2016 at 3:04 PM, mayur bhole 
wrote:

> Hi All,
>
> Let's say we have
>
> val df = bigTableA.join(bigTableB,bigTableA("A")===bigTableB("A"),"left")
> val rddFromDF = df.rdd
> println(rddFromDF.count)
>
> My understanding is that Spark will convert all the DataFrame operations
> before "rddFromDF.count" into equivalent RDD operations, as we are not
> performing any action on the DataFrame directly. In that case, Spark will
> not be using the optimization engine. Is my assumption right? Please point
> me to the right resources.
>
> [ Note: I have posted the same question on SO: http://stackoverflow.com/questions/38889812/how-spark-dataframe-optimization-engine-works-with-dag ]
>
> Thanks
>



-- 
Best Regards,
Ayan Guha


Re:

2016-08-14 Thread Mich Talebzadeh
Can you make the join more selective by using the skewed column ID plus
another column that has valid unique values (i.e. repartitioning according to
a column you know contains unique values)?
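
A rough sketch of that idea (illustration only -- df1, df2 and other_col are
hypothetical names, and it assumes both DataFrames really share such a second
column):

  // joining on a composite key means the shuffle hashes on both columns,
  // so the id = 0 rows no longer all land in a single partition
  val joined = df1.join(df2,
    df1("id") === df2("id") && df1("other_col") === df2("other_col"),
    "fullouter")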


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 August 2016 at 07:17, Jestin Ma  wrote:

> Attached are screenshots mentioned, apologies for that.
>
> On Sat, Aug 13, 2016 at 11:15 PM, Jestin Ma 
> wrote:
>
>> Hi, I'm currently trying to perform an outer join between two
>> DataFrames/Datasets (one ~150 GB, the other ~50 GB) on a column, id.
>>
>> df1.id is skewed in that there are many 0's, the rest being unique IDs.
>>
>> df2.id is not skewed. If I filter df1.id != 0, then the join works well.
>> If I don't, then the join does not complete for a very, very long time.
>>
>> I have diagnosed this problem as being due to the hash partitioning on IDs,
>> which results in one partition containing many values because of the data
>> skew. One executor ends up reading most of the shuffle data, and writing all
>> of the shuffle data, as shown below.
>>
>>
>>
>>
>>
>> Shown above is the task in question assigned to one executor.
>>
>>
>>
>> This screenshot comes from one of the executors, showing one single
>> thread spilling sort data since the executor cannot hold 90%+ of the ~200
>> GB result in memory.
>>
>> Moreover, looking at the event timeline, I find that the executor on that
>> task spends about 20% time reading shuffle data, 70% computation, and 10%
>> writing output data.
>>
>> I have tried the following:
>>
>>
>>    - "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
>>      - This doesn't seem to have an effect since now I have
>>        hundreds/thousands of keys with tens of thousands of occurrences.
>>      - Should I increase N? Is there a way to just do random.mod(N)
>>        instead of monotonically_increasing_id()?
>>    - Repartitioning according to a column I know contains unique values
>>      - This is overridden by Spark's sort-based shuffle manager, which
>>        hash repartitions on the skewed column.
>>      - Is it possible to change this? Or will the join column need to be
>>        hashed and partitioned on for joins to work?
>>    - Broadcasting does not work for my large tables.
>>    - Increasing/decreasing spark.sql.shuffle.partitions does not remedy
>>      the skewed data problem, as 0-product values are still being hashed
>>      to the same partition.
>>
>>
>> --
>>
>> What I am considering currently is doing the join at the RDD level, but
>> is there any level of control which can solve my skewed data problem? Other
>> than that, see the bolded question.
>>
>> I would appreciate any suggestions/tips/experience with this. Thank you!
>>
>>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


[no subject]

2016-08-14 Thread Jestin Ma
Hi, I'm currently trying to perform an outer join between two
DataFrames/Datasets (one ~150 GB, the other ~50 GB) on a column, id.

df1.id is skewed in that there are many 0's, the rest being unique IDs.

df2.id is not skewed. If I filter df1.id != 0, then the join works well. If
I don't, then the join does not complete for a very, very long time.

I have diagnosed this problem as being due to the hash partitioning on IDs,
which results in one partition containing many values because of the data skew.
One executor ends up reading most of the shuffle data, and writing all of the
shuffle data, as shown below.





Shown above is the task in question assigned to one executor.



This screenshot comes from one of the executors, showing one single thread
spilling sort data since the executor cannot hold 90%+ of the ~200 GB
result in memory.

Moreover, looking at the event timeline, I find that the executor on that
task spends about 20% time reading shuffle data, 70% computation, and 10%
writing output data.

I have tried the following:


   - "Salting" the 0-value keys by monotonically_increasing_id().mod(N)
   - - This doesn't seem to have an effect since now I have
   hundreds/thousands of keys with tens of thousands of occurrences.
   - - Should I increase N? Is there a way to just do random.mod(N) instead
   of monotonically_increasing_id()?
   -
   - Repartitioning according to column I know contains unique values
   -
   - - This is overridden by Spark's sort-based shuffle manager which hash
   repartitions on the skewed column
   -
   - - Is it possible to change this? Or will the join column need to be
   hashed and partitioned on for joins to work
   -
   - Broadcasting does not work for my large tables
   -
   - Increasing/decreasing spark.sql.shuffle.partitions does not remedy the
   skewed data problem as 0-product values are still being hashed to the same
   partition.
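
One related workaround (not among those tried above): since the join completes
when df1.id != 0 is filtered out, the id = 0 slice can be joined separately and
the results unioned back together. A rough sketch, assuming df1/df2 are the
DataFrames described above and that id is never null:

   import org.apache.spark.sql.functions.{broadcast, col}

   // split both sides on the single skewed key
   val df1Skewed = df1.filter(col("id") === 0)
   val df1Rest   = df1.filter(col("id") =!= 0)
   val df2Skewed = df2.filter(col("id") === 0)
   val df2Rest   = df2.filter(col("id") =!= 0)

   // the non-skewed part joins as before
   val joinedRest = df1Rest.join(df2Rest, Seq("id"), "fullouter")

   // df2Skewed should be tiny (df2.id is not skewed), so broadcast it; a left
   // join is equivalent to a full outer join for this slice as long as df1
   // really contains id = 0 rows, since every df2 row with id = 0 then matches
   val joinedSkewed = df1Skewed.join(broadcast(df2Skewed), Seq("id"), "left")

   val result = joinedRest.union(joinedSkewed)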


--

What I am considering currently is doing the join at the RDD level, but is
there any level of control which can solve my skewed data problem? Other
than that, see the bolded question.

I would appreciate any suggestions/tips/experience with this. Thank you!


Re: Does Spark SQL support indexes?

2016-08-14 Thread Jörn Franke
Use a format that has built-in indexes, such as Parquet or ORC. Do not forget
to sort the data on the columns that you filter on.
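
A minimal sketch of that in Spark (the path and column name are hypothetical;
sorting before the write keeps each file's min/max statistics tight so that
filters on that column can skip data, and the same idea applies to ORC):

  // write the data sorted on the column used in filters
  df.sort("customer_id").write.parquet("/data/events_parquet")

  // later reads can push the filter down to the Parquet scan
  val hits = spark.read.parquet("/data/events_parquet").filter("customer_id = 42")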

> On 14 Aug 2016, at 05:03, Taotao.Li  wrote:
> 
> 
> Hi guys, does Spark SQL support indexes? If so, how can I create an index
> on my temp table? If not, how can I handle some specific queries on a very
> large table? Otherwise it would scan the whole table even though all I want
> is just a small piece of it.
> 
> great thanks, 
> 
> 
> ___
> Quant | Engineer | Boy
> ___
> blog:http://litaotao.github.io
> github: www.github.com/litaotao
> 
>