Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread ayan guha
I kind of think "That's the whole point" :) Sorry, it is Friday here :) :)

On Fri, Jul 7, 2017 at 1:09 PM, Michael Armbrust wrote:

> It goes through the same optimization pipeline. More in this video.
>
> On Thu, Jul 6, 2017 at 5:28 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I am wondering: if I pass a raw SQL string to a DataFrame, do I still get
>> the Spark SQL optimizations? Why or why not?
>>
>> Thanks!
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread Michael Armbrust
It goes through the same optimization pipeline. More in this video.
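For illustration, a minimal sketch (hypothetical table and column names, assuming a Spark 2.x SparkSession) showing that the raw SQL string and the DataFrame API end up in the same Catalyst pipeline:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-check").getOrCreate()
import spark.implicits._

val people = Seq((1, "a"), (2, "b")).toDF("id", "name")
people.createOrReplaceTempView("people")

// The same query expressed as a raw SQL string and through the DataFrame API.
val viaSql = spark.sql("SELECT id FROM people WHERE id > 1")
val viaApi = people.filter($"id" > 1).select("id")

// Both print the same analyzed/optimized/physical plans produced by Catalyst.
viaSql.explain(true)
viaApi.explain(true)

Comparing the two explain(true) outputs is an easy way to confirm the optimizations apply either way.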

On Thu, Jul 6, 2017 at 5:28 PM, kant kodali  wrote:

> Hi All,
>
> I am wondering: if I pass a raw SQL string to a DataFrame, do I still get
> the Spark SQL optimizations? Why or why not?
>
> Thanks!
>


RE: Do we have anything for Deep Learning in Spark?

2017-07-06 Thread hosur narahari
Thank you.

Best Regards,
Hari

On 7 Jul 2017 3:59 a.m., "Roope Astala"  wrote:

> You can use an attached GPU VM for DNN training, and do other processing
> on regular CPU nodes. You can even deallocate the GPU VM to save costs when
> not using it. The GPU branch has instructions on how to set up such a
> compute environment:
>
>
>
> https://github.com/Azure/mmlspark/tree/gpu#gpu-vm-setup
>
>
>
> Cheers,
>
> Roope – Microsoft Cloud AI Team
>
>
>
> *From:* hosur narahari [mailto:hnr1...@gmail.com]
> *Sent:* Thursday, July 6, 2017 1:54 AM
> *To:* Gaurav1809 
> *Cc:* user 
> *Subject:* Re: Do we have anything for Deep Learning in Spark?
>
>
>
> Hi Roope,
>
>
>
> Does this mmlspark project use GPGPU for processing, or just CPU cores,
> since DL models are computationally very intensive?
>
>
>
> Best Regards,
>
> Hari
>
>
>
> On 6 Jul 2017 9:33 a.m., "Gaurav1809"  wrote:
>
> Thanks Roope for the inputs.
>
>
>
> On Wed, Jul 5, 2017 at 11:41 PM, Roope [via Apache Spark User List] wrote:
>
> Microsoft Machine Learning Library for Apache Spark lets you run CNTK deep
> learning models on Spark.
>
> https://github.com/Azure/mmlspark
> 
>
> The library APIs are focused on image processing scenarios, and are
> compatible with SparkML Pipelines.
>
> Cheers,
> Roope - Microsoft Cloud AI Team


If I pass raw SQL string to dataframe do I still get the Spark SQL optimizations?

2017-07-06 Thread kant kodali
Hi All,

I am wondering: if I pass a raw SQL string to a DataFrame, do I still get the
Spark SQL optimizations? Why or why not?

Thanks!


GraphQL to Spark SQL

2017-07-06 Thread kant kodali
Hi All,

I wonder if anyone has experience exposing a Spark SQL interface through
GraphQL? The main benefit I see is that we could send Spark SQL queries
over REST, so clients can express their own transformations. I understand
the final outcome is probably the same as what one would achieve with a BI
tool or the Databricks UI, but I tend to think a lot of people like REST,
and GraphQL is very expressive and could enable a lot of people.

Please provide some feedback

Thanks


Spark 2.0.2 - JdbcRelationProvider does not allow create table as select

2017-07-06 Thread Kanagha Kumar
Hi,

I'm running Spark 2.0.2 and I'm noticing an issue with
DataFrameWriter.save():

Code:

// Write the Dataset to Phoenix over JDBC, overwriting the target table.
ds.write().format("jdbc").mode("overwrite").options(ImmutableMap.of(
    "driver", "org.apache.phoenix.jdbc.PhoenixDriver",
    "url", urlWithTenant,
    "dbtable", "tableName")).save();


I found this was reported in prior Spark versions but seems to have been
fixed in the 2.0.x line. Please let me know if this issue still exists with
version 2.0.2.
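In case it helps while investigating (full error below), a hedged Scala sketch of the older DataFrameWriter.jdbc() entry point, which takes a different code path than format("jdbc").save() in 2.0.x. The driver/URL/table values just mirror the ones above, and this is not verified against Phoenix:

import java.util.Properties

// Assumes the same ds / urlWithTenant values as in the snippet above.
val props = new Properties()
props.setProperty("driver", "org.apache.phoenix.jdbc.PhoenixDriver")

ds.write
  .mode("overwrite") // overwrite drops and recreates the target table through JDBC
  .jdbc(urlWithTenant, "tableName", props)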

17/07/06 15:53:12 ERROR ApplicationMaster: User class threw exception:
java.lang.RuntimeException:
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does
not allow create table as select.
java.lang.RuntimeException:
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does
not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:530)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at TenantPhoenix.main(TenantPhoenix.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/07/06 15:53:12 INFO ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.lang.RuntimeException:
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider does
not allow create table as select.)
17/07/06 15:53:12 INFO SparkContext: Invoking stop() from shutdown hook
17/07/06 15:53:12 INFO SparkUI: Stopped Spark web UI at
http://10.3.9.95:50461
17/07/06 15:53:12 INFO YarnAllocator: Driver requested a total number of 0
executor(s).
17/07/06 15:53:12 INFO YarnClusterSchedulerBackend: Shutting down all
executors
17/07/06 15:53:12 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each
executor to shut down
17/07/06 15:53:12 INFO SchedulerExtensionServices: Stopping
SchedulerExtensionServices

Thanks


RE: Do we have anything for Deep Learning in Spark?

2017-07-06 Thread Roope Astala
You can use an attached GPU VM for DNN training, and do other processing on
regular CPU nodes. You can even deallocate the GPU VM to save costs when not
using it. The GPU branch has instructions on how to set up such a compute
environment:

https://github.com/Azure/mmlspark/tree/gpu#gpu-vm-setup

Cheers,
Roope – Microsoft Cloud AI Team

From: hosur narahari [mailto:hnr1...@gmail.com]
Sent: Thursday, July 6, 2017 1:54 AM
To: Gaurav1809 
Cc: user 
Subject: Re: Do we have anything for Deep Learning in Spark?

Hi Roope,

Does this mmlspark project use GPGPU for processing, or just CPU cores,
since DL models are computationally very intensive?

Best Regards,
Hari

On 6 Jul 2017 9:33 a.m., "Gaurav1809" wrote:
Thanks Roope for the inputs.

On Wed, Jul 5, 2017 at 11:41 PM, Roope [via Apache Spark User List] wrote:
Microsoft Machine Learning Library for Apache Spark lets you run CNTK deep 
learning models on Spark.

https://github.com/Azure/mmlspark

The library APIs are focused on image processing scenarios, and are compatible 
with SparkML Pipelines.

Cheers,
Roope - Microsoft Cloud AI Team



Re: Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread jeff saremi
EXCEPT is not the same as EXCEPT ALL.

Had they implemented EXCEPT ALL in Spark SQL, one could have easily obtained
EXCEPT by adding a distinct() to the results.
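For what it's worth, a rough sketch of emulating EXCEPT ALL semantics (each row keeps multiplicity max(countInA - countInB, 0)) on the DataFrame API; written against the Spark 2.x API, with df1/df2 as placeholder frames sharing one schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def exceptAll(df1: DataFrame, df2: DataFrame): DataFrame = {
  val keyCols = df1.columns.toSeq

  // Count duplicates on each side.
  val aCounts = df1.groupBy(keyCols.map(col): _*).agg(count(lit(1)).as("a_cnt"))
  val bCounts = df2.groupBy(keyCols.map(col): _*).agg(count(lit(1)).as("b_cnt"))

  // A row survives EXCEPT ALL (a_cnt - b_cnt) times, never fewer than zero.
  val surplus = aCounts
    .join(bCounts, keyCols, "left_outer")
    .withColumn("surplus", col("a_cnt") - coalesce(col("b_cnt"), lit(0L)))
    .filter(col("surplus") > 0)

  // Re-expand each surviving row to its remaining multiplicity.
  val replicate = udf((n: Int) => Seq.fill(n)(1))
  surplus
    .withColumn("dup", explode(replicate(col("surplus").cast("int"))))
    .select(keyCols.map(col): _*)
}

A distinct() on top of that result (or the plain except) collapses it back to the set version, which is the relationship described above.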



From: hareesh makam 
Sent: Thursday, July 6, 2017 12:48:18 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Is there "EXCEPT ALL" in Spark SQL?

There is Except in DataFrame API.

df1.except(df2)

Same can be used in SQL as well.


public DataFrame except(DataFrame other)

Returns a new DataFrame containing rows in this frame but not in another
frame. This is equivalent to EXCEPT in SQL.


-Hareesh


On 6 July 2017 at 12:22, jeff saremi wrote:

I tried this query in 1.6 and it failed:


SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2



Exception in thread "main" java.lang.RuntimeException: [1.32] failure: ``('' 
expected but `all' found


thanks

Jeff



Logging in Spark streaming application

2017-07-06 Thread anna stax
Do I need to include the log4j dependencies in the pom.xml of my Spark
streaming application, or are they already included in the Spark libraries?

I am running Spark in standalone mode on AWS EC2.
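For what it's worth, a minimal sketch assuming the log4j 1.2 / slf4j jars that Spark itself ships with on the cluster, so nothing beyond the usual provided-scope Spark dependencies is declared:

import org.apache.log4j.Logger

object StreamingJob {
  // The usual pattern: mark loggers @transient lazy so they are not pulled
  // into serialized closures.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    log.info("driver-side message, written to the driver's log4j output")
    // Log lines emitted inside DStream/RDD operations run on the executors and
    // show up in each executor's stderr under the standalone worker's work/ dir.
  }
}

If the defaults are too chatty, one common approach is shipping a custom log4j.properties with --files plus -Dlog4j.configuration in the driver/executor extraJavaOptions.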

Thanks


Re: Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread hareesh makam
There is Except in DataFrame API.

df1.except(df2)

Same can be used in SQL as well.

public DataFrame except(DataFrame other)

Returns a new DataFrame containing rows in this frame but not in another
frame. This is equivalent to EXCEPT in SQL.


-Hareesh


On 6 July 2017 at 12:22, jeff saremi  wrote:

> I tried this query in 1.6 and it failed:
>
> SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2
>
>
> Exception in thread "main" java.lang.RuntimeException: [1.32] failure:
> ``('' expected but `all' found
>
>
> thanks
>
> Jeff
>


Re: Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread upendra 1991
To add to it, is there any specific documentation or reference where we could
check which SQL functions and features are available in Spark SQL for a
specific version?

Thanks,
Upendra

On Thu, Jul 6, 2017 at 2:22 PM, jeff saremi wrote:

I tried this query in 1.6 and it failed:

SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2

Exception in thread "main" java.lang.RuntimeException: [1.32] failure: ``(''
expected but `all' found

thanks

Jeff


Structured Streaming: consumerGroupId

2017-07-06 Thread aravias
Hi,
Is there a way to get the consumerGroupId assigned to a structured
streaming application when it's consuming from Kafka?

regards
Aravind






Is there "EXCEPT ALL" in Spark SQL?

2017-07-06 Thread jeff saremi
I tried this query in 1.6 and it failed:


SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2



Exception in thread "main" java.lang.RuntimeException: [1.32] failure: ``('' 
expected but `all' found


thanks

Jeff


Unsubscribe

2017-07-06 Thread Kun Liu
Unsubscribe


Partitions cached by updateStateByKey never seem to get evicted

2017-07-06 Thread SRK
Hi,

We use updateStateByKey in our Spark streaming application. The partitions
cached by updateStateByKey do not seem to be getting evicted. They were
getting evicted fine with spark.cleaner.ttl in 1.5.1. I am facing issues
with partitions not getting evicted with stateful streaming after the Spark
2.1 upgrade.

Any idea as to why this happens?
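In case it is useful while this gets investigated, a minimal sketch of how state is normally bounded with updateStateByKey: a key's state is only dropped when the update function returns None. The timeout value and types here are made up, and this does not by itself explain the cached-partition behaviour above:

case class TimedValue(total: Long, lastSeenMs: Long)

def updateFunc(newValues: Seq[Long], state: Option[TimedValue]): Option[TimedValue] = {
  val now = System.currentTimeMillis()
  if (newValues.nonEmpty) {
    Some(TimedValue(state.map(_.total).getOrElse(0L) + newValues.sum, now))
  } else if (state.exists(s => now - s.lastSeenMs > 3600 * 1000L)) {
    None // returning None removes the key (and its state) from the state RDD
  } else {
    state
  }
}

// usage on a hypothetical DStream[(String, Long)] called events:
//   val counts = events.updateStateByKey(updateFunc _)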


Thanks!






Re: Spark | Window Function |

2017-07-06 Thread Julien CHAMP
Thx a lot for your answer Radhwane :)


I have some (many) use cases with such a need for Long values in window
functions. As said in the bug report, I can store events in ms in a
dataframe, and want to count the number of events in the past 10 years
(requiring a Long value).

-> Let's imagine that this window is used on timestamp values in ms: I can
ask for a window with a range between [-216000L, 0] and only have a few
values inside, not necessarily 216000L. I can understand the limitation for
the rowsBetween() method, but the rangeBetween() method is nice for this
kind of usage.


The solution with a self join seems nice, but 2 questions:

- regarding performance, will it be as fast as a window function?

- can I use my own aggregate function (for example a geometric mean) with
your solution? (using this:
https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html ; a rough
sketch follows below)
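Regarding the second question, a rough sketch of a custom geometric-mean aggregate, assuming the Spark UserDefinedAggregateFunction API from the Databricks link above. It can be plugged into the groupBy/agg of the self-join approach, though whether that matches window-function performance is exactly the open question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class GeometricMean extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("count", LongType) :: StructField("sumLog", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0.0
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + 1L
      buffer(1) = buffer.getDouble(1) + math.log(input.getDouble(0))
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  // Geometric mean = exp(mean(log(x))); NaN when the group is empty.
  def evaluate(buffer: Row): Double =
    if (buffer.getLong(0) == 0L) Double.NaN
    else math.exp(buffer.getDouble(1) / buffer.getLong(0))
}

Usage would look like df.groupBy($"df1.id", $"df1.timestamp").agg(new GeometricMean()($"df2.value")).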



Thanks again,

Regards,


Julien



On Wed, Jul 5, 2017 at 7:18 PM, Radhwane Chebaane wrote:

> Hi Julien,
>
>
> Although this is a strange bug in Spark, it's rare to need more than
> Integer max value size for a window.
>
> Nevertheless, most of the window functions can be expressed with
> self-joins. Hence, your problem may be solved with this example:
>
> If the input data is as follows:
>
> +---+---------+-----+
> | id|timestamp|value|
> +---+---------+-----+
> |  B|        1|  100|
> |  B|    10010|   50|
> |  B|    10020|  200|
> |  B|    25000|  500|
> +---+---------+-----+
>
> And the window is (-20L, 0)
>
> Then this code will give the wanted result:
>
> df.as("df1").join(df.as("df2"),
>   $"df2.timestamp" between($"df1.timestamp" - 20L, $"df1.timestamp"))
>   .groupBy($"df1.id", $"df1.timestamp", $"df1.value")
>   .agg( functions.min($"df2.value").as("min___value"))
>   .orderBy($"df1.timestamp")
>   .show()
>
> +---+---------+-----+-----------+
> | id|timestamp|value|min___value|
> +---+---------+-----+-----------+
> |  B|        1|  100|        100|
> |  B|    10010|   50|         50|
> |  B|    10020|  200|         50|
> |  B|    25000|  500|        500|
> +---+---------+-----+-----------+
>
> Or by SparkSQL:
>
> SELECT c.id as id, c.timestamp as timestamp, c.value, min(c._value) as 
> min___value FROM
> (
>   SELECT a.id as id, a.timestamp as timestamp, a.value as value, b.timestamp 
> as _timestamp, b.value as _value
>   FROM df a CROSS JOIN df b
>   ON b.timestamp >= a.timestamp - 20L and b.timestamp <= a.timestamp
> ) c
> GROUP BY c.id, c.timestamp, c.value ORDER BY c.timestamp
>
>
> This must also be possible on Spark Streaming; however, don't expect high
> performance.
>
>
> Cheers,
> Radhwane
>
>
>
> 2017-07-05 10:41 GMT+02:00 Julien CHAMP :
>
>> Hi there !
>>
>> Let me explain my problem to see if you have a good solution to help me :)
>>
>> Let's imagine that I have all my data in a DB or a file, that I load in a
>> dataframe DF with the following columns :
>> *id | timestamp(ms) | value*
>> A | 100 |  100
>> A | 110 |  50
>> B | 100 |  100
>> B | 110 |  50
>> B | 120 |  200
>> B | 250 |  500
>> C | 100 |  200
>> C | 110 |  500
>>
>> The timestamp is a long value, so as to be able to express dates in ms
>> from -01-01 to today!
>>
>> I want to compute operations such as min, max, and average on the value
>> column, for a given window function, and grouped by id (bonus: if
>> possible, for only some timestamps...)
>>
>> For example if I have 3 tuples :
>>
>> id | timestamp(ms) | value
>> B | 100 |  100
>> B | 110 |  50
>> B | 120 |  200
>> B | 250 |  500
>>
>> I would like to be able to compute the min value for a time window of
>> 20. This would result in such a DF:
>>
>> id | timestamp(ms) | value | min___value
>> B | 100 |  100 | 100
>> B | 110 |  50  | 50
>> B | 120 |  200 | 50
>> B | 250 |  500 | 500
>>
>> This seems like the perfect use case for a window function in Spark (cf:
>> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
>> )
>> I can use :
>>
>> val tw = Window.orderBy("timestamp").partitionBy("id").rangeBetween(-20, 0)
>> df.withColumn("min___value", min(df.col("value")).over(tw))
>>
>> This leads to the perfect answer !
>>
>> However, there is a big bug with window functions as reported here (
>> https://issues.apache.org/jira/browse/SPARK-19451 ) when working with
>> Long values !!! So I can't use this
>>
>> So my question is ( of course ) how can I resolve my problem ?
>> If I use spark streaming I will face the same issue ?
>>
>> I'll be glad to discuss this problem with you, feel free to answer :)
>>
>> Regards,
>>
>> Julien
>> --
>>
>>
>> Julien CHAMP — Data Scientist
>>
>>
>> Web: www.tellmeplus.com — Email: jch...@tellmeplus.com
>> Phone: 06 89 35 01 89 — LinkedIn:

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-06 Thread Steve Loughran

On 5 Jul 2017, at 14:40, Vadim Semenov wrote:

Are you sure that you use S3A?
Because EMR says that they do not support S3A

https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/
> Amazon EMR does not currently support use of the Apache Hadoop S3A file 
> system.

I think that the HEAD requests come from the `createBucketIfNotExists` in the 
AWS S3 library that checks if the bucket exists every time you do a PUT 
request, i.e. creates a HEAD request.

You can disable that by setting `fs.s3.buckets.create.enabled` to `false`
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-upload-s3.html
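(For reference, a hedged sketch of one way such a Hadoop property can be passed through Spark, assuming the EMRFS property name above applies to the setup in question:)

// spark.hadoop.* settings are copied into the Hadoop Configuration used by the job.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.hadoop.fs.s3.buckets.create.enabled", "false")
  .getOrCreate()

// or equivalently at submit time:
//   spark-submit --conf spark.hadoop.fs.s3.buckets.create.enabled=false ...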



Yeah, I'd like to see the stack traces before blaming S3A and the ASF codebase

One thing I do know is that the shipping S3A client doesn't have any explicit 
handling of 503/retry events. I know that: 
https://issues.apache.org/jira/browse/HADOOP-14531

There is some retry logic in bits of the AWS SDK related to file upload: that
may log and retry, but for all the operations listing files, getting their
details, etc., there is no resilience to throttling.

If it is surfacing against s3a, there isn't anything which can immediately be
done to fix it, other than "spread your data around more buckets". Do attach
the stack trace you get under
https://issues.apache.org/jira/browse/HADOOP-14381 though: I'm about half-way
through the resilience code (& the fault injection needed to test it). The
more places where I can see problems arise, the more confident I can be that
those codepaths will be resilient.


On Thu, Jun 29, 2017 at 4:56 PM, Everett Anderson wrote:
Hi,

We're using Spark 2.0.2 + Hadoop 2.7.3 on AWS EMR with S3A for direct I/O 
from/to S3 from our Spark jobs. We set 
mapreduce.fileoutputcommitter.algorithm.version=2 and are using encrypted S3 
buckets.

This has been working fine for us, but perhaps as we've been running more jobs 
in parallel, we've started getting errors like

Status Code: 503, AWS Service: Amazon S3, AWS Request ID: ..., AWS Error Code: 
SlowDown, AWS Error Message: Please reduce your request rate., S3 Extended 
Request ID: ...

We enabled CloudWatch S3 request metrics for one of our buckets and I was a 
little alarmed to see spikes of over 800k S3 requests over a minute or so, with 
the bulk of them HEAD requests.

We read and write Parquet files, and most tables have around 50 shards/parts, 
though some have up to 200. I imagine there's additional parallelism when 
reading a shard in Parquet, though.

Has anyone else encountered this? How did you solve it?

I'd sure prefer to avoid copying all our data in and out of HDFS for each job, 
if possible.

Thanks!





Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone,

Would you mind to share the minimized code to reproduce this issue?

Yanbo

On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti wrote:

> Hello, I have this problem and Google is not helping. Instead, it looks
> like an unreported bug and there are no hints at possible workarounds.
>
> the error is the following:
>
> Traceback (most recent call last):
>   File 
> "/home/simone/motionlogic/trip-labeler/test/trip_labeler_test/model_test.py",
> line 43, in test_make_trip_matrix
> entries = trip_matrix.entries.map(lambda entry: (entry.i, entry.j,
> entry.value)).collect()
>   File "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
> line 770, in collect
> with SCCallSiteSync(self.context) as css:
>   File 
> "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/traceback_utils.py",
> line 72, in __enter__
> self._context._jsc.setCallSite(self._call_site)
> AttributeError: 'NoneType' object has no attribute 'setCallSite'
>
> and it is raised when I try to collect a
> pyspark.mllib.linalg.distributed.CoordinateMatrix's
> entries with .collect(). It happens only when this runs in a test suite
> with more than one class, so it's probably related to the creation and
> destruction of SparkContexts, but I cannot understand how.
>
> Spark version is 1.6.2
>
> I saw multiple references to this error for other classes in the pyspark
> ml library, but none of them contained hints toward the solution.
>
> I'm running tests through nosetests when it breaks. Running a single
> TestCase in Intellij works fine.
>
> Is there a known solution? Is it a known problem?
>
> Thank you,
>
> Simone
>


Re: UDAFs for sketching Dataset columns with T-Digests

2017-07-06 Thread Sam Bessalah
This is interesting and very useful.
Thanks.

On Thu, Jul 6, 2017 at 2:33 AM, Erik Erlandson  wrote:

> After my talk on T-Digests in Spark at Spark Summit East, there were some
> requests for a UDAF-based interface for working with Datasets.   I'm
> pleased to announce that I released a library for doing T-Digest sketching
> with UDAFs:
>
> https://github.com/isarn/isarn-sketches-spark
>
> This initial release provides support for Scala. Future releases will
> support PySpark bindings, and additional tools for leveraging T-Digests in
> ML pipelines.
>
> Cheers!
> Erik
>