Re: Problem with Generalized Regression Model

2017-01-09 Thread sethah
This likely indicates that the IRLS solver for GLR has encountered a singular
matrix. Can you check if you have linearly dependent columns in your data?
Also, this error message should be fixed in the latest version of Spark,
after:  https://issues.apache.org/jira/browse/SPARK-11918
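For what it's worth, a small sketch of one way to look for linearly dependent feature
columns before fitting; `df` and the column names are hypothetical stand-ins for the
training DataFrame:

// Pairwise Pearson correlations near +/-1.0 (or a constant column) are a strong
// hint of linear dependence that can make the IRLS normal equations singular.
val featureCols = Seq("x1", "x2", "x3")
for {
  a <- featureCols
  b <- featureCols
  if a < b
} {
  val c = df.stat.corr(a, b) // DataFrameStatFunctions.corr (Pearson)
  if (math.abs(c) > 0.999) println(s"$a and $b look linearly dependent (corr = $c)")
}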
  






Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Palash Gupta
Hello Mr. Burton,
Can you share example code showing how you implemented this, so other users can see?
"So I think what we did was a repartition that was too large, and now we've run out of
memory in the spark shell."
Thanks!
P.Gupta
  On Tue, 10 Jan, 2017 at 8:20 am, Kevin Burton wrote:   
Ah.. ok. I think I know what's happening now. I think we found this problem 
when running a job and doing a repartition()
Spark is just way way way too sensitive to memory configuration.
The 2GB per shuffle limit is also insanely silly in 2017.
So I think what we did was a repartition that was too large, and now we've run out of
memory in the spark shell.
On Mon, Jan 9, 2017 at 5:53 PM, Steven Ruppert  wrote:

The spark-shell process alone shouldn't take up that much memory, at least in 
my experience. Have you dumped the heap to see what's all in there? What 
environment are you running spark in?

Doing stuff like RDD.collect() or .countByKey will pull potentially a lot of
data into the spark-shell heap. Another thing that can fill up the spark
master process heap (which is also run in the spark-shell process) is running
lots of jobs, the logged SparkEvents of which stick around in order for the UI
to render. There are some options under `spark.ui.retained*` to limit that if
it's a problem.


On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton  wrote:

We've had various OOM issues with spark and have been trying to track them down 
one by one.
Now we have one in spark-shell which is super surprising.
We currently allocate 6GB to spark shell, as confirmed via 'ps'
Why the heck would the *shell* need that much memory.
I'm going to try to give it more of course but would be nice to know if this is 
a legitimate memory constraint or there is a bug somewhere.
PS: One thought I had was that it would be nice to have spark keep track of 
where an OOM was encountered, in what component.
Kevin

-- 


We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com … or check out my Google+ profile








-- 


We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com… or check out my Google+ profile


  


Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
Ah.. ok. I think I know what's happening now. I think we found this problem
when running a job and doing a repartition()

Spark is just way way way too sensitive to memory configuration.

The 2GB per shuffle limit is also insanely silly in 2017.

So I think what we did was a repartition that was too large, and now we've run
out of memory in the spark shell.

On Mon, Jan 9, 2017 at 5:53 PM, Steven Ruppert 
wrote:

> The spark-shell process alone shouldn't take up that much memory, at least
> in my experience. Have you dumped the heap to see what's all in there? What
> environment are you running spark in?
>
> Doing stuff like RDD.collect() or .countByKey will pull potentially a lot
> of data into the spark-shell heap. Another thing that can fill up the
> spark master process heap (which is also run in the spark-shell process) is
> running lots of jobs, the logged SparkEvents of which stick around in order
> for the UI to render. There are some options under `spark.ui.retained*` to
> limit that if it's a problem.
>
>
> On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton  wrote:
>
>> We've had various OOM issues with spark and have been trying to track
>> them down one by one.
>>
>> Now we have one in spark-shell which is super surprising.
>>
>> We currently allocate 6GB to spark shell, as confirmed via 'ps'
>>
>> Why the heck would the *shell* need that much memory.
>>
>> I'm going to try to give it more of course but would be nice to know if
>> this is a legitimate memory constraint or there is a bug somewhere.
>>
>> PS: One thought I had was that it would be nice to have spark keep track
>> of where an OOM was encountered, in what component.
>>
>> Kevin
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>
>




-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Steven Ruppert
The spark-shell process alone shouldn't take up that much memory, at least
in my experience. Have you dumped the heap to see what's all in there? What
environment are you running spark in?

Doing stuff like RDD.collect() or .countByKey will pull potentially a lot
of data into the spark-shell heap. Another thing that can fill up the
spark master process heap (which is also run in the spark-shell process) is
running lots of jobs, the logged SparkEvents of which stick around in order
for the UI to render. There are some options under `spark.ui.retained*` to
limit that if it's a problem.
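For reference, a small sketch of trimming that retained UI state; the two keys below do
exist, the values are arbitrary examples, and with spark-shell they would normally be
passed as --conf options at launch rather than set in code:

import org.apache.spark.SparkConf

// Keep fewer completed jobs/stages in the driver so the SparkEvents retained for
// the UI don't accumulate without bound in the spark-shell heap.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "250")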


On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton  wrote:

> We've had various OOM issues with spark and have been trying to track them
> down one by one.
>
> Now we have one in spark-shell which is super surprising.
>
> We currently allocate 6GB to spark shell, as confirmed via 'ps'
>
> Why the heck would the *shell* need that much memory.
>
> I'm going to try to give it more of course but would be nice to know if
> this is a legitimate memory constraint or there is a bug somewhere.
>
> PS: One thought I had was that it would be nice to have spark keep track
> of where an OOM was encountered, in what component.
>
> Kevin
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>



Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Raymond Xie
To rule out the firewall blocking the port, I added a rule in Windows Firewall
to allow all inbound and outbound traffic on port 1. I then tried telnet
ec2-35-160-128-113.us-west-2.compute.amazonaws.com 1 in PuTTY and it
still doesn't work.


*Sincerely yours,*

*Raymond*

On Mon, Jan 9, 2017 at 4:53 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Also, meant to add the link to the docs: https://docs.databricks.com/
> user-guide/faq/tableau.html
>
>
>
>
>
> *From: *Silvio Fiorito 
> *Date: *Monday, January 9, 2017 at 2:59 PM
> *To: *Raymond Xie , user 
> *Subject: *Re: How to connect Tableau to databricks spark?
>
>
>
> Hi Raymond,
>
>
>
> Are you using a Spark 2.0 or 1.6 cluster? With Spark 2.0 it’s just a
> matter of entering the hostname of your Databricks environment, the HTTP
> path from the cluster page, and your Databricks credentials.
>
>
>
> Thanks,
>
> Silvio
>
>
>
> *From: *Raymond Xie 
> *Date: *Sunday, January 8, 2017 at 10:30 PM
> *To: *user 
> *Subject: *How to connect Tableau to databricks spark?
>
>
>
> I want to do some data analytics work by leveraging Databricks spark
> platform and connect my Tableau desktop to it for data visualization.
>
>
>
> Has anyone ever made it work? I've been trying to follow the instructions
> below but have not been successful:
>
>
>
> https://docs.cloud.databricks.com/docs/latest/databricks_
> guide/01%20Databricks%20Overview/14%20Third%20Party%
> 20Integrations/01%20Setup%20JDBC%20or%20ODBC.html
>
>
>
>
>
> I got an error message in Tableau's attempt to connect:
>
>
>
> Unable to connect to the server "ec2-35-160-128-113.us-west-2.
> compute.amazonaws.com". Check that the server is running and that you
> have access privileges to the requested database.
>
>
>
> "ec2-35-160-128-113.us-west-2.compute.amazonaws.com" is the hostname of a
> EC2 instance I just created on AWS, I may have some missing there though as
> I am new to AWS.
>
>
>
> I am not sure that is related to account issue, I was using my Databricks
> account in Tableau to connect it.
>
>
>
> Thank you very much. Any clue is appreciated.
>
>
> **
>
> *Sincerely yours,*
>
>
>
> *Raymond*
>


spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
We've had various OOM issues with spark and have been trying to track them
down one by one.

Now we have one in spark-shell which is super surprising.

We currently allocate 6GB to spark shell, as confirmed via 'ps'

Why the heck would the *shell* need that much memory.

I'm going to try to give it more of course but would be nice to know if
this is a legitimate memory constraint or there is a bug somewhere.

PS: One thought I had was that it would be nice to have spark keep track of
where an OOM was encountered, in what component.

Kevin


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com
wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org
HTH!
 





On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com 
cont...@chrissmurphy.com
wrote:
PLEASE!!

UNSUBSCRIBE

2017-01-09 Thread Chris Murphy - ChrisSMurphy.com
PLEASE!!


unsubscribe

2017-01-09 Thread william tellme



Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Abhishek Bhandari
Glad that you found it.

On Mon, Jan 9, 2017 at 3:29 PM, Richard Siebeling 
wrote:

> Probably found it, it turns out that Mesos should be explicitly added
> while building Spark, I assumed I could use the old build command that I
> used for building Spark 2.0.0... Didn't see the two lines added in the
> documentation...
>
> Maybe these kind of changes could be added in the changelog under changes
> of behaviour or changes in the build process or something like that,
>
> kind regards,
> Richard
>
>
> On 9 January 2017 at 22:55, Richard Siebeling 
> wrote:
>
>> Hi,
>>
>> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
>> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
> Mesos is running fine (both the master and the slave; it's a single
> machine configuration).
>>
>> I really don't understand why this is happening since the same
>> configuration but using a Spark 2.0.0 is running fine within Vagrant.
>> Could someone please help?
>>
>> thanks in advance,
>> Richard
>>
>>
>>
>>
>


-- 
*Abhishek J Bhandari*
Mobile No. +1 510 493 6205 (USA)
Mobile No. +91 96387 93021 (IND)
*R & D Department*
*Valent Software Inc. CA*
Email: *abhis...@valent-software.com *


Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Probably found it: it turns out that Mesos support should be explicitly added while
building Spark. I assumed I could use the old build command that I used for
building Spark 2.0.0... I didn't see the two lines added in the
documentation...

Maybe these kinds of changes could be added to the changelog, under changes
in behaviour or changes in the build process, or something like that.

kind regards,
Richard


On 9 January 2017 at 22:55, Richard Siebeling  wrote:

> Hi,
>
> I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
> parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
> Mesos is running fine (both the master and the slave; it's a single machine
> configuration).
>
> I really don't understand why this is happening since the same
> configuration but using a Spark 2.0.0 is running fine within Vagrant.
> Could someone please help?
>
> thanks in advance,
> Richard
>
>
>
>


Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-09 Thread Richard Siebeling
Hi,

I'm setting up Apache Spark 2.1.0 on Mesos and I am getting a "Could not
parse Master URL: 'mesos://xx.xx.xxx.xxx:5050'" error.
Mesos is running fine (both the master and the slave; it's a single machine
configuration).

I really don't understand why this is happening since the same
configuration but using a Spark 2.0.0 is running fine within Vagrant.
Could someone please help?

thanks in advance,
Richard


Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Silvio Fiorito
Also, meant to add the link to the docs: 
https://docs.databricks.com/user-guide/faq/tableau.html


From: Silvio Fiorito 
Date: Monday, January 9, 2017 at 2:59 PM
To: Raymond Xie , user 
Subject: Re: How to connect Tableau to databricks spark?

Hi Raymond,

Are you using a Spark 2.0 or 1.6 cluster? With Spark 2.0 it’s just a matter of 
entering the hostname of your Databricks environment, the HTTP path from the 
cluster page, and your Databricks credentials.

Thanks,
Silvio

From: Raymond Xie 
Date: Sunday, January 8, 2017 at 10:30 PM
To: user 
Subject: How to connect Tableau to databricks spark?

I want to do some data analytics work by leveraging Databricks spark platform 
and connect my Tableau desktop to it for data visualization.

Has anyone ever made it work? I've been trying to follow the instructions below
but have not been successful:

https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/14%20Third%20Party%20Integrations/01%20Setup%20JDBC%20or%20ODBC.html


I got an error message in Tableau's attempt to connect:

Unable to connect to the server 
"ec2-35-160-128-113.us-west-2.compute.amazonaws.com".
 Check that the server is running and that you have access privileges to the 
requested database.

"ec2-35-160-128-113.us-west-2.compute.amazonaws.com"
 is the hostname of a EC2 instance I just created on AWS, I may have some 
missing there though as I am new to AWS.

I am not sure that is related to account issue, I was using my Databricks 
account in Tableau to connect it.

Thank you very much. Any clue is appreciated.


Sincerely yours,


Raymond


Re: Spark 2.x OFF_HEAP persistence

2017-01-09 Thread Gene Pang
Yes, as far as I can tell, your description is accurate.

Thanks,
Gene

On Wed, Jan 4, 2017 at 9:37 PM, Vin J  wrote:

> Thanks for the reply Gene. Looks like this means that, with Spark 2.x, one has
> to change from rdd.persist(StorageLevel.OFF_HEAP) to
> rdd.saveAsTextFile(alluxioPath) / rdd.saveAsObjectFile(alluxioPath) for
> guarantees like the persisted RDD surviving a Spark JVM crash etc., as well as
> the other benefits you mention.
>
> Vin.
>
> On Thu, Jan 5, 2017 at 2:50 AM, Gene Pang  wrote:
>
>> Hi Vin,
>>
>> From Spark 2.x, OFF_HEAP was changed to no longer directly interface with
>> an external block store. The previous tight dependency was restrictive and
>> reduced flexibility. It looks like the new version uses the executor's off
>> heap memory to allocate direct byte buffers, and does not interface with
>> any external system for the data storage. I am not aware of a way to
>> connect the new version of OFF_HEAP to Alluxio.
>>
>> You can experience similar benefits of the old OFF_HEAP <-> Tachyon mode
>> as well as additional benefits like unified namespace
>> 
>>  or
>> sharing in-memory data across applications, by using the Alluxio
>> filesystem API
>> .
>>
>> I hope this helps!
>>
>> Thanks,
>> Gene
>>
>> On Wed, Jan 4, 2017 at 10:50 AM, Vin J  wrote:
>>
>>> Until Spark 1.6 I see there were specific properties to configure such
>>> as the external block store master url (spark.externalBlockStore.url) etc
>>> to use OFF_HEAP storage level which made it clear that an external Tachyon
>>> type of block store as required/used for OFF_HEAP storage.
>>>
>>> Can someone clarify how this has been changed in Spark 2.x - because I
>>> do not see config settings anymore that point Spark to an external block
>>> store like Tachyon (now Alluxio) (or am i missing seeing it?)
>>>
>>> I understand there are ways to use Alluxio with Spark, but how about
>>> OFF_HEAP storage - can Spark 2.x OFF_HEAP rdd persistence still exploit
>>> alluxio/external block store? Any pointers to design decisions/Spark JIRAs
>>> related to this will also help.
>>>
>>> Thanks,
>>> Vin.
>>>
>>
>>
>
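Summarizing the two options discussed in this thread as a rough sketch (the rdd, the
off-heap settings, and the Alluxio path are illustrative placeholders, not a definitive
recipe):

import org.apache.spark.storage.StorageLevel

// Spark 2.x OFF_HEAP: uses the executor's own off-heap memory, which must be enabled
// explicitly (spark.memory.offHeap.enabled=true, spark.memory.offHeap.size=2g).
// It no longer talks to Tachyon/Alluxio, so the cached data does not survive a JVM crash.
rdd.persist(StorageLevel.OFF_HEAP)

// Alluxio filesystem API: write the data out explicitly if it must survive a crash
// or be shared across applications.
rdd.saveAsObjectFile("alluxio://alluxio-master:19998/shared/my-rdd")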


Re: Spark Master stops abruptly

2017-01-09 Thread streamly tester
After investigating thoroughly, I found out that it was ZooKeeper that was causing
the issue. In fact, one of the three ZooKeeper nodes that I configured was down.

On Mon, Jan 9, 2017 at 7:16 PM, streamly tester 
wrote:

> Hi,
>
> I have setup spark in cluster of 4 machines, with 2 masters and 2 workers.
> I have zookeeper who does the election of the master. Here is my
> configuration
>
> The spark-env.sh contains
>
> export SPARK_MASTER_IP=master1,master2
>
> The conf/slaves contains
>
> worker1
> worker2
>
> conf/spark-defaults.conf
>
> spark.master=spark://master2:7077,master1:7077
> spark.driver.memory=1g
> spark.executor.memory=1g
> spark.eventLog.enabled=true
> spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/
> conf/log4j.properties
> spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/
> opt/spark/conf/log4j.properties
> spark.deploy.recoveryMode=ZOOKEEPER
> spark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
> spark.deploy.zookeeper.dir=/spark
>
> spark-masters
>
> master1
> master2
>
> For ssh communication, I have added the following in .ssh/config
>
> Host 0.0.0.0
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>   Host localhost
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>   Host master2
>   #User ubuntu
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>   Host master1
>   #User ubuntu
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>   Host worker1
>   #User ubuntu
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>   Host worker2
>   #User ubuntu
>   IdentityFile ~/.ssh/spark.key
>   StrictHostKeyChecking no
>
>
> These configurations work well with Spark 2.0.2 and I'm able to submit a
> Spark job and visualize it on the web UI of Spark.
>
> While working on my cluster, I suddenly wasn't able to submit anything to
> spark and its web ui became unreachable.
>
> While looking at the logs of the master I saw the following
>
> 17/01/07 10:43:04 INFO Master: Driver submitted org.apache.spark.deploy.
> worker.DriverWrapper
> 17/01/07 10:43:04 INFO Master: Launching driver driver-20170107104304-0003
> on worker worker-20170106162522-192.168.0.143-41738
> 17/01/07 10:43:07 INFO Master: Removing driver: driver-20170107104304-0003
> 17/01/07 11:11:21 INFO Master: Driver submitted org.apache.spark.deploy.
> worker.DriverWrapper
> 17/01/07 11:11:21 INFO Master: Launching driver driver-2017010721-0004
> on worker worker-20170106162522-192.168.0.143-41738
> 17/01/07 11:11:25 INFO Master: Removing driver: driver-2017010721-0004
> 17/01/07 11:24:02 INFO Master: Driver submitted org.apache.spark.deploy.
> worker.DriverWrapper
> 17/01/07 11:24:02 INFO Master: Launching driver driver-20170107112402-0005
> on worker worker-20170106162522-192.168.0.143-41738
> 17/01/07 11:24:12 INFO Master: Removing driver: driver-20170107112402-0005
> 17/01/07 11:39:05 INFO ClientCnxn: Client session timed out, have not
> heard from server in 26678ms for sessionid 0x35975540ed40003, closing
> socket connection and attempting reconnect
> 17/01/07 11:39:05 INFO ConnectionStateManager: State change: SUSPENDED
> 17/01/07 11:39:05 INFO ZooKeeperLeaderElectionAgent: We have lost
> leadership
> 17/01/07 11:39:05 ERROR Master: Leadership has been revoked -- master
> shutting down.
>
> And those of the workers
>
> 17/01/07 11:53:06 INFO Worker: Retrying connection to master (attempt # 15)
> 17/01/07 11:53:06 INFO Worker: Connecting to master ottawa41:7077...
> 17/01/07 11:53:06 WARN Worker: Failed to connect to master ottawa41:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.
> applyOrElse(RpcTimeout.scala:77)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$1.
> applyOrElse(RpcTimeout.scala:75)
> at scala.runtime.AbstractPartialFunction.apply(
> AbstractPartialFunction.scala:36)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.
> applyOrElse(RpcTimeout.scala:59)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.
> applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
> at org.apache.spark.deploy.worker.Worker$$anonfun$org$
> apache$spark$deploy$worker$Worker$$reregisterWithMaster$
> 1$$anon$2.run(Worker.scala:272)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to ottawa41/
> 

Re: How to connect Tableau to databricks spark?

2017-01-09 Thread Silvio Fiorito
Hi Raymond,

Are you using a Spark 2.0 or 1.6 cluster? With Spark 2.0 it’s just a matter of 
entering the hostname of your Databricks environment, the HTTP path from the 
cluster page, and your Databricks credentials.

Thanks,
Silvio

From: Raymond Xie 
Date: Sunday, January 8, 2017 at 10:30 PM
To: user 
Subject: How to connect Tableau to databricks spark?

I want to do some data analytics work by leveraging Databricks spark platform 
and connect my Tableau desktop to it for data visualization.

Has anyone ever made it work? I've been trying to follow the instructions below
but have not been successful:

https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20Databricks%20Overview/14%20Third%20Party%20Integrations/01%20Setup%20JDBC%20or%20ODBC.html


I got an error message in Tableau's attempt to connect:

Unable to connect to the server 
"ec2-35-160-128-113.us-west-2.compute.amazonaws.com".
 Check that the server is running and that you have access privileges to the 
requested database.

"ec2-35-160-128-113.us-west-2.compute.amazonaws.com"
 is the hostname of a EC2 instance I just created on AWS, I may have some 
missing there though as I am new to AWS.

I am not sure that is related to account issue, I was using my Databricks 
account in Tableau to connect it.

Thank you very much. Any clue is appreciated.


Sincerely yours,


Raymond


Spark UI not coming up in EMR

2017-01-09 Thread Saurabh Malviya (samalviy)
The Spark web UI for detailed monitoring of streaming jobs stops rendering after 2
weeks; it just keeps looping while fetching the page. Is there any way I can get that
page, or logs where I can see how many events are coming into Spark for each interval?

-Saurabh


Re: Spark SQL 1.6.3 ORDER BY and partitions

2017-01-09 Thread Yong Zhang
I am not sure what you mean by "table" being comprised of 200/1200 partitions.


A partition could mean the dataset (RDD/DataFrame) is chunked within Spark and
then processed, or it could mean the partition metadata you define for the table
in Hive.

If you mean the first one, then you control the number of partitions with
'spark.sql.shuffle.partitions', which has a default value of 200.

I would be surprised if a query works with the default of 200 but fails with the
new value of 1200. In general, when you increase this value, you force more
partitions in your DataFrame, which leads to less data per partition. So if you
set this value too high it may hurt performance, but it should not fail a job that
you can run with a smaller configured value.

Yong


From: Joseph Naegele 
Sent: Friday, January 6, 2017 1:14 PM
To: 'user'
Subject: Spark SQL 1.6.3 ORDER BY and partitions

I have two separate but similar issues that I've narrowed down to a pretty good 
level of detail. I'm using Spark 1.6.3, particularly Spark SQL.

I'm concerned with a single dataset for now, although the details apply to 
other, larger datasets. I'll call it "table". It's around 160 M records, 
average of 78 bytes each, so about 12 GB uncompressed. It's 2 GB compressed in 
HDFS.

First issue:
The following query works if "table" is comprised of 200 partitions (on disk), 
but fails when "table" is 1200 partitions with the "Total size of serialized 
results of 1031 tasks (6.0 GB) is bigger than spark.driver.maxResultSize (6.0 
GB)" error:

SELECT * FROM orc.`table` ORDER BY field DESC LIMIT 10;

This is possibly related to the TakeOrderedAndProject step in the execution 
plan, because the following queries do not give me problems:

SELECT * FROM orc.`table`;
SELECT * FROM orc.`table` ORDER BY field DESC;
SELECT * FROM orc.`table` LIMIT 10;

All of which have different execution plans.
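(As an aside, the plans themselves can be compared directly with extended explain output,
using the same placeholder table name as above:)

// Shows whether Spark picks TakeOrderedAndProject, CollectLimit, or a full
// Sort + Exchange for each variant.
sqlContext.sql("SELECT * FROM orc.`table` ORDER BY field DESC LIMIT 10").explain(true)
sqlContext.sql("SELECT * FROM orc.`table` ORDER BY field DESC").explain(true)
sqlContext.sql("SELECT * FROM orc.`table` LIMIT 10").explain(true)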
My "table" has 1200 partitions because I must use a large value for 
spark.sql.shuffle.partitions to handle joins and window functions on much 
larger DataFrames in my application. Too many partitions may be suboptimal, but 
it shouldn't lead to large serialized results, correct?

Any ideas? I've seen https://issues.apache.org/jira/browse/SPARK-12837, but I 
think my issue is a bit more specific.


Second issue:
The difference between execution when calling .cache() and .count() on the 
following two DataFrames:

A: sqlContext.sql("SELECT * FROM table")
B: sqlContext.sql("SELECT * FROM table ORDER BY field DESC")

Counting the rows of A works as expected. A single Spark job with 2 stages. 
Load from Hadoop, map, aggregate, reduce to a number.

The same can't be said for B, however. The .cache() call spawns a Spark job 
before I even call .count(), loading from HDFS and performing ConvertToSafe and 
Exchange. The .count() call spawns another job, the first task of which appears 
to re-load from HDFS and again perform ConvertToSafe and Exchange, writing 1200 
shuffle partitions. The next stage then proceeds to read the shuffle data 
across only 2 tasks. One of these tasks completes immediately and the other 
runs indefinitely, failing because the partition is too large (the 
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE error).

Does this behavior make sense at all? Obviously it doesn't make sense to sort 
rows if I'm just counting them, but this is a simplified example of a more 
complex application in which caching makes sense. My executors have more than 
enough memory to cache this entire DataFrame.

Thanks for reading

---
Joe Naegele
Grier Forensics






Spark Master stops abruptly

2017-01-09 Thread streamly tester
Hi,

I have setup spark in cluster of 4 machines, with 2 masters and 2 workers.
I have zookeeper who does the election of the master. Here is my
configuration

The spark-env.sh contains

export SPARK_MASTER_IP=master1,master2

The conf/slaves contains

worker1
worker2

conf/spark-defaults.conf

spark.master=spark://master2:7077,master1:7077
spark.driver.memory=1g
spark.executor.memory=1g
spark.eventLog.enabled=true
spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties
spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir=/spark

spark-masters

master1
master2

For ssh communication, I have added the following in .ssh/config

Host 0.0.0.0
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no

  Host localhost
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no

  Host master2
  #User ubuntu
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no

  Host master1
  #User ubuntu
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no

  Host worker1
  #User ubuntu
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no

  Host worker2
  #User ubuntu
  IdentityFile ~/.ssh/spark.key
  StrictHostKeyChecking no


These configurations work well with Spark 2.0.2 and I'm able to submit a
Spark job and visualize it on the web UI of Spark.

While working on my cluster, I suddenly wasn't able to submit anything to
spark and its web ui became unreachable.

While looking at the logs of the master I saw the following

17/01/07 10:43:04 INFO Master: Driver submitted
org.apache.spark.deploy.worker.DriverWrapper
17/01/07 10:43:04 INFO Master: Launching driver driver-20170107104304-0003
on worker worker-20170106162522-192.168.0.143-41738
17/01/07 10:43:07 INFO Master: Removing driver: driver-20170107104304-0003
17/01/07 11:11:21 INFO Master: Driver submitted
org.apache.spark.deploy.worker.DriverWrapper
17/01/07 11:11:21 INFO Master: Launching driver driver-2017010721-0004
on worker worker-20170106162522-192.168.0.143-41738
17/01/07 11:11:25 INFO Master: Removing driver: driver-2017010721-0004
17/01/07 11:24:02 INFO Master: Driver submitted
org.apache.spark.deploy.worker.DriverWrapper
17/01/07 11:24:02 INFO Master: Launching driver driver-20170107112402-0005
on worker worker-20170106162522-192.168.0.143-41738
17/01/07 11:24:12 INFO Master: Removing driver: driver-20170107112402-0005
17/01/07 11:39:05 INFO ClientCnxn: Client session timed out, have not heard
from server in 26678ms for sessionid 0x35975540ed40003, closing socket
connection and attempting reconnect
17/01/07 11:39:05 INFO ConnectionStateManager: State change: SUSPENDED
17/01/07 11:39:05 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
17/01/07 11:39:05 ERROR Master: Leadership has been revoked -- master
shutting down.

And those of the workers

17/01/07 11:53:06 INFO Worker: Retrying connection to master (attempt # 15)
17/01/07 11:53:06 INFO Worker: Connecting to master ottawa41:7077...
17/01/07 11:53:06 WARN Worker: Failed to connect to master ottawa41:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at
org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1$$anon$2.run(Worker.scala:272)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to ottawa41/
192.168.0.141:7077
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at
org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.ConnectException: 

Re: What is the difference between hive on spark and spark on hive?

2017-01-09 Thread Nicholas Hakobian
Hive on Spark is Hive, which takes SQL statements in and creates Spark jobs
for processing instead of MapReduce or Tez jobs.

There is no such thing as "Spark on Hive", but there is SparkSQL. SparkSQL
can accept both programmatic statements or it can parse SQL statements to
produce a native Spark DataFrame. It does provide connectivity to the Hive
metastore, and in Spark 1.6 does call into Hive to provide functionality
that doesn't yet exist natively in Spark. I'm not sure how much of that
still exists in Spark 2.0, but I think much of it has been converted into
native Spark functions.

There is also the SparkSQL shell and thrift server which provides a SQL
only interface, but uses all the native Spark pipeline.

Hope this helps!
-Nick

Nicholas Szandor Hakobian, Ph.D.
Senior Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Mon, Jan 9, 2017 at 7:05 AM, 李斌松  wrote:

> What is the difference between hive on spark and spark on hive?
>


Re: Spark GraphFrame ConnectedComponents

2017-01-09 Thread Ankur Srivastava
Hi Steve,

I could get the application working by setting "spark.hadoop.fs.default.name".
Thank you!!

And thank you for your input on using S3 for checkpoint. I am still working
on PoC so will consider using HDFS for the final implementation.
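For anyone hitting the same error, a rough sketch of the setup described above for Spark
1.6 + GraphFrames 0.3 (the bucket name, app name, and the vertices/edges DataFrames are
placeholders; per Steve's comments below, HDFS is the safer checkpoint location):

import org.apache.spark.{SparkConf, SparkContext}
import org.graphframes.GraphFrame

// Point the default Hadoop filesystem at S3 so the checkpoint cleanup code resolves
// paths against s3n:// instead of the local file:// filesystem.
val conf = new SparkConf()
  .setAppName("graphframes-connected-components")
  .set("spark.hadoop.fs.default.name", "s3n://my-bucket")
val sc = new SparkContext(conf)
sc.setCheckpointDir("s3n://my-bucket/checkpoints")

// vertices and edges are assumed to be DataFrames loaded elsewhere
val g = GraphFrame(vertices, edges)
val components = g.connectedComponents.run()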

Thanks
Ankur

On Fri, Jan 6, 2017 at 9:57 AM, Steve Loughran 
wrote:

>
> On 5 Jan 2017, at 21:10, Ankur Srivastava 
> wrote:
>
> Yes, I did try it out and it chooses the local file system as my checkpoint
> location starts with s3n://
>
> I am not sure how can I make it load the S3FileSystem.
>
>
> set fs.default.name to s3n://whatever , or, in spark context,
> spark.hadoop.fs.default.name
>
> However
>
> 1. you should really use s3a, if you have the hadoop 2.7 JARs on your
> classpath.
> 2. neither s3n nor s3a is a real filesystem, and certain assumptions that
> checkpointing code tends to make (renames being O(1) atomic calls) do not
> hold. It may be that checkpointing to S3 isn't as robust as you'd like.
>
>
>
>
> On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
> wrote:
>
>> Right, I'd agree, it seems to be only with delete.
>>
>> Could you by chance run just the delete to see if it fails
>>
>> FileSystem.get(sc.hadoopConfiguration)
>> .delete(new Path(somepath), true)
>> --
>> *From:* Ankur Srivastava 
>> *Sent:* Thursday, January 5, 2017 10:05:03 AM
>> *To:* Felix Cheung
>> *Cc:* user@spark.apache.org
>>
>> *Subject:* Re: Spark GraphFrame ConnectedComponents
>>
>> Yes it works to read the vertices and edges data from S3 location and is
>> also able to write the checkpoint files to S3. It only fails when deleting
>> the data and that is because it tries to use the default file system. I
>> tried looking up how to update the default file system but could not find
>> anything in that regard.
>>
>> Thanks
>> Ankur
>>
>> On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
>> wrote:
>>
>>> From the stack it looks to be an error from the explicit call to
>>> hadoop.fs.FileSystem.
>>>
>>> Is the URL scheme for s3n registered?
>>> Does it work when you try to read from s3 from Spark?
>>>
>>> _
>>> From: Ankur Srivastava 
>>> Sent: Wednesday, January 4, 2017 9:23 PM
>>> Subject: Re: Spark GraphFrame ConnectedComponents
>>> To: Felix Cheung 
>>> Cc: 
>>>
>>>
>>>
>>> This is the exact trace from the driver logs
>>>
>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
>>> FS: s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7
>>> be/connected-components-c1dbc2b0/3, expected: file:///
>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>> ileSystem.java:80)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileSta
>>> tus(RawLocalFileSystem.java:529)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInt
>>> ernal(RawLocalFileSystem.java:747)
>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>>> alFileSystem.java:524)
>>> at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileS
>>> ystem.java:534)
>>> at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib
>>> $ConnectedComponents$$run(ConnectedComponents.scala:340)
>>> at org.graphframes.lib.ConnectedComponents.run(ConnectedCompone
>>> nts.scala:139)
>>> at GraphTest.main(GraphTest.java:31) --- Application Class
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>>> ssorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> thodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>>> $SparkSubmit$$runMain(SparkSubmit.scala:731)
>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>>> .scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10
>>>
>>> Thanks
>>> Ankur
>>>
>>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
>>> ankur.srivast...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I am rerunning the pipeline to generate the exact trace, I have below
>>>> part of trace from last run:
>>>>
>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
>>>> FS: s3n://, expected: file:///
>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>>> ileSystem.java:69)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>>>>

Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hi,

Currently, I have been using Spark 2.1.0 for ML and so far have not
experienced any critical issues. It's much more stable compared to Spark
2.0.1/2.0.2, I would say.

Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 9 January 2017 at 16:36, Ankur Jain  wrote:

>
>
> Thanks Rezaul…
>
>
>
> Does Spark 2.1.0 still have any issues w.r.t. stability?
>
>
>
> Regards,
>
> Ankur
>
>
>
> *From:* Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org]
> *Sent:* Monday, January 09, 2017 5:02 PM
> *To:* Ankur Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Machine Learning in Spark 1.6 vs Spark 2.0
>
>
>
> Hello Jain,
>
> I would recommend using Spark MLlib
> (and ML) of *Spark
> 2.1.0* with the following features:
>
>- ML Algorithms: common learning algorithms such as classification,
>regression, clustering, and collaborative filtering
>- Featurization: feature extraction, transformation, dimensionality
>reduction, and selection
>- Pipelines: tools for constructing, evaluating, and tuning ML
>Pipelines
>- Persistence: saving and load algorithms, models, and Pipelines
>- Utilities: linear algebra, statistics, data handling, etc.
>
> These features will help make your machine learning scalable and easy too.
>
>
> Regards,
> _
>
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
>
> IDA Business Park, Dangan, Galway, Ireland
>
> Web: http://www.reza-analytics.eu/index.html
> 
>
>
>
> On 9 January 2017 at 10:19, Ankur Jain  wrote:
>
> Hi Team,
>
>
>
> I want to start a new project with ML. But wanted to know which version of
> Spark is much stable and have more features w.r.t ML
>
> Please suggest your opinion…
>
>
>
> Thanks in Advance…
>
>
>
> *Thanks & Regards*
> Ankur Jain
> Technical Architect – Big Data | IoT | Innovation Group
> Board: +91-731-663-6363
> Direct: +91-731-663-6125
> www.yash.com
>


RE: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain

Thanks Rezaul…

Does Spark 2.1.0 still have any issues w.r.t. stability?

Regards,
Ankur

From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org]
Sent: Monday, January 09, 2017 5:02 PM
To: Ankur Jain 
Cc: user@spark.apache.org
Subject: Re: Machine Learning in Spark 1.6 vs Spark 2.0

Hello Jain,
I would recommend using Spark MLlib 
 (and ML) of Spark 2.1.0 
with the following features:

  *   ML Algorithms: common learning algorithms such as classification, 
regression, clustering, and collaborative filtering
  *   Featurization: feature extraction, transformation, dimensionality 
reduction, and selection
  *   Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  *   Persistence: saving and load algorithms, models, and Pipelines
  *   Utilities: linear algebra, statistics, data handling, etc.
These features will help make your machine learning scalable and easy too.
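To give a flavour of the Pipelines and Persistence items above, a tiny hedged example
(the column names, data, and save path are made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A toy three-stage pipeline: tokenize text -> hash into feature vectors -> fit LR.
// `training` is assumed to be a DataFrame with "text" and "label" columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.overwrite().save("/tmp/lr-pipeline-model") // persistence of the fitted model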


Regards,
_
Md. Rezaul Karim, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 9 January 2017 at 10:19, Ankur Jain 
> wrote:
Hi Team,

I want to start a new project with ML. But wanted to know which version of 
Spark is much stable and have more features w.r.t ML
Please suggest your opinion…

Thanks in Advance…

Thanks & Regards
Ankur Jain
Technical Architect – Big Data | IoT | Innovation Group
Board: +91-731-663-6363
Direct: +91-731-663-6125
www.yash.com


[PySpark] py4j.Py4JException: PythonFunction Does Not Exist

2017-01-09 Thread Tegan Snyder
Hello,

I recently setup a small 3 cluster setup of Spark on an existing Hadoop 
installation. I’m running into an error message when attempting to use the 
pyspark shell. I can reproduce the error in the pyspark shell the with the 
following example:


from operator import add
text = sc.textFile("shakespeare.txt")
def tokenize(text):
    return text.split()

words = text.flatMap(tokenize)
print(words)



Once the above code is executed I receive the following error message:


Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/rdd.py", line 199, in __repr__
return self._jrdd.toString()
  File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
self._jrdd_deserializer, profiler)
  File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
sc.pythonVer, broadcast_vars, sc._javaAccumulator)
  File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 
1414, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in 
get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling 
None.org.apache.spark.api.python.PythonFunction. Trace:
py4j.Py4JException: Constructor 
org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, 
class java.util.ArrayList, class java.lang.String, class java.lang.String, 
class java.util.ArrayList, class 
org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
at 
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at 
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at 
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


I’m running the following versions:

Spark Version: 2.0.2
Hadoop Version: 2.7.3
Python Version: 3.5.2
Scala: 2.12
OS Version – RHEL 7.2

It appears that the python py4j gateway is having trouble communicating with 
“org.apache.spark.api.python.PythonFunction”. Is there any debugging that I can 
do on my end to figure out what may be causing that?

For reference, using the spark-shell and Scala works as expected.

Thanks for your time.

-Tegan



Re: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud

2017-01-09 Thread Dinko Srkoč
Not knowing what the code that handles those arguments looks like, I
would put the following in the "Arguments" field when submitting a Dataproc job:

--trainFile=gs://Anahita/small_train.dat
--testFile=gs://Anahita/small_test.dat
--numFeatures=9947
--numRounds=100

... providing you still keep those files in the "Anahita" bucket.

Each line in the "Arguments" field ends up as an element of the `args`
argument (an Array) of the method `main`.
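(Purely as a guess at the consuming side, arguments passed this way show up in the job's
`main` roughly like this; the object and option names are hypothetical.)

object TrainJob {
  def main(args: Array[String]): Unit = {
    // Each line in the "Arguments" field arrives as one element of `args`,
    // e.g. "--trainFile=gs://Anahita/small_train.dat".
    val opts: Map[String, String] = args
      .filter(_.startsWith("--"))
      .map(_.stripPrefix("--").split("=", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

    val trainFile   = opts("trainFile")
    val numFeatures = opts.getOrElse("numFeatures", "9947").toInt
    println(s"training on $trainFile with $numFeatures features")
  }
}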

Cheers,
Dinko

On 9 January 2017 at 13:43, Anahita Talebi  wrote:
> Dear friends,
>
> I am trying to run a Spark job on Google Cloud using submit job.
> https://cloud.google.com/dataproc/docs/tutorials/spark-scala
>
> My question is about the "argument" part.
> In my Spark code, there are some variables whose values are defined in a
> shell file (.sh), as follows:
>
> --trainFile=small_train.dat \
> --testFile=small_test.dat \
> --numFeatures=9947 \
> --numRounds=100 \
>
>
> - I have tried to enter only the values and each value in a separate box as
> following but it is not working:
>
> data/small_train.dat
> data/small_test.dat
> 9947
> 100
>
> I have also tried to give the parameters as below, but it is not
> working either:
> trainFile=small_train.dat
> testFile=small_test.dat
> numFeatures=9947
> numRounds=100
>
> I added the files small_train.dat and small_test.dat in the same bucket
> where I saved the .jar file. Let's say if my bucket is named Anahita, I
> added spark.jar, small_train.dat and small_test.dat in the bucket "Anahita".
>
>
> Does anyone know, how I can enter these values in the argument part?
>
> Thanks in advance,
> Anahita
>




Re: Tuning spark.executor.cores

2017-01-09 Thread Aaron Perrin
That setting defines the total number of tasks that an executor can run in
parallel.

Each node is partitioned into executors, each with identical heap and
cores. So, it can be a balancing act to optimally set these values,
particularly if the goal is to maximize CPU usage with memory and other IO.

For example, let's say your node has 1 TB memory and 100 cores. A general
rule of thumb is to keep JVM heap below 50-60 GB. So, you could partition
the node into maybe 20 executors, each with  around 50 GB memory and 5
cores. You then can run your job and monitor resource usage. If you find
that the processing is memory or CPU or IO bound, you may modify the
resource allocation appropriately.

That said, it can be time consuming to optimize these values, and in many
cases it's cheaper to just increase the number of nodes or the size of each
node. Of course, there are lots of factors in play.
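As a hedged sketch, the example split above would correspond to configuration along
these lines (the numbers are just the ones from the example, not a recommendation, and
they are normally passed via spark-submit --conf rather than set in code):

import org.apache.spark.SparkConf

// 100-core / 1 TB node split into 20 executors of 5 cores and ~50 GB heap each.
val conf = new SparkConf()
  .set("spark.executor.instances", "20") // with YARN / static allocation
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "50g")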


On Mon, Jan 9, 2017 at 8:52 AM Appu K  wrote:

> Are there use-cases for which it is advisable to give a value greater than
> the actual number of cores to spark.executor.cores ?
>


What is the difference between hive on spark and spark on hive?

2017-01-09 Thread 李斌松
What is the difference between hive on spark and spark on hive?


Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi Takeshi,

Thanks for the answer. My UDAF aggregates data into an array of rows.

Apparently this makes it ineligible to using Hash-based aggregate based on
the logic at:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java#L74
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L108

The list of supported data types is VERY limited, unfortunately.

It doesn't make sense to me that the data type must be mutable for the UDAF to
use a hash-based aggregate, but I could be missing something here :). I could
achieve a hash-based aggregate by turning this query into RDD mode, but that is
counter-intuitive IMO.
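To make the restriction concrete, here is a toy UDAF whose buffer holds only a
fixed-width mutable type (LongType) and therefore stays eligible for the hash-based
path; it only illustrates the type restriction discussed above, it is not a workaround
for an array-of-rows buffer:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class LongSum extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)

  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}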

---
Regards,
Andy

On Mon, Jan 9, 2017 at 2:05 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Spark always uses hash-based aggregates if the types of aggregated data
> are supported there;
> otherwise, spark fails to use hash-based ones, then it uses sort-based
> ones.
> See: https://github.com/apache/spark/blob/master/sql/
> core/src/main/scala/org/apache/spark/sql/execution/
> aggregate/AggUtils.scala#L38
>
> So, I'm not sure about your query though, it seems the types of aggregated
> data in your query
> are not supported for hash-based aggregates.
>
> // maropu
>
>
>
> On Mon, Jan 9, 2017 at 10:52 PM, Andy Dang  wrote:
>
>> Hi all,
>>
>> It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
>> which is very inefficient for certain aggregation:
>>
>> The code is very simple:
>> - I have a UDAF
>> - What I want to do is: dataset.groupBy(cols).agg(udaf).count()
>>
>> The physical plan I got was:
>> *HashAggregate(keys=[], functions=[count(1)], output=[count#67L])
>> +- Exchange SinglePartition
>>+- *HashAggregate(keys=[], functions=[partial_count(1)],
>> output=[count#71L])
>>   +- *Project
>>  +- Generate explode(internal_col#31), false, false,
>> [internal_col#42]
>> +- SortAggregate(key=[key#0], functions=[aggregatefunction(key#0,
>> nested#1, nestedArray#2, nestedObjectArray#3, value#4L,
>> com.[...]uDf@108b121f, 0, 0)], output=[internal_col#31])
>>+- *Sort [key#0 ASC], false, 0
>>   +- Exchange hashpartitioning(key#0, 200)
>>  +- SortAggregate(key=[key#0],
>> functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2,
>> nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)],
>> output=[key#0,internal_col#37])
>> +- *Sort [key#0 ASC], false, 0
>>+- Scan ExistingRDD[key#0,nested#1,nes
>> tedArray#2,nestedObjectArray#3,value#4L]
>>
>> How can I make Spark to use HashAggregate (like the count(*) expression)
>> instead of SortAggregate with my UDAF?
>>
>> Is it intentional? Is there an issue tracking this?
>>
>> ---
>> Regards,
>> Andy
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Takeshi Yamamuro
Hi,

Spark always uses hash-based aggregates if the types of the aggregated data are
supported there; otherwise, Spark cannot use hash-based ones and falls back to
sort-based ones.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L38

So, while I'm not sure about your query, it seems the types of the aggregated
data in your query are not supported for hash-based aggregates.

// maropu



On Mon, Jan 9, 2017 at 10:52 PM, Andy Dang  wrote:

> Hi all,
>
> It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
> which is very inefficient for certain aggregation:
>
> The code is very simple:
> - I have a UDAF
> - What I want to do is: dataset.groupBy(cols).agg(udaf).count()
>
> The physical plan I got was:
> *HashAggregate(keys=[], functions=[count(1)], output=[count#67L])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#71L])
>       +- *Project
>          +- Generate explode(internal_col#31), false, false, [internal_col#42]
>             +- SortAggregate(key=[key#0], functions=[aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[internal_col#31])
>                +- *Sort [key#0 ASC], false, 0
>                   +- Exchange hashpartitioning(key#0, 200)
>                      +- SortAggregate(key=[key#0], functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[key#0,internal_col#37])
>                         +- *Sort [key#0 ASC], false, 0
>                            +- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]
>
> How can I make Spark use HashAggregate (like the count(*) expression)
> instead of SortAggregate with my UDAF?
>
> Is it intentional? Is there an issue tracking this?
>
> ---
> Regards,
> Andy
>



-- 
---
Takeshi Yamamuro


How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi all,

It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations:

The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()

The physical plan I got was:
*HashAggregate(keys=[], functions=[count(1)], output=[count#67L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#71L])
      +- *Project
         +- Generate explode(internal_col#31), false, false, [internal_col#42]
            +- SortAggregate(key=[key#0], functions=[aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[internal_col#31])
               +- *Sort [key#0 ASC], false, 0
                  +- Exchange hashpartitioning(key#0, 200)
                     +- SortAggregate(key=[key#0], functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2, nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)], output=[key#0,internal_col#37])
                        +- *Sort [key#0 ASC], false, 0
                           +- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]

How can I make Spark use HashAggregate (like the count(*) expression)
instead of SortAggregate with my UDAF?

Is it intentional? Is there an issue tracking this?

---
Regards,
Andy


Tuning spark.executor.cores

2017-01-09 Thread Appu K
Are there use cases for which it is advisable to set spark.executor.cores to a
value greater than the actual number of cores?
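For reference, the setting in question is an ordinary configuration key; a
minimal sketch of setting it programmatically (the values are placeholders for
illustration, not a tuning recommendation):

import org.apache.spark.sql.SparkSession

// spark.executor.cores caps how many tasks can run concurrently inside each executor.
val spark = SparkSession.builder()
  .appName("executor-cores-example")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "6g")
  .getOrCreate()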


Fwd: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud

2017-01-09 Thread Anahita Talebi
Dear friends,

I am trying to run a run a spark code on Google cloud using submit job.
https://cloud.google.com/dataproc/docs/tutorials/spark-scala

My question is about the "argument" part.
In my Spark code, there are some variables whose values are defined in
a shell file (.sh), as follows:

--trainFile=small_train.dat \
--testFile=small_test.dat \
--numFeatures=9947 \
--numRounds=100 \


- I have tried to enter only the values, each value in a separate box, as
follows, but it is not working:

data/small_train.dat
data/small_test.dat
9947
100

I have also tried to give the parameters as below, but that is not
working either:
trainFile=small_train.dat
testFile=small_test.dat
numFeatures=9947
numRounds=100

I added the files small_train.dat and small_test.dat to the same bucket
where I saved the .jar file. For example, if my bucket is named Anahita, I
added spark.jar, small_train.dat and small_test.dat to the bucket "Anahita".


Does anyone know how I can enter these values in the argument part?

Thanks in advance,
Anahita
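Assuming the jar's main class expects flags of the form --key=value (as the
shell script suggests), each flag would normally be entered in its own
argument box exactly as it would appear on the command line, e.g.
--trainFile=gs://anahita/small_train.dat. A minimal sketch of such a main
class (the object name and gs:// paths are hypothetical, not the actual
project code):

object TrainJob {
  def main(args: Array[String]): Unit = {
    // Collect "--key=value" arguments into a simple map.
    val opts: Map[String, String] = args.collect {
      case arg if arg.startsWith("--") && arg.contains("=") =>
        val Array(k, v) = arg.stripPrefix("--").split("=", 2)
        k -> v
    }.toMap

    val trainFile   = opts("trainFile")          // e.g. gs://anahita/small_train.dat
    val testFile    = opts("testFile")           // e.g. gs://anahita/small_test.dat
    val numFeatures = opts("numFeatures").toInt  // e.g. 9947
    val numRounds   = opts("numRounds").toInt    // e.g. 100

    // ... build the SparkSession and run the training here ...
  }
}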


Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hello Jain,

I would recommend using the Spark MLlib
(and ML) APIs of *Spark 2.1.0*,
which provide the following features:

   - ML Algorithms: common learning algorithms such as classification,
   regression, clustering, and collaborative filtering
   - Featurization: feature extraction, transformation, dimensionality
   reduction, and selection
   - Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
   - Persistence: saving and loading algorithms, models, and Pipelines
   - Utilities: linear algebra, statistics, data handling, etc.

These features will help make your machine learning both scalable and easy to build.
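As a minimal sketch of the Pipelines and Persistence points above (the toy
data, column names and save path are made up for illustration, and an active
SparkSession named `spark` is assumed):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical toy training data.
val training = spark.createDataFrame(Seq(
  (0L, "spark is scalable", 1.0),
  (1L, "nothing to see here", 0.0)
)).toDF("id", "text", "label")

// Featurization stages chained with an ML algorithm into a Pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Persistence: the fitted model (and the pipeline itself) can be saved and reloaded.
model.write.overwrite().save("/tmp/spark-lr-pipeline-model")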



Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


On 9 January 2017 at 10:19, Ankur Jain  wrote:

> Hi Team,
>
>
>
> I want to start a new project with ML, but wanted to know which version of
> Spark is more stable and has more features w.r.t. ML.
>
> Please suggest your opinion…
>
>
>
> Thanks in Advance…
>
>
>
> *Thanks & Regards*
>
> Ankur Jain
>
> Technical Architect – Big Data | IoT | Innovation Group
>
> Board: +91-731-663-6363
>
> Direct: +91-731-663-6125
>
> *www.yash.com*
>
>


Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Santosh.B
Yes, it provides that, but from what I have seen it is a line-by-line update.
Please see the link below:
https://gist.github.com/QwertyManiac/4724582

This is very slow because of the Avro append; I am thinking of something like
what we normally do for text files, where we buffer the data up to a size and
then flush the buffer.
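One hedged alternative, if switching from saveAsNewAPIHadoopFile to the
DataFrame writer is an option, is to avoid appending to a single file
altogether and let each batch write new part-files under the day's partition.
A sketch assuming the com.databricks:spark-avro package is on the classpath;
the "day" column and the HDFS path are hypothetical:

import org.apache.spark.sql.{DataFrame, SaveMode}

def writeBatch(batchDF: DataFrame): Unit = {
  batchDF.write
    .mode(SaveMode.Append)                  // append adds new files; it never rewrites old ones
    .partitionBy("day")                     // one sub-directory per day
    .format("com.databricks.spark.avro")
    .save("hdfs:///data/events_avro")
}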





On Mon, Jan 9, 2017 at 3:17 PM, Jörn Franke  wrote:

> Avro itself supports it, but I am not sure if this functionality is
> available through the Spark API. Just out of curiosity: if your use case is
> only writing to HDFS, then you might simply use Flume.
>
> On 9 Jan 2017, at 09:58, awkysam  wrote:
>
> Currently for our project we are collecting data and pushing it into Kafka,
> with messages in Avro format. We need to push this data into HDFS and
> we are using Spark Streaming, and in HDFS it is also stored in Avro format.
> We are partitioning the data per day. So when we write data into HDFS
> we need to append to the same file. Currently we are using
> GenericRecordWriter and we will be using saveAsNewAPIHadoopFile for writing
> into HDFS. Is there a way to append data to a file in HDFS in Avro format
> using saveAsNewAPIHadoopFile? Thanks, Santosh B
> --
> View this message in context: AVRO Append HDFS using
> saveAsNewAPIHadoopFile
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>


Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Hi Team,

I want to start a new project with ML, but wanted to know which version of
Spark is more stable and has more features w.r.t. ML.
Please suggest your opinion...

Thanks in Advance...


Thanks & Regards
Ankur Jain
Technical Architect - Big Data | IoT | Innovation Group
Board: +91-731-663-6363
Direct: +91-731-663-6125
www.yash.com



Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Jörn Franke
Avro itself supports it, but I am not sure if this functionality is available
through the Spark API. Just out of curiosity: if your use case is only writing
to HDFS, then you might simply use Flume.

> On 9 Jan 2017, at 09:58, awkysam  wrote:
> 
> Currently for our project we are collecting data and pushing it into Kafka with
> messages in Avro format. We need to push this data into HDFS and we are
> using Spark Streaming, and in HDFS it is also stored in Avro format. We are
> partitioning the data per day. So when we write data into HDFS we need
> to append to the same file. Currently we are using GenericRecordWriter and we
> will be using saveAsNewAPIHadoopFile for writing into HDFS. Is there a way to
> append data to a file in HDFS in Avro format using saveAsNewAPIHadoopFile?
> Thanks, Santosh B
> View this message in context: AVRO Append HDFS using saveAsNewAPIHadoopFile
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread awkysam
Currently for our project we are collecting data and pushing it into Kafka with
messages in Avro format. We need to push this data into HDFS and we are
using Spark Streaming, and in HDFS it is also stored in Avro format. We are
partitioning the data per day, so when we write data into HDFS we need
to append to the same file. Currently we are using GenericRecordWriter and
we will be using saveAsNewAPIHadoopFile for writing into HDFS. Is there a way
to append data to a file in HDFS in Avro format using saveAsNewAPIHadoopFile?
Thanks, Santosh B



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/AVRO-Append-HDFS-using-saveAsNewAPIHadoopFile-tp28292.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-09 Thread Appu K
That explains it.  Appreciate the help Hyukjin!

Thank you

On 9 January 2017 at 1:08:02 PM, Hyukjin Kwon (gurwls...@gmail.com) wrote:

Hi Appu,


I believe that textFile and filter came from...

https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61


It needs to read the first line even when the header is disabled and schema
inference is disabled, because a default string schema is still needed: one
with the same number of fields as the first row, with columns named "_c#",
where # is the field's position, whenever the schema is not specified manually.

I believe that extra job would not happen if the schema were explicitly given.
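For example, a minimal sketch with an explicit (hypothetical) schema, assuming
the spark-shell SparkSession named `spark`; the real file's columns may differ:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("timestamp", StringType),
  StructField("site", StringType),
  StructField("requests", LongType)
))

// With the schema supplied up front, Spark should not need to peek at the first
// line, so merely defining the DataFrame should not trigger a job.
val baseDF = spark.read
  .option("sep", "\t")   // the file name suggests it is tab-separated
  .schema(schema)
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")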


I hope this is helpful


Thanks.

2017-01-09 0:11 GMT+09:00 Appu K :

> I was trying to create a base-data-frame in an EMR cluster from a csv file
> using
>
> val baseDF = spark.read.csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")
>
> I omitted the options to infer the schema and specify the header, just to
> understand what happens behind the scenes.
>
>
> The Spark UI shows that this kicked off a job with one stage. The stage
> shows that a filter was applied.
>
> I got a little curious about this. Is there any place where I could better
> understand why a filter was applied here and why there was an action
> in this case?
>
>
> thanks
>
>

