Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing!
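For anyone curious what this looks like in practice, here is a rough sketch pieced together from the project README -- treat the exact class and method names (SparkAI, activate, df.ai.transform) as illustrative rather than authoritative, and note that it needs an LLM/API key configured:

    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.getOrCreate()

    # Hook the English SDK into the session; by default it talks to an
    # external LLM, so the relevant API key must be set in the environment.
    spark_ai = SparkAI()
    spark_ai.activate()

    df = spark.createDataFrame(
        [("widget", 3), ("gadget", 5)], ["product", "sales"])

    # Describe the transformation in English; the SDK generates and runs
    # the corresponding Spark code behind the scenes.
    result = df.ai.transform("order the products by sales, highest first")
    result.show()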

On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang  wrote:

> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK.
> Powered by Generative AI, the English SDK allows you to execute complex
> tasks with simple English instructions. This exciting news was announced
> recently at the Data+AI Summit and also introduced through a detailed blog
> post.
>
> Now, we need your invaluable feedback and contributions. The aim of the
> English SDK is not only to simplify and enrich your Apache Spark experience
> but also to grow with the community. We're calling upon Spark developers
> and users to explore this innovative tool, offer your insights, provide
> feedback, and contribute to its evolution.
>
> You can find more details about the SDK and usage examples on the GitHub
> repository https://github.com/databrickslabs/pyspark-ai/. If you have any
> feedback or suggestions, please feel free to open an issue directly on the
> repository. We are actively monitoring the issues and value your insights.
>
> We also welcome pull requests and are eager to see how you might extend or
> refine this tool. Let's come together to continue making Apache Spark more
> approachable and user-friendly.
>
> Thank you in advance for your attention and involvement. We look forward
> to hearing your thoughts and seeing your contributions!
>
> Best,
> Gengliang Wang
>


Re: Complexity with the data

2022-05-25 Thread Gavin Ray
Forgot to reply-all last message, whoops. Not very good at email.

You need to normalize the CSV with a parser that can escape commas inside
of strings. Not sure if Spark has an option for this?
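If it helps, here's a rough PySpark sketch of one way to normalize the file
before parsing -- assuming every real record starts with a yyyy-MM-dd date
(which looks true for your sample) and that the files are small enough to
read whole:

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read each file as a single string so embedded newlines are visible.
    raw = spark.sparkContext.wholeTextFiles("path").values()

    def merge_continuation_lines(text):
        # Any line that does NOT start with a yyyy-MM-dd date is treated as
        # a continuation of the previous record and folded back onto it.
        fixed = re.sub(r"\n(?!\d{4}-\d{2}-\d{2},)", " ", text)
        return fixed.splitlines()

    lines = raw.flatMap(merge_continuation_lines)

    # Every record is one physical line again, so the CSV reader works.
    # Add header=True here if your real file has a header row.
    df = spark.read.csv(lines, inferSchema=True)
    df.show(truncate=False)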


On Wed, May 25, 2022 at 4:37 PM Sid  wrote:

> Thank you so much for your time.
>
> I have data like below, which I tried to load by setting multiple options
> while reading the file; however, I am not able to consolidate the
> 9th column data within itself.
>
> [image: image.png]
>
> I tried the below code:
>
> df = spark.read.option("header", "true").option("multiline",
> "true").option("inferSchema", "true").option("quote",
>
> '"').option(
> "delimiter", ",").csv("path")
>
> What else can I do?
>
> Thanks,
> Sid
>
>
> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Dear Sid,
>>
>> can you please give us more info? Is it true that every line may have a
>> different number of columns? Is there any rule followed by every line of
>> the file? From the information you have sent I cannot fully understand
>> the "schema" of your data.
>>
>> Regards,
>>
>> Apostolos
>>
>>
>> On 25/5/22 23:06, Sid wrote:
>> > Hi Experts,
>> >
>> > I have below CSV data that is getting generated automatically. I can't
>> > change the data manually.
>> >
>> > The data looks like below:
>> >
>> > 2020-12-12,abc,2000,,INR,
>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>> >
>> > Since I don't have much experience with the other domains.
>> >
>> > It is handled by the other people.,INR
>> > 2020-12-12,abc,2000,,USD,
>> >
>> > The third record is a problem, since the user inserted a new line in the
>> > value while filling up the form. So, how do I handle this?
>> >
>> > There are 6 columns and 4 records in total. These are the sample
>> records.
>> >
>> > Should I load it as an RDD and then maybe use a regex to eliminate the
>> > new lines? Or how should it be done? With ". \n"?
>> >
>> > Any suggestions?
>> >
>> > Thanks,
>> > Sid
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: [Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-18 Thread Gavin Ray
Following up on this in case anyone runs across it in the archives in the
future.
From reading through the config docs and trying various combinations, I've
discovered that:

- You don't want to disable codegen. In basic testing, disabling it roughly
doubled the time to perform simple, few-column/few-row queries.
  - You can test this by setting an internal property after setting
"spark.testing" to "true" in the system properties:


> System.setProperty("spark.testing", "true")
> val spark = SparkSession.builder()
>   .config("spark.sql.codegen.wholeStage", "false")
>   .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
>

-  The following gave the best performance. I don't know if enabling CBO
did much.

> val spark = SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.kryo.unsafe", "true")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.cbo.enabled", "true")
>   .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
>   .config("spark.sql.cbo.joinReorder.enabled", "true")
>   .config("spark.sql.cbo.planStats.enabled", "true")
>   .config("spark.sql.cbo.starSchemaDetection", "true")
>   .getOrCreate()


If you're running on a more recent JDK, you'll need to set "--add-opens"
flags for a few java.base packages for "spark.kryo.unsafe" to work.
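For example, on JDK 17 that means something along the lines of
--add-opens=java.base/java.nio=ALL-UNNAMED and
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED, passed through
spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. The exact
set depends on your JDK and Spark version, so treat those as a starting
point rather than the definitive list.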



On Mon, May 16, 2022 at 12:55 PM Gavin Ray  wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and
> Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one-or-more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as minimal latency as
> possible to
> see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio).
> Few rows would be returned in the reads (something on the order of
> 1-to-1,000).
>
> My question is: What configuration settings would you want to use for
> something
> like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not be
> worth the cost, so maybe you'd want to disable that and do interpretation?
>
> And possibly you'd want to use query plan config/rules that reduce the time
> spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will
> never work well").
>
> Thank you =)
>


[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

2022-05-18 Thread Gavin Ray
I did some basic testing of multi-source queries with the most recent Spark:
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115

The output of "spark.time()" surprised me:

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1

+---+----+---+------+
| id|name| id| title|
+---+----+---+------+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---+----+---+------+
Time taken: 168 ms

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2
LIMIT 1

+---+-----+---+------+
| id| name| id| title|
+---+-----+---+------+
|  2|Alice|  3|Todo 3|
+---+-----+---+------+
Time taken: 228 ms


Calcite and Teiid manage to do this on the order of 5-50 ms for basic
queries, so I'm curious about the technical specifics of why Spark appears
to be so much slower here?
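For anyone trying to reproduce this without digging through the linked
file: the two sources can be exposed as catalogs via the DataSource V2 JDBC
catalog, roughly as below. The URLs, databases and driver here are
placeholders; the linked Application.scala is the authoritative setup, and
the JDBC driver jar needs to be on the classpath (e.g. via --jars).

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.sql.catalog.db1",
                "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
        .config("spark.sql.catalog.db1.url", "jdbc:postgresql://localhost:5432/db1")
        .config("spark.sql.catalog.db1.driver", "org.postgresql.Driver")
        .config("spark.sql.catalog.db2",
                "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
        .config("spark.sql.catalog.db2.url", "jdbc:postgresql://localhost:5432/db2")
        .config("spark.sql.catalog.db2.driver", "org.postgresql.Driver")
        .getOrCreate())

    spark.sql("""
        SELECT p.id, p.name, t.id, t.title
        FROM db1.public.person p
        JOIN db2.public.todos t ON p.id = t.person_id
        WHERE p.id = 1
    """).show()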


[Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-16 Thread Gavin Ray
Hi all,

I've not got much experience with Spark, but have been reading the Catalyst
and
Datasources V2 code/tests to try to get a basic understanding.

I'm interested in trying Catalyst's query planner + optimizer for queries
spanning one-or-more JDBC sources.

Somewhat unusually, I'd like to do this with as minimal latency as possible
to
see what the experience for standard line-of-business apps is like (~90/10
read/write ratio).
Few rows would be returned in the reads (something on the order of
1-to-1,000).

My question is: What configuration settings would you want to use for
something
like this?

I imagine that doing codegen/JIT compilation of the query plan might not be
worth the cost, so maybe you'd want to disable that and do interpretation?

And possibly you'd want to use query plan config/rules that reduce the time
spent in planning, trading efficiency for latency?

Does anyone know how you'd configure Spark to test something like this?

Would greatly appreciate any input (even if it's "This is a bad idea and
will
never work well").

Thank you =)


unsubscribe

2022-05-02 Thread Ray Qiu



Re: No SparkR on Mesos?

2016-09-07 Thread ray
Hi, Rodrick,

Interesting. SparkR is not expected to work with Mesos due to lack of support 
for Mesos in some places, and it has not been tested yet.

Have you modified the Spark source code yourself? Have you deployed the Spark 
binary distribution on all slave nodes and set “spark.mesos.executor.home” to 
point to it?
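(For reference, that setting is typically passed at submit time, e.g.
--conf spark.mesos.executor.home=/opt/spark, assuming the distribution is
unpacked at that path on every agent; the path here is just an example.)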

It would be cool if you could contribute a patch :)

From:   on behalf of 
Rodrick Brown
Date:  Thursday, September 8, 2016 at 09:46
To:  Peter Griessl
Cc:  "user@spark.apache.org"
Subject:  Re: No SparkR on Mesos?

We've been using SparkR on Mesos for quite some time with no issues.


[fedora@prod-rstudio-1 ~]$ /opt/spark-1.6.1/bin/sparkR

R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Launching java with spark-submit command /opt/spark-1.6.1/bin/spark-submit   
"sparkr-shell" /tmp/Rtmphk5zxe/backend_port11f8414240b65
16/09/08 01:44:04 INFO SparkContext: Running Spark version 1.6.1
16/09/08 01:44:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/09/08 01:44:05 INFO SecurityManager: Changing view acls to: fedora
16/09/08 01:44:05 INFO SecurityManager: Changing modify acls to: fedora
16/09/08 01:44:05 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(fedora); users 
with modify permissions: Set(fedora)
16/09/08 01:44:05 INFO Utils: Successfully started service 'sparkDriver' on 
port 39193.
16/09/08 01:44:05 INFO Slf4jLogger: Slf4jLogger started
16/09/08 01:44:05 INFO Remoting: Starting remoting
16/09/08 01:44:05 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriverActorSystem@172.1.34.13:44212]
16/09/08 01:44:05 INFO Utils: Successfully started service 
'sparkDriverActorSystem' on port 44212.
16/09/08 01:44:05 INFO SparkEnv: Registering MapOutputTracker
16/09/08 01:44:05 INFO SparkEnv: Registering BlockManagerMaster
16/09/08 01:44:05 INFO DiskBlockManager: Created local directory at 
/home/fedora/spark-tmp-73604/blockmgr-2928edf7-635e-45ca-83ed-8dc1de50b141
16/09/08 01:44:05 INFO MemoryStore: MemoryStore started with capacity 3.4 GB
16/09/08 01:44:05 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/08 01:44:05 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/09/08 01:44:05 INFO SparkUI: Started SparkUI at http://172.1.34.13:4040
16/09/08 01:44:06 INFO Executor: Starting executor ID driver on host localhost
16/09/08 01:44:06 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 45678.
16/09/08 01:44:06 INFO NettyBlockTransferService: Server created on 45678
16/09/08 01:44:06 INFO BlockManager: external shuffle service port = 31338
16/09/08 01:44:06 INFO BlockManagerMaster: Trying to register BlockManager
16/09/08 01:44:06 INFO BlockManagerMasterEndpoint: Registering block manager 
localhost:45678 with 3.4 GB RAM, BlockManagerId(driver, localhost, 45678)
16/09/08 01:44:06 INFO BlockManagerMaster: Registered BlockManager

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/


 Spark context is available as sc, SQL context is available as sqlContext
>



On Wed, Sep 7, 2016 at 8:02 AM, Peter Griessl  wrote:
Hello,

 

does SparkR really not work (yet?) on Mesos (Spark 2.0 on Mesos 1.0)?

 

$ /opt/spark/bin/sparkR

 

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"

Copyright (C) 2016 The R Foundation for Statistical Computing

Platform: x86_64-pc-linux-gnu (64-bit)

Launching java with spark-submit command /opt/spark/bin/spark-submit   
"sparkr-shell" /tmp/RtmpPYVJxF/backend_port338581f434

Error: SparkR is not supported for Mesos cluster.

Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  :

  JVM is not ready after 10 seconds

 

 

I couldn’t find any information on this subject in the docs – am I missing 
something?

 

Thanks for any hints,

Peter



-- 
Rodrick Brown / DevOPs

9174456839 / rodr...@orchardplatform.com

Orchard Platform 
101 5th Avenue, 4th Floor, New York, NY


Comparison between Standalone mode and YARN mode

2015-07-22 Thread Dogtail Ray
Hi,

I am very curious about the differences between Standalone mode and YARN
mode. According to
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/,
it seems that YARN mode is always better than Standalone mode. Is that the
case? Or should I choose different modes according to my specific
requirements? Thanks!


How to build Spark with my own version of Hadoop?

2015-07-21 Thread Dogtail Ray
Hi,

I have modified some Hadoop code and want to build Spark with the modified
version of Hadoop. Do I need to change the compilation dependency files?
If so, how? Thanks a lot!
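A rough recipe for this kind of setup (for the Spark 1.x Maven build; adjust
the profile and version to whatever you gave your Hadoop build --
"2.4.1-custom" below is just a placeholder):

    # In your modified Hadoop tree: publish the build to the local Maven repo
    mvn install -DskipTests

    # In the Spark tree: build against that version
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1-custom -DskipTests clean package

In most cases no dependency files need editing, as long as the version you
pass matches the one you installed locally.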


Re: init / shutdown for complex map job?

2014-12-28 Thread Ray Melton
A follow-up to the blog cited below was hinted at, per "But Wait,
There's More ... To keep this post brief, the remainder will be left to
a follow-up post."

Is this follow-up pending?  Is it sort of pending?  Did the follow-up
happen, but I just couldn't find it on the web?

Regards, Ray.


On Sun, 28 Dec 2014 08:54:13 +
Sean Owen  wrote:

> You can't quite do cleanup in mapPartitions in that way. Here is a
> bit more explanation (farther down):
> http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
> On Dec 28, 2014 8:18 AM, "Akhil Das" 
> wrote:
> 
> > Something like?
> >
> > val a = myRDD.mapPartitions(p => {
> >
> >
> >
> > //Do the init
> >
> > //Perform some operations
> >
> > //Shut it down?
> >
> >  })
>
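For anyone who lands here later: the gist of the caveat in Sean's link is
that the function passed to mapPartitions returns an iterator lazily, so
naive "shutdown" code placed after the processing runs before the partition
is actually consumed. A rough PySpark illustration of tying cleanup to the
iterator instead (open_connection, handle and my_rdd are placeholders, not
from the blog):

    def process_partition(records):
        conn = open_connection()            # placeholder per-partition setup
        try:
            for record in records:
                yield handle(conn, record)  # placeholder per-record work
        finally:
            # Runs only once the iterator has been fully consumed, i.e.
            # after the partition has actually been processed.
            conn.close()

    result = my_rdd.mapPartitions(process_partition)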

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
Hi Xiangrui,

I am using yarn-cluster mode. The current Hadoop cluster is configured to
only accept "yarn-cluster" mode and not allow "yarn-client" mode. I have no
privilege to change that.

Without "k-means||" initialization, the job finished in 10 minutes. With
"k-means||", it just hangs there for almost 1 hour.

I guess I can only go with "random" initialization in KMeans.

Thanks again for your help.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui,

Thanks for the guidance. I read the log carefully and found the root cause. 

KMeans, by default, uses k-means|| (a parallel k-means++ variant) as the
initialization mode. According to the log file, the 70-minute hang is
actually the computing time of that initialization, as pasted below:

14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at
KMeans.scala:293) finished in 2.233 s
14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at
KMeans.scala:293, took 85.590020124 s
14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle
5 for deleting
14/10/14 *14:48:18* INFO ContextCleaner: Cleaned shuffle 5
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
*14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11
iterations.
14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913
seconds.*
14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at
KMeans.scala:190
14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at
KMeans.scala:190)
14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at
KMeans.scala:190) with 100 output partitions (allowLocal=false)
14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at
KMeans.scala:190)



I now use "random" as the Kmeans initialization mode, and other confs remain
the same. This time, it just finished quickly~~
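(For reference, with the 1.x MLlib API that looks roughly like the following
in PySpark -- the parameter values mirror the ones mentioned in this thread,
and `vectors` stands in for the RDD of feature vectors used in this job:

    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(vectors, k=100, maxIterations=2,
                         initializationMode="random")
)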

In your test on mnist8m, did you use KMeans++ as the initialization mode?
How long did it take?

Thanks again for your help.

Ray







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Burak,

In Kmeans, I used k_value = 100, num_iteration = 2, and num_run = 1.

In the current test, I increase num-executors = 200. In the storage info 2
(as shown below), 11 executors are  used (I think the data is kind of
balanced) and others have zero memory usage. 
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n16438/spark_storage.png>
 


Currently, there is no active stage running, just as the first image I
posted in the first place. You mentioned "It seems that it is hanging, but
there is a lot of calculation going on." I thought if some calculation is
going on, there would be an active stage with an incomplete progress bar in
the UI. Am I wrong? 


Thanks, Burak!

Ray



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui,

The input dataset has 1.5 million sparse vectors. Each sparse vector has a
dimension (cardinality) of 9153 and fewer than 15 nonzero elements.


Yes, if I set num-executors = 200, from the hadoop cluster scheduler, I can
see the application got  201 vCores. From the spark UI, I can see it got 201
executors (as shown below).

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_core.png>
  

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_executor.png>
 



Thanks.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys,

An interesting thing: for the input dataset, which has 1.5 million vectors,
if I set KMeans's k_value = 100 or k_value = 50, it hangs as mentioned
above. However, if I decrease k_value to 10, the same error still appears in
the log but the application finishes successfully, without observable
hanging.

Hopefully this provides more information.

Thanks.

Ray



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys,

I am new to Spark. When I run Spark KMeans
(org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works
great. However, when using a large dataset with 1.5 million vectors, it just
hangs there at some reduceByKey/collectAsMap stages (the attached image
shows the corresponding UI).

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n16413/spark.png> 



In the log file, I can see the errors below:

14/10/14 13:04:30 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/10/14 13:04:30 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4aeed0e6
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at
java.nio.channels.SelectionKey.isConnectable(SelectionKey.java:336)
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:352)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/10/14 13:04:30 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(server_name_here,32936)
java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at
org.apache.spark.network.SendingConnection.read(Connection.scala:397)
at
org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:176)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/14 13:04:30 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@2d584a4e
14/10/14 13:04:30 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2d584a4e
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/10/14 13:04:30 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@server_name_here:44765] ->
[akka.tcp://spark@server_name_here:46406] disassociated! Shutting down.




Regarding the above errors, I searched online and tried increasing the
following confs, but it still did not work.

spark.worker.timeout=3 

spark.akka.timeout=3 
spark.akka.retry.wait=3 
spark.akka.frameSize=1

spark.storage.blockManagerHeartBeatMs=3  

--driver-memory "2g"
--executor-memory "2g"
--num-executors 100



I am running spark-submit on YARN. The Spark version is 1.1.0,  and Hadoop
is 2.4.1.

Could you please share some comments/insights?

Thanks a lot.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark cluster spanning multiple data centers

2014-07-23 Thread Ray Qiu
Hi,

Is it feasible to deploy a Spark cluster spanning multiple data centers if
there are fast connections with not-too-high latency (30 ms) between them? I
don't know whether there are any assumptions in the software that expect all
cluster nodes to be local (super low latency, for instance). Has anyone
tried this?

Thanks,
Ray