Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
eleases/spark-release-3-5-1.html >> >> We would like to acknowledge all community members for contributing to >> this >> release. This release would not have been possible without you. >> >> Jungtaek Lim >> >> ps. Yikun is helping us through releasing the official docker image for >> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available. >> >> -- John Zhuge

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
https://github.com/apache/arrow-datafusion-comet for more details if >> you are interested. We'd love to collaborate with people from the open >> source community who share similar goals. >> >> Thanks, >> Chao >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- John Zhuge

Rename columns without manually setting them all

2023-06-21 Thread John Paul Jayme
Hi, This is currently my column definition: Employee ID | Name | Client | Project | Team | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022, with a sample row: 12345 | Dummy x | Dummy a | abc | team a | OFF | WO | WH | WH | WH. As you can see, the outer columns are just
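One way to avoid renaming each column by hand is to compute the full list of new names programmatically and apply it in one call. A minimal pure-Python sketch of that idea (the names and the `day_` prefix are illustrative; in PySpark the resulting list would be applied with `df.toDF(*new_names)`):

```python
# Hypothetical sketch: build the new column-name list programmatically
# instead of calling withColumnRenamed once per column.
old_names = ["Employee ID", "Name", "Client", "Project", "Team",
             "01/01/2022", "02/01/2022", "03/01/2022"]

# Keep the fixed columns as-is, and prefix every date column uniformly.
new_names = [n if not n[0].isdigit() else f"day_{n.replace('/', '_')}"
             for n in old_names]
print(new_names)
```

With an actual DataFrame, `df.toDF(*new_names)` renames positionally, so the only requirement is that the list length matches the column count.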

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
ct has no attribute 'read_excel'. Can you advise? JOHN PAUL JAYME Data Engineer
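Spark itself has no `read_excel`; that is a pandas method. A common workaround is to parse the file with pandas and hand the result to Spark. A sketch under the assumption that pandas and an Excel engine (e.g. openpyxl) are installed; the file name is illustrative:

```python
# Sketch: let pandas do the Excel parsing, then convert to a Spark
# DataFrame. The demo writes a tiny xlsx first so it is self-contained.
import pandas as pd

pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).to_excel("demo.xlsx", index=False)
pdf = pd.read_excel("demo.xlsx")          # pandas reads the Excel file
# spark_df = spark.createDataFrame(pdf)   # then convert (needs a SparkSession)
print(list(pdf.columns), len(pdf))
```

For large files, a third-party reader such as the spark-excel package is often preferred over collecting everything through pandas on the driver.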

RE: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? What is your target release date for Spark 3.3?

2022-01-12 Thread Crowe, John
to reduce the madness..  Regards; John Crowe TDi Technologies, Inc. 1600 10th Street Suite B Plano, TX 75074 (800) 695-1258 supp...@tditechnologies.com<mailto:supp...@tditechnologies.com> From: Sean Owen Sent: Wednesday, January 12, 2022 10:23 AM To: Crowe, John Cc: user@spark.apac

RE: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? What is your target release date for Spark 3.3?

2022-01-12 Thread Crowe, John
I too would like to know when you anticipate Spark 3.3.0 to be released due to the Log4j CVE’s. Our customers are all quite concerned. Regards; John Crowe TDi Technologies, Inc. 1600 10th Street Suite B Plano, TX 75074 (800) 695-1258 supp...@tditechnologies.com<mailto:s

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
n Karau >>>> wrote: >>>> >>>>> Hi Folks, >>>>> >>>>> I'm continuing my adventures to make Spark on containers party and I >>>>> was wondering if folks have experience with the different batch >>>>> scheduler options that they prefer? I was thinking so that we can >>>>> better support dynamic allocation it might make sense for us to >>>>> support using different schedulers and I wanted to see if there are >>>>> any that the community is more interested in? >>>>> >>>>> I know that one of the Spark on Kube operators supports >>>>> volcano/kube-batch so I was thinking that might be a place I start >>>>> exploring but also want to be open to other schedulers that folks >>>>> might be interested in. >>>>> >>>>> Cheers, >>>>> >>>>> Holden :) >>>>> >>>>> -- >>>>> Twitter: https://twitter.com/holdenkarau >>>>> Books (Learning Spark, High Performance Spark, etc.): >>>>> https://amzn.to/2MaRAG9 >>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>>>> >>>>> - >>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>> >>>>> -- >>> Twitter: https://twitter.com/holdenkarau >>> Books (Learning Spark, High Performance Spark, etc.): >>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> >>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >>> >> -- John Zhuge

Column-level encryption in Spark SQL

2020-12-18 Thread john washington
Dear Spark team members, Can you please advise if Column-level encryption is available in Spark SQL? I am aware that HIVE supports column level encryption. Appreciate your response. Thanks, John

Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
y available. > > Kindly provide some insight on this. > > > Paras > 9130006036 > -- John
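For timestamp differences, the usual Spark SQL approach is to cast both timestamps to epoch seconds (a long) and subtract. A pure-Python sketch of the same computation, outside Spark (the values are illustrative):

```python
# What "CAST(end AS LONG) - CAST(start AS LONG)" computes in Spark SQL:
# the difference between two timestamps in whole seconds.
from datetime import datetime

start = datetime(2018, 10, 12, 9, 0, 0)
end = datetime(2018, 10, 12, 10, 30, 0)
diff_seconds = int((end - start).total_seconds())
print(diff_seconds)  # 5400 seconds = 1.5 hours
```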

Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
age > 25.0 (TID 35067, localhost, executor driver) > : org.apache.hadoop.hdfs.BlockMissingException > : Could not obtain block: > BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693 > > ``` > > Please can anyone help me with how to handle such exception in pyspark. > > -- > Best Regards > *Divay Jindal* > > > -- John
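One generic way to cope with a transient failure like this is to wrap the failing action in a retry loop. A hedged pure-Python sketch; the exception type and the read function are hypothetical stand-ins for a PySpark action that can raise a Py4JJavaError wrapping a BlockMissingException:

```python
# Sketch: retry a flaky read a few times before giving up.
import time

def with_retries(action, attempts=3, delay=0.0):
    for i in range(attempts):
        try:
            return action()
        except RuntimeError:          # stand-in for the PySpark exception
            if i == attempts - 1:
                raise                 # out of attempts, re-raise
            time.sleep(delay)

calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Could not obtain block")  # simulated failure
    return "data"

print(with_retries(flaky_read))
```

If the block is genuinely gone from HDFS (not just transiently unavailable), no amount of retrying helps; `hdfs fsck` on the path is the usual next step.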

Where can I read the Kafka offsets in SparkSQL application

2018-07-24 Thread John, Vishal (Agoda)
Hello all, I have to read data from Kafka topic at regular intervals. I create the dataframe as shown below. I don’t want to start reading from the beginning on each run. At the same time, I don’t want to miss the messages between run intervals. val queryDf = sqlContext .read

Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
ine (where the app is being submitted from). > > On Wed, Jan 3, 2018 at 6:46 PM, John Zhuge <john.zh...@gmail.com> wrote: > > Thanks Jacek and Marcelo! > > > > Any reason it is not sourced? Any security consideration? > > > > > > On Wed, Jan 3, 2018 at 9:59 A

Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread John Zhuge
Thanks Jacek and Marcelo! Any reason it is not sourced? Any security consideration? On Wed, Jan 3, 2018 at 9:59 AM, Marcelo Vanzin <van...@cloudera.com> wrote: > On Tue, Jan 2, 2018 at 10:57 PM, John Zhuge <jzh...@apache.org> wrote: > > I am running Spark 2.0.0 and 2.1.

Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-02 Thread John Zhuge
lustermode. See the YARN-related Spark Properties > <https://github.com/apache/spark/blob/master/docs/running-on-yarn.html#spark-properties> > for > more information. Does it mean spark-env.sh will not be sourced when starting AM in cluster mode? Does this paragraph appy to executor as well? Thanks, -- John Zhuge
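As the thread concludes, spark-env.sh is only sourced on the machine that launches the application, so for YARN cluster mode the AM and executor environments are typically set through configuration instead. A hedged fragment (MY_VAR and the value are placeholders):

```shell
# spark-defaults.conf (or pass via --conf on spark-submit)
spark.yarn.appMasterEnv.MY_VAR  some_value   # env var for the YARN Application Master
spark.executorEnv.MY_VAR        some_value   # env var for each executor
```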

Cases when to clear the checkpoint directories.

2017-10-07 Thread John, Vishal (Agoda)
Hello TD, You had replied to one of the questions about checkpointing – This is an unfortunate design on my part when I was building DStreams :) Fortunately, we learnt from our mistakes and built Structured Streaming the correct way. Checkpointing in Structured Streaming stores only the

Re: Logging in RDD mapToPair of Java Spark application

2017-07-30 Thread John Zeng
/tmp/logs/root/logs/application_1501197841826_0013does not exist. Log aggregation has not completed or is not enabled. Any other way to see my logs? Thanks John From: ayan guha <guha.a...@gmail.com> Sent: Sunday, July 30, 2017 10:34 PM To: John Zeng; Ri

Re: Logging in RDD mapToPair of Java Spark application

2017-07-30 Thread John Zeng
. But where are they? Thanks John From: Riccardo Ferrari <ferra...@gmail.com> Sent: Saturday, July 29, 2017 8:18 PM To: johnzengspark Cc: User Subject: Re: Logging in RDD mapToPair of Java Spark application Hi John, The reason you don't see the second

Spark on yarn logging

2017-06-29 Thread John Vines
I followed the instructions for configuring a custom logger per https://spark.apache.org/docs/2.0.2/running-on-yarn.html (because we have long running spark jobs, sometimes occasionally get stuck and without a rolling file appender will fill up disk). This seems to work well for us, but it breaks

PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-27 Thread John Omernik
Am I doing something wrong here? Why is the temp stuff owned by root? Is there a bug in saving things due to this ownership? John Exception: Py4JJavaError: An error occurred while calling o338.save. : org.apache.hadoop.security.AccessControlException: User jomernik(user id 101) does has been den

Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread John Compitello
No problem. It was a big headache for my team as well. One of us already reimplemented it from scratch, as seen in this pending PR for our project. https://github.com/hail-is/hail/pull/1895 Hopefully you find that useful. We'll hopefully try to PR that into Spark at some point. Best, John

Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread John Compitello
Hey Anthony, You're the first person besides myself I've seen mention this. BlockMatrix multiply is not the best method. As far as me and my team can tell, the memory problem stems from the fact that when Spark tries to compute block (i, j) of the matrix, it tries to manifest all of row i from
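The memory pressure described above comes from manifesting whole block-rows at once. Mathematically, block C[i][j] is the sum over k of A[i][k] @ B[k][j], so each product can be computed and folded into an accumulator one at a time. A pure-Python sketch of that formulation (not Spark code; 1x1 "blocks" keep the demo tiny):

```python
# Sketch: blocked matrix multiply that accumulates one block product at a
# time instead of holding all of row i and column j in memory at once.
def matmul(a, b):
    n, m, p = len(a), len(b[0]), len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(m)]
            for i in range(n)]

def block_matmul(A, B):
    # A, B: 2-D grids of blocks (each block a small dense matrix).
    n, m, p = len(A), len(B[0]), len(B)
    C = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = None
            for k in range(p):        # stream one block product at a time
                prod = matmul(A[i][k], B[k][j])
                acc = prod if acc is None else [
                    [x + y for x, y in zip(r1, r2)]
                    for r1, r2 in zip(acc, prod)]
            C[i][j] = acc
    return C

A = [[[[1]], [[2]]], [[[3]], [[4]]]]  # 2x2 grid of 1x1 blocks
B = [[[[5]], [[6]]], [[[7]], [[8]]]]
print(block_matmul(A, B))
```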

Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-08 Thread Satish John Bosco
I have tried the configuration calculator sheet provided by Cloudera as well, but saw no improvement. However, ignoring the 17 mil operation to begin with, let's consider the simple sort on YARN and Spark, which shows a tremendous difference. The operation is simple: a selected numeric column to be sorted

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
at MapR? Usually the system guys target snapshots, volumes, and posix compliance if they are bought into Isilon. Good luck Mich. Regards, John Leach > On Jun 5, 2017, at 9:27 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi John, > > Thanks. Did you

Re: An Architecture question on the use of virtualised clusters

2017-06-05 Thread John Leach
point. Regards, John Leach > On Jun 5, 2017, at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > I am concerned about the use case of tools like Isilon or Panasas to create a > layer on top of HDFS, essentially a HCFS on top of HDFS with the usual 3x

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread John Compitello
Spark is doing operations on each partition in parallel. If you decrease number of partitions, you’re potentially doing less work in parallel depending on your cluster setup. > On May 23, 2017, at 4:23 PM, Andrii Biletskyi > wrote: > > > No, I didn't

Matrix multiplication and cluster / partition / blocks configuration

2017-05-11 Thread John Compitello
size? Does anyone have any suggestions? I’ve tried throwing 900 cores at a 100k by 100k matrix multiply with 1000 by 1000 sized blocks, and that seemed to hang forever and eventually fail. Thanks , John - To unsubscribe e-mail

[ANNOUNCE] Apache Gora 0.7 Release

2017-03-23 Thread lewis john mcgibbney
Hi Folks, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.7. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and

Re: DataFrame from in memory datasets in multiple JVMs

2017-02-28 Thread John Desuvio
transmits the data from each of the JVMs over the network. This >> seems like overkill though. >> >> Is there a simpler solution for getting this data into a DataFrame? >> >> Thanks, >> John >> >> >> >> -- >

Re: Driver hung and ran out of memory while writing to console progress bar

2017-02-09 Thread John Fang
the spark version is 2.1.0 -- From: 方孝健(玄弟) Sent: Friday, February 10, 2017 12:35 To: spark-dev; spark-user Subject: Driver hung and ran out of memory while writing to

Driver hung and ran out of memory while writing to console progress bar

2017-02-09 Thread John Fang
[Stage 172:==> (10328 + 93) / 16144] [Stage 172:==> (10329 + 93) / 16144] [Stage 172:==> (10330 + 93) / 16144] [Stage 172:==>

spark main thread quits, but the driver doesn't crash on a standalone cluster

2017-01-17 Thread John Fang
My Spark main thread creates some daemon threads, which may be timer threads. Then the Spark application throws some exceptions, and the main thread quits. But the JVM of the driver doesn't crash on a standalone cluster. Of course, the problem doesn't happen on a YARN cluster, because the application

spark main thread quits, but the JVM of the driver doesn't crash

2017-01-17 Thread John Fang
My Spark main thread creates some daemon threads. Then the Spark application throws some exceptions, and the main thread quits. But the JVM of the driver doesn't crash, so what can I do? For example: val sparkConf = new SparkConf().setAppName("NetworkWordCount")

send this email to unsubscribe

2016-12-29 Thread John

how can I get the application belonging to the driver?

2016-12-26 Thread John Fang
I hope I can get the application by its driverId, but I can't find such a REST API in Spark. Then how can I get the application that belongs to a given driver?

can we unite the UIs of different standalone clusters?

2016-12-14 Thread John Fang
As we know, each standalone cluster has its own UI. Then we will have more than one UI if we have many standalone clusters. How can I have a single UI which can access different standalone clusters?

how can I set the log configuration file for the Spark history server?

2016-12-08 Thread John Fang
./start-history-server.sh starting org.apache.spark.deploy.history.HistoryServer, logging to  /home/admin/koala/data/versions/0/SPARK/2.0.2/spark-2.0.2-bin-hadoop2.6/logs/spark-admin-org.apache.spark.deploy.history.HistoryServer-1-v069166214.sqa.zmf.out Then the history will print all log to the

Question about the DirectKafkaInputDStream

2016-12-08 Thread John Fang
The source is DirectKafkaInputDStream, which can ensure exactly-once semantics on the consumer side. But I have a question based on the following code. As we know, "graph.generateJobs(time)" will create RDDs and generate jobs. And the source RDD is KafkaRDD, which contains the offsetRange. The jobs are

Can Spark support exactly-once with Kafka, given the following questions?

2016-12-04 Thread John Fang
1. If a task completes its operation, it will notify the driver. The driver may not receive the message due to the network, and think the task is still running. Then won't the child stage fail to be scheduled? 2. How does Spark guarantee that the downstream task can receive the shuffle data completely? In fact, I

two spark-shells spark on mesos not working

2016-11-22 Thread John Yost
would be very much appreciated. Thanks! :) --John

Spark Logging : log4j.properties or log4j.xml

2016-08-24 Thread John Jacobs
One can specify "-Dlog4j.configuration=log4j.properties" or "-Dlog4j.configuration=log4j.xml". Is there any preference to using one over the other? All the Spark documentation talks about using "log4j.properties" only ( http://spark.apache.org/docs/latest/configuration.html#configuring-logging). So is only "log4j.properties"
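Either format works as long as log4j can load it; what matters is that the driver and executor JVMs actually receive the flag. A hedged spark-submit sketch (paths are placeholders):

```shell
# Pass the log4j config to both the driver and the executors; use either
# a .properties or a .xml file, whichever log4j flavor you maintain.
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.xml"
```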

Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
file. > >>> > >>> Is this even possible or the only way to use R is as part of RStudio > >>> orchestration of our Spark cluster? > >>> > >>> > >>> > >>> Thanks for the help! > >>> > >>> >

Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
t; >>> I want to use R code as part of spark application (the same way I would >>> do with Scala/Python). I want to be able to run an R syntax as a map >>> function on a big Spark dataframe loaded from a parquet file. >>> >>> Is this even possible o

Re: Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
Thanks Saurabh! That explode function looks like it is exactly what I need. We will be using MLlib quite a lot - Do I have to worry about python versions for that? John On Wed, Jun 22, 2016 at 4:34 PM, Saurabh Sardeshpande <saurabh...@gmail.com> wrote: > Hi John, > > If you ca

Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
? Particularly how to split one row into multiples. Lastly, I am a bit hesitant to ask but is there a recommendation on which version of python to use? Not interested in which is better, just want to know if they are both supported equally. I am using Spark 1.6.1 (Hortonworks distro). Thanks! John
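The split-one-row-into-many step can be seen in miniature outside Spark: expand each (start, end) pair into one record per date. A pure-Python sketch of the logic (in Spark the same expansion is applied per row and flattened, e.g. with flatMap or an exploded column):

```python
# Sketch: one input row with a date range becomes one output row per day.
from datetime import date, timedelta

def explode_dates(row):
    start, end = row["start"], row["end"]
    days = (end - start).days
    return [{"id": row["id"], "date": start + timedelta(d)}
            for d in range(days + 1)]            # inclusive of both ends

row = {"id": 1, "start": date(2016, 6, 20), "end": date(2016, 6, 22)}
for r in explode_dates(row):
    print(r["date"].isoformat())
```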

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread John Simon
Hi Mich, batch interval is 10 seconds, and I don't use sliding window. Typical message count per batch is ~100k. -- John Simon On Fri, Jun 10, 2016 at 10:31 AM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Hi John, > > I did not notice anything unusual in you

Re: Spark Streaming getting slower

2016-06-09 Thread John Simon
Sorry, forgot to mention that I don't use broadcast variables. That's why I'm puzzled here. -- John Simon On Thu, Jun 9, 2016 at 11:09 AM, John Simon <john.si...@tapjoy.com> wrote: > Hi, > > I'm running Spark Streaming with Kafka Direct Stream, batch interval > is 10 second

Spark Streaming getting slower

2016-06-09 Thread John Simon
user.country US user.dir /home/hadoop user.home /home/hadoop user.language en user.name hadoop user.timezone UTC ``` -- John Simon - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user

Re: HBase / Spark Kerberos problem

2016-05-19 Thread John Trengrove
Have you had a look at this issue? https://issues.apache.org/jira/browse/SPARK-12279 There is a comment by Y Bodnar on how they successfully got Kerberos and HBase working. 2016-05-18 18:13 GMT+10:00 : > Hi all, > > I have been puzzling over a Kerberos

Re: How to get the batch information from Streaming UI

2016-05-16 Thread John Trengrove
You would want to add a listener to your Spark Streaming context. Have a look at the StatsReportListener [1]. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StatsReportListener 2016-05-17 7:18 GMT+10:00 Samuel Zhou : > Hi, >

Re: Silly Question on my part...

2016-05-16 Thread John Trengrove
If you are wanting to share RDDs it might be a good idea to check out Tachyon / Alluxio. For the Thrift server, I believe the datasets are located in your Spark cluster as RDDs and you just communicate with it via the Thrift JDBC Distributed Query Engine connector. 2016-05-17 5:12 GMT+10:00

Re: How to use the spark submit script / capability

2016-05-15 Thread John Trengrove
] https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java John 2016-05-16 2:33 GMT+10:00 Stephen Boesch <java...@gmail.com>: > > There is a committed PR from Marcelo Vanzin addressing that capability: > > https://githu

Re: VectorAssembler handling null values

2016-04-20 Thread John Trengrove
You could handle null values by using the DataFrame.na functions in a preprocessing step like DataFrame.na.fill(). For reference: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions John On 21 April 2016 at 03:41, Andres Perez <and...@tresata.
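The preprocessing step suggested above replaces missing values with a default before assembling feature vectors, which is what `DataFrame.na.fill(default)` does column-wide in Spark. A pure-Python stand-in for the same idea (the row layout is illustrative):

```python
# Sketch: replace None with a default value in every field of every row,
# mirroring the effect of DataFrame.na.fill(0.0) on numeric columns.
rows = [{"f1": 1.0, "f2": None}, {"f1": None, "f2": 3.0}]

def na_fill(rows, default):
    return [{k: (default if v is None else v) for k, v in r.items()}
            for r in rows]

print(na_fill(rows, 0.0))
```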

Docker Mesos Spark Port Mapping

2016-04-17 Thread John Omernik
am hoping I have some options for net=bridge. Thoughts? John

Access to Mesos Docker Cmd for Spark Executors

2016-04-17 Thread John Omernik
umes, and the port maps, allowing us to edit the command (have the default be "what works" and if we edit in such a way, it's our own fault) could give us the freedom to do things like this... does this capability exist? Thanks, John

Re: Problem with pyspark on Docker talking to YARN cluster

2016-04-06 Thread John Omernik
of the physical node in bridged mode, it doesn't see it and errors out... as stated we need a bind address, and advertise address if this is to work), 2. Same restrictions. 3. cluster mode doesn't work for pyspark shell. Any other thoughts? John On Thu, Jun 11, 2015 at 12:09 AM, Ashwin Shankar

Spark - Mesos HTTP API

2016-04-03 Thread John Omernik
result in a really big/heavy docker images in order to achieve this.So that got me thinking about the HTTP API, and was wondering if there is JIRA to track this or if this is something Spark is planning. Thanks! John

Apache Spark-Get All Field Names From Nested Arbitrary JSON Files

2016-03-31 Thread John Radin
JSON as a first class citizen eventually, but it is still aways off yet. Any guidance would be sincerely appreciated! Thanks! John

RE: Graphx

2016-03-11 Thread John Lilley
We have almost zero node info – just an identifying integer. John Lilley From: Alexis Roos [mailto:alexis.r...@gmail.com] Sent: Friday, March 11, 2016 11:24 AM To: Alexander Pivovarov <apivova...@gmail.com> Cc: John Lilley <john.lil...@redpoint.net>; Ovidiu-Cristian MARCU <ovi

RE: Graphx

2016-03-11 Thread John Lilley
to run our software on 1bn edges. John Lilley From: Alexander Pivovarov [mailto:apivova...@gmail.com] Sent: Friday, March 11, 2016 11:13 AM To: John Lilley <john.lil...@redpoint.net> Cc: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>; lihu <lihu...@gmail.com>;

RE: Graphx

2016-03-11 Thread John Lilley
currentGroupSize++; } if (currentGroupSize >= groupSize) { currentGroupSize = 0; currentEdge += 2; } else { currentEdge++; } } } } John Lilley Chief Architect, RedPoint Global Inc. T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077 Skype: jl

RE: Graphx

2016-03-11 Thread John Lilley
currentGroupSize++; } if (currentGroupSize >= groupSize) { currentGroupSize = 0; currentEdge += 2; } else { currentEdge++; } } } } John Lilley Chief Architect, RedPoint Global Inc. T: +1 303 541 1516 | M: +1 720 938 5761

RE: Graphx

2016-03-11 Thread John Lilley
. It degrades gracefully along the O(N^2) curve and additional memory reduces time. John Lilley From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr] Sent: Friday, March 11, 2016 8:14 AM To: John Lilley <john.lil...@redpoint.net> Cc: lihu <lihu...@gmail.com>; Andrew

RE: Graphx

2016-03-11 Thread John Lilley
A colleague did the experiments and I don't know exactly how he observed that. I think it was indirect: from the Spark diagnostics indicating the amount of I/O, he deduced that this was RDD serialization. Also, when he added light compression to RDD serialization, this improved matters. John

RE: Graphx

2016-03-11 Thread John Lilley
would get failures. By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16GB node in about an hour. I think that a very large cluster will do better, but we did not explore that. John Lilley Chief Architect, RedPoint Global Inc. T: +1 303 541 1516 | M: +1 720

Appropriate Apache Users List Uses

2016-02-09 Thread John Omernik
All, I received this today, is this appropriate list use? Note: This was unsolicited. Thanks John From: Pierce Lamb <pl...@snappydata.io> 11:57 AM (1 hour ago) to me Hi John, I saw you on the Spark Mailing List and noticed you worked for * and wanted to reach out. My company, Snap

[ANNOUNCE] Apache Nutch 2.3.1 Release

2016-01-21 Thread lewis john mcgibbney
Hi Folks, !!Apologies for cross posting!! The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v2.3.1, we advise all current users and developers of the 2.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 2.X branch

Problem About Worker System.out

2015-12-28 Thread David John
I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue. I hope this question can be solved. The program is running under these configurations: YARN cluster, YARN-client mode. In Scala, writing code

FW: Problem About Worker System.out

2015-12-28 Thread David John
015 at 5:33 PM, David John <david_john_2...@outlook.com> wrote: I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue. I hope this question can be solved. The program is running under this configurati

Conf Settings in Mesos

2015-11-12 Thread John Omernik
c) Instead of having to repackage an tgz for each app, it would just propagate...am I looking at this wrong? John

NPE is Spark Running on Mesos in Finegrained Mode

2015-11-12 Thread John Omernik
behaivior, and if it's something that might point to a bug or if it's just classic uninitiated user error :) John NPE in Fine Grained Mode: 15/11/12 13:52:00 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-94b6962b-2c28-4c10-946c-bd3b5c8c8069 15/11/12 13:52:00 INFO stor

Different classpath across stages?

2015-11-11 Thread John Meehan
the right protocol buffer class across stage boundaries? -John - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

RE: Question about GraphX connected-components

2015-10-12 Thread John Lilley
is reached? Are there tuning parameters that optimize for data all fitting in memory vs. data that must spill? Thanks, John Lilley From: Igor Berman [mailto:igor.ber...@gmail.com] Sent: Saturday, October 10, 2015 12:06 PM To: John Lilley <john.lil...@redpoint.net> Cc: user@spark.apache.org;

Question about GraphX connected-components

2015-10-09 Thread John Lilley
happens when the data set exceed memory, does it spill to disk "nicely" or degrade catastrophically? Thanks, John Lilley

Python Packages in Spark w/Mesos

2015-09-21 Thread John Omernik
! John

Mesos Tasks only run on one node

2015-09-21 Thread John Omernik
I have a happy healthy Mesos cluster (0.24) running in my lab. I've compiled spark-1.5.0 and it seems to be working fine, except for one small issue, my tasks all seem to run on one node. (I have 6 in the cluster). Basically, I have directory of compressed text files. Compressed, these 25 files

Docker/Mesos with Spark

2015-09-19 Thread John Omernik
thout tons of admin overhead, so I really want to explore. Thanks! John

[ANNOUNCE] Apache Gora 0.6.1 Release

2015-09-15 Thread lewis john mcgibbney
Hi All, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.6.1. What is Gora? Gora is a framework which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs,

Re: insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-08-20 Thread John Jay
The answer is that my table was not serialized with Kryo, but I started the spark-sql shell with Kryo, so the data could not be deserialized. -- View this message in context:

ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
'SparkUI' failed after 16 retries! Thanks Joji john

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Joji John
and use that. Thanks Joji John From: Ajay Singal asinga...@gmail.com Sent: Friday, July 24, 2015 6:59 AM To: Joji John Cc: user@spark.apache.org Subject: Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI

many-to-many join

2015-07-21 Thread John Berryman
extra hard to first locate states and then emit all userid-state pairs. How should I be doing this? Thanks, -John

insert overwrite table phonesall in spark-sql resulted in java.io.StreamCorruptedException

2015-07-02 Thread John Jay
My spark-sql command: spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf spark.driver.cores=20 --conf spark.cores.max=20 --conf spark.executor.memory=2g --conf spark.driver.memory=2g --conf spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf

PySpark on YARN port out of range

2015-06-19 Thread John Meehan
/: java.lang.IllegalArgumentException (port out of range:1315905645) [duplicate 7] Traceback (most recent call last): File stdin, line 1, in module 15/06/19 11:49:44 INFO cluster.YarnScheduler: Removed TaskSet 39.0, whose tasks have all completed, from pool File /home/john/spark-1.4.0/python/pyspark/rdd.py

Fwd: Spark/PySpark errors on mysterious missing /tmp file

2015-06-12 Thread John Berryman
. a=sc.parallelize([(16646160,1)]) b=stuff.map(lambda x:(16646160,1)) #b=sc.parallelize(b.collect()) a.join(b).take(10) It still breaks. (Here again including the comment line fixes the problem.) So I'm apparently looking at some sort of spark/pyspark bug. Spark 1.2.0. Any idea? -John

Reopen Jira or New Jira

2015-06-11 Thread John Omernik
in their code, but I think someone (with more knowledge than I) should probably look into this on Spark as well due to it appearing to have changed behavior between versions. Thoughts? John Previous Post All - I am facing an odd issue and I am not really sure where to go for support at this point. I am

Re: Spark 1.3.1 On Mesos Issues.

2015-06-08 Thread John Omernik
that this is happening at is way above my head. :) On Fri, Jun 5, 2015 at 4:38 PM, John Omernik j...@omernik.com wrote: Thanks all. The answers post is me too, I multi thread. That and Ted is aware to and Mapr is helping me with it. I shall report the answer of that investigation when we

Transform Functions and Python Modules

2015-06-08 Thread John Omernik
?) If there are any good docs on this, I'd love to understand it more. Thanks! John Example def parseLine(line): restr = ^(\w\w\w ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) logre = re.compile(restr) m = logre.search(line[1]) # Why does every record of the RDD have a NONE value

Re: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
yuzhih...@gmail.com wrote: John: Which Spark release are you using ? As of 1.4.0, RDD has this method: def isEmpty(): Boolean = withScope { FYI On Fri, Jun 5, 2015 at 9:01 AM, Evo Eftimov evo.efti...@isecc.com wrote: Foreachpartition callback is provided with Iterator by the Spark

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread John Omernik
://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Jun 1, 2015 at 2:49 PM, John Omernik j...@omernik.com wrote: All - I am facing an odd issue and I am not really sure where to go for support at this point. I am running MapR

Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
Is there pythonic/sparkonic way to test for an empty RDD before using the foreachRDD? Basically I am using the Python example https://spark.apache.org/docs/latest/streaming-programming-guide.html to put records somewhere When I have data, it works fine, when I don't I get an exception. I am not

Re: Spark 1.3.1 On Mesos Issues.

2015-06-04 Thread John Omernik
be appreciated. Thanks! John On Mon, Jun 1, 2015 at 6:14 PM, Dean Wampler deanwamp...@gmail.com wrote: It would be nice to see the code for MapR FS Java API, but my google foo failed me (assuming it's open source)... So, shooting in the dark ;) there are a few things I would check, if you

Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread John Omernik
perplexed on the change from 1.2.0 to 1.3.1. Thank you, John Full Error on 1.3.1 on Mesos: 15/05/19 09:31:26 INFO MemoryStore: MemoryStore started with capacity 1060.3 MB java.lang.NullPointerException at com.mapr.fs.ShimLoader.getRootClassLoader(ShimLoader.java:96

dependencies on java-netlib and jblas

2015-05-08 Thread John Niekrasz
perfectly fine without any native (non-Java) libraries installed at all? Thanks for the help, John -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dependencies-on-java-netlib-and-jblas-tp22818.html Sent from the Apache Spark User List mailing list archive

Re: Joining data using Latitude, Longitude

2015-03-10 Thread John Meehan
There are some techniques you can use If you geohash http://en.wikipedia.org/wiki/Geohash the lat-lngs. They will naturally be sorted by proximity (with some edge cases so watch out). If you go the join route, either by trimming the lat-lngs or geohashing them, you’re essentially grouping

Re: Getting to proto buff classes in Spark Context

2015-02-28 Thread John Meehan
Maybe try including the jar with --driver-class-path jar On Feb 26, 2015, at 12:16 PM, Akshat Aranya aara...@gmail.com wrote: My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the

Sharing Spark Drivers

2015-02-24 Thread John Omernik
perspective. I will grant that I am coming from a traditional background, so some of the older ideas for how to set things up may be creeping into my thinking, but if that's the case, I'd love to understand better. Thanks! John

Re: Sharing Spark Drivers

2015-02-24 Thread John Omernik
though, if side projects are spinning up to support this, why not make this a feature of the main project or is it just that esoteric that it's not important for the main project to be looking into it? On Tue, Feb 24, 2015 at 9:25 AM, Chip Senkbeil chip.senkb...@gmail.com wrote: Hi John

Re: Spark on Mesos: Multiple Users with iPython Notebooks

2015-02-20 Thread John Omernik
sort of time frame I could possibly communicate to my team? Anything I can do? Thanks! On Fri, Feb 20, 2015 at 4:36 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote: On Thu, Feb 19, 2015 at 2:49 PM, John Omernik j...@omernik.com wrote: I am running Spark on Mesos and it works quite well

Spark on Mesos: Multiple Users with iPython Notebooks

2015-02-19 Thread John Omernik
was not clear on those options. If anyone could point me in the right direction, I would greatly appreciate it! John - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

[ANNOUNCE] Apache Science and Healthcare Track @ApacheCon NA 2015

2015-01-08 Thread Lewis John Mcgibbney
Hi Folks, Apologies for cross posting :( As some of you may already know, @ApacheCon NA 2015 is happening in Austin, TX April 13th-16th. This email is specifically written to attract all folks interested in Science and Healthcare... this is an official call to arms! I am aware that there are
