Announcing the Community Over Code 2024 Streaming Track

2024-03-20 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Denver, Colorado, October 7-10, 2024. The call for presentations is open

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes,

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
ports are exposed (5005/TCP, 7078/TCP, 7079/TCP, 4040/TCP) as expected. Did I miss anything, or is this a known bug where the executor pod template is not respected in terms of port exposure? Thanks in advance for your help. James

Announcing the Community Over Code 2023 Streaming Track

2023-06-09 Thread James Hughes
Hi all, Community Over Code, the ASF conference, will be held in Halifax, Nova Scotia, October 7-10, 2023. The call for presentations is open now through July 13, 2023. I am one of the co-chairs for the

Re: query time comparison to several SQL engines

2022-04-07 Thread James Turton
What might be the biggest factor affecting running time here is that Drill's query execution is not fault tolerant while Spark's is. The philosophies differ: Drill's says "when you're doing interactive analytics and a node dies, killing your query as it goes, just run the query again."

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread James Yu
Question: Spark uses log4j 1.2.17; if my application jar contains log4j 2.x and gets submitted to the Spark cluster, which version of log4j actually gets used during the Spark session? From: Sean Owen Sent: Monday, December 13, 2021 8:25 AM To: Jörn Franke Cc:

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
damage or destruction. On Wed, 8 Dec 2021 at 19:45, James Yu <ja...@ispot.tv> wrote: Just thought about another possibility which is to containerize the history server and run the container with proper restart policy. This may be the approach we will be taking because the deploymen

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
Sent: Tuesday, December 7, 2021 1:29 PM To: James Yu Cc: user @spark Subject: Re: start-history-server.sh doesn't survive system reboot. Recommendation? The scripts just launch the processes. To make any process restart on system restart, you would need to set it up as a system service (i.e

start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-07 Thread James Yu
Hi Users, We found that the history server launched by using the "start-history-server.sh" command does not survive system reboot. Any recommendation of making it always up even after reboot? Thanks, James
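The advice that came back in this thread was to run the history server as a system service so the init system restarts it after reboot. A minimal systemd unit sketch — the install paths and service user are assumptions, not from the thread:

```ini
# /etc/systemd/system/spark-history-server.service  (hypothetical paths/user)
[Unit]
Description=Spark History Server
After=network.target

[Service]
Type=forking
User=spark
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enabling it with `systemctl enable spark-history-server` makes it start on boot; the containerized-with-restart-policy approach mentioned later in the thread is an equivalent alternative.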

Re: Performance Problems Migrating to S3A Committers

2021-08-05 Thread James Yu
See this ticket https://issues.apache.org/jira/browse/HADOOP-17201. It may help your team. From: Johnny Burns Sent: Tuesday, June 22, 2021 3:41 PM To: user@spark.apache.org Cc: data-orchestration-team Subject: Performance Problems Migrating to S3A Committers

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
rito Sent: Wednesday, February 3, 2021 11:05 AM To: James Yu ; user Subject: Re: Poor performance caused by coalesce to 1 Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absol

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
there is a simple and useful way to solve this kind of issue which we believe is quite common for many people. Thanks James
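The reply above points at the cause: coalesce(1) collapses the final stage to a single task, which can also drag the upstream work into that one task. A sketch of the usual fix (the transformation and paths are hypothetical):

```scala
// Bad: coalesce(1) narrows the whole last stage to one task, so the
// expensive work upstream may also run with parallelism of 1.
df.filter($"status" === "active")
  .coalesce(1)
  .write.parquet("/out")

// Better: repartition(1) inserts a shuffle boundary, so the filter still
// runs with full parallelism and only the final write uses one task.
df.filter($"status" === "active")
  .repartition(1)
  .write.parquet("/out")
```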

Re: Where do the executors get my app jar from?

2020-08-14 Thread James Yu
Henoc, Ok. That is for Yarn with HDFS. What will happen in the Kubernetes-as-resource-manager scenario, without HDFS? James From: Henoc Sent: Thursday, August 13, 2020 10:45 PM To: James Yu Cc: user ; russell.spit...@gmail.com Subject: Re: Where do the executors

Where do the executors get my app jar from?

2020-08-13 Thread James Yu
in advance for explanation. James

Re: [Spark ML] existence of Matrix Factorization ALS algorithm's log version

2020-07-29 Thread James Yuan
Thanks for your quick reply. I'll hack it if needed :) James -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: NoClassDefFoundError: scala/Product$class

2020-06-06 Thread James Moore
How are you depending on that org.bdgenomics.adam library? Maybe you're pulling the 2.11 version of that.

Re: Spark driver thread

2020-03-06 Thread James Yu
Pol, thanks for your reply. Actually I am running Spark apps in CLUSTER mode. Is what you said still applicable in cluster mode? Thanks in advance for your further clarification. From: Pol Santamaria Sent: Friday, March 6, 2020 12:59 AM To: James Yu Cc: user

Spark driver thread

2020-03-05 Thread James Yu
Hi, Does a Spark driver always work single-threaded? If yes, does it mean asking for more than one vCPU for the driver is wasteful? Thanks, James

[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
encies += "org.apache.spark" % "spark-core_2.11" % "2.4.3" libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.3" libraryDependencies += "org.apache.spark" % "spark-catalyst_2.11" % "2.4.3" [1] sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala Thanks, James

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread James Cotrotsios
Is there a plan to have a business catalog component for the Data Lake? If not how would someone make a proposal to create an open source project related to that. I would be interested in building out an open source data catalog that would use the Hive metadata store as a baseline for technical

Parallel read parquet file, write to postgresql

2018-12-03 Thread James Starks
Reading Spark doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html). It's not mentioned how to parallel read parquet file with SparkSession. Would --num-executors just work? Any additional parameters needed to be added to SparkSession as well? Also if I want to parallel

Re: Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread James Starks
{ ... }.filter{ ... }.flatMap { records => records.flatMap { record => Seq(record) } } Not smart code, but it works for my case. Thanks for the advice! ‐‐‐ Original Message ‐‐‐ On Saturday, December 1, 2018 12:17 PM, Chris Teoh wrote: > Hi James, > > Try flatMap (_.toL
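The `records.flatMap { record => Seq(record) }` step quoted above works but is redundant; flattening an `RDD[Iterable[MyCaseClass]]` needs only a single flatMap. A minimal sketch using the case class from the thread:

```scala
case class MyCaseClass(field1: String, field2: String)

// nested is built elsewhere, as in the original question.
val nested: org.apache.spark.rdd.RDD[Iterable[MyCaseClass]] = ???

// flatMap(identity) expands each Iterable in place: RDD[Iterable[T]] => RDD[T].
val flat: org.apache.spark.rdd.RDD[MyCaseClass] = nested.flatMap(identity)
```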

Re: Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-30 Thread James Starks
ich should lead to spark > being able to use the serializable versions. > > That’s very much a last resort though! > > Chris > > On 30 Nov 2018, at 05:08, Koert Kuipers wrote: > >> if you only use it in the executors sometimes using lazy works >> >>

Convert RDD[Iterrable[MyCaseClass]] to RDD[MyCaseClass]

2018-11-30 Thread James Starks
When processing data, I create an instance of RDD[Iterable[MyCaseClass]] and I want to convert it to RDD[MyCaseClass] so that it can be further converted to dataset or dataframe with toDS() function. But I encounter a problem that SparkContext can not be instantiated within SparkSession.map

Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-29 Thread James Starks
This is not problem directly caused by Spark, but it's related; thus asking here. I use spark to read data from parquet and processing some http call with sttp (https://github.com/softwaremill/sttp). However, spark throws Caused by: java.io.NotSerializableException:
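The advice quoted in the replies ("if you only use it in the executors sometimes using lazy works") usually takes the shape below: keep the non-serializable sttp backend out of the closure by creating it lazily inside each executor JVM. A hedged sketch against the sttp 1.x API; the record type and URL are hypothetical:

```scala
// One backend per executor JVM; the object is initialized on first use on
// each executor, so the non-serializable backend is never shipped from the driver.
object HttpClient {
  import com.softwaremill.sttp._
  lazy val backend = HttpURLConnectionBackend()
}

rdd.map { rec =>
  import com.softwaremill.sttp._
  implicit val b = HttpClient.backend
  sttp.get(uri"http://example.com/items/${rec.id}").send().code  // hypothetical call
}
```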

Re: Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
fully because I believe you are a bit confused. > > regards, > > Apostolos > > On 07/09/2018 05:39 μμ, James Starks wrote: > > > Is df.write.mode(...).parquet("hdfs://..") also actions function? Checking > > doc shows that my spark doesn't use those act

Re: Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
reduced. Otherwise does Spark support streaming read from database (i.e. spark streaming + spark sql)? Thanks for your reply. ‐‐‐ Original Message ‐‐‐ On 7 September 2018 4:15 PM, Apostolos N. Papadopoulos wrote: > Dear James, > > - check the Spark documentation to see

Spark job's driver programe consums too much memory

2018-09-07 Thread James Starks
I have a Spark job that reads data from a database. By increasing the submit parameter '--driver-memory 25g' the job works without a problem locally, but not in the prod env because the prod master does not have enough capacity. So I have a few questions: - Which functions, such as collect(), would cause the

Re: [External Sender] How to debug Spark job

2018-09-07 Thread James Starks
, Sep 7, 2018 at 5:48 AM James Starks > wrote: > >> I have a Spark job that reads from a postgresql (v9.5) table, and write >> result to parquet. The code flow is not complicated, basically >> >> case class MyCaseClass(field1: String, field2: St

How to debug Spark job

2018-09-07 Thread James Starks
I have a Spark job that reads from a postgresql (v9.5) table, and writes the result to parquet. The code flow is not complicated, basically case class MyCaseClass(field1: String, field2: String) val df = spark.read.format("jdbc")...load() df.createOrReplaceTempView(...) val newdf =

Re: Pass config file through spark-submit

2018-08-17 Thread James Starks
Accidentally to get it working, though don't thoroughly understand why (So far as I know, it's to configure in allowing executor refers to the conf file after copying to executors' working dir). Basically it's a combination of parameters --conf, --files, and --driver-class-path, instead of any

Pass config file through spark-submit

2018-08-16 Thread James Starks
I have a config file that exploits type safe config library located on the local file system, and want to submit that file through spark-submit so that spark program can read customized parameters. For instance, my.app { db { host = domain.cc port = 1234 db = dbname user =
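The follow-up in this thread reports that a combination of --conf, --files, and --driver-class-path made the Typesafe Config file visible to both driver and executors. A sketch of such a submit command — the class name, jar, and paths are hypothetical:

```shell
# --files ships application.conf into each executor's working directory;
# adding "." to both classpaths lets Typesafe Config find it there.
spark-submit \
  --class my.app.Main \
  --files /local/path/application.conf \
  --driver-class-path . \
  --conf "spark.executor.extraClassPath=." \
  target/my-app.jar
```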

Data source jdbc does not support streamed reading

2018-08-08 Thread James Starks
Now my spark job can perform sql operations against database table. Next I want to combine that with streaming context, so switching to readStream() function. But after job submission, spark throws Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does

Re: Newbie question on how to extract column value

2018-08-07 Thread James Starks
(id, derived_data) }.show() Thanks for the advice, it's really helpful! ‐‐‐ Original Message ‐‐‐ On August 7, 2018 5:33 PM, Gourav Sengupta wrote: > Hi James, > > It is always advisable to use the latest SPARK version. That said, can you > please giving a try to datafr

Newbie question on how to extract column value

2018-08-07 Thread James Starks
I am very new to Spark. Just successfully setup Spark SQL connecting to postgresql database, and am able to display table with code sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show() Now I want to perform filter and map function on col_b value. In plain scala it

unsubscribe

2018-02-01 Thread James Casiraghi
unsubscribe

RE: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Gabriel James
Me too. Experiences and recommendations please. Gabriel From: Kevin Wang [mailto:buz...@gmail.com] Sent: Wednesday, April 12, 2017 6:11 AM To: Alonso Isidoro Roman Cc: Gaurav1809 ; user@spark.apache.org Subject: Re: Any NLP library for

Anyone attending spark summit?

2016-10-11 Thread Andrew James
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s "Summit16" - I love Brussels and just registered! Who’s coming with me to get their Spark on?! Cheers, Andrew

UNSUBSCRIBE

2016-08-09 Thread James Ding

Re: Spark, Scala, and DNA sequencing

2016-07-25 Thread James McCabe
into interesting open-source projects like this. James On 24/07/16 09:09, Sean Owen wrote: Also also, you may be interested in GATK, built on Spark, for genomics: https://github.com/broadinstitute/gatk On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor <ofir.ma...@equalum.io> wrote: Hi Jame

Spark, Scala, and DNA sequencing

2016-07-22 Thread James McCabe
Hi! I hope this may be of use/interest to someone: Spark, a Worked Example: Speeding Up DNA Sequencing http://scala-bility.blogspot.nl/2016/07/spark-worked-example-speeding-up-dna.html James - To unsubscribe e-mail: user

Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773 Regards, James On 11 May 2016 at 15:49, Ted Yu <yuzhih...@gmail.com> wrote: > In master branch, behavior is the same. > > Suggest opening a JIRA if you haven't done so. > > On Wed, May 11, 2016

Help understanding an exception that produces multiple stack traces

2016-05-09 Thread James Casiraghi
of the lazy evaluation, and that only actions will do this, but the initial stack trace seems to be showing a persist call with underlying executing work. -Thank you. -James Stack Trace: An error occurred while calling o236.persist. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException

Re: Error from reading S3 in Scala

2016-05-04 Thread James Hammerton
sing s3a// instead of s3:// ? > Probably because of what's said about s3:// and s3n:// here (which is why I use s3a://): https://wiki.apache.org/hadoop/AmazonS3 Regards, James > Besides that you can increase s3 speeds using the instructions mentioned > here: > https://aws.amazon.com/blogs/aws/a

Will nested field performance improve?

2016-04-15 Thread James Aley
ress that in our ETL pipeline with a flattening step. If it's a known issue that we expect will be fixed in upcoming releases, I'll hold off. Any advice greatly appreciated! Thanks, James.

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.com>

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
tegoricalFeatures, numClasses, numFeatures) > > } > > > def toOld(newModel: RandomForestClassificationModel): > OldRandomForestModel = { > > newModel.toOld > > } > > } > Regards, James On 11 April 2016 at 10:36, James Hammerton <ja...@gluru.co>

Re: ML Random Forest Classifier

2016-04-11 Thread James Hammerton
not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: > Hello, > I'm trying to save a pipeline with a random forest classifier. If I try to > save the pipeline, it complains that

Logistic regression throwing errors

2016-04-01 Thread James Hammerton
use the learning to fail - f1 = 0. Anyone got any idea why this might happen? Regards, James

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-22 Thread James Hammerton
pe String). E.g. scala> "22/03/2016" < "24/02/2015" res4: Boolean = true scala> "22/03/2016" < "04/02/2015" res5: Boolean = false This is the correct result for a string comparison but it's not the compari

Re: Find all invoices more than 6 months from csv file

2016-03-22 Thread James Hammerton
e api I see an add_month(), date_add() and date_sub() methods, the first adds a number of months to a start date (would adding a -ve number of months to the current date work?), the latter two add or subtract a specified number of days to/from a date, these are available in 1.5.0 onwards. Alterna
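The add_months-with-a-negative-count idea mentioned above can be sketched as follows. The column name and the dd/MM/yyyy format are assumptions; the string column is parsed to a real date first, since string comparison gives wrong answers here:

```scala
import org.apache.spark.sql.functions._

// Parse the dd/MM/yyyy string into a date, then keep invoices dated more
// than six months before today (add_months with a negative month count).
val parsed = df.withColumn("inv_date",
  unix_timestamp($"invoice_date", "dd/MM/yyyy").cast("timestamp").cast("date"))
val overSixMonths = parsed.filter($"inv_date" < add_months(current_date(), -6))
```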

Add org.apache.spark.mllib model .predict() method to models in org.apache.spark.ml?

2016-03-22 Thread James Hammerton
used outside of Spark than the new models at this time. Are there any plans to add the .predict() method back to the models in the new API? Regards, James

Re: best way to do deep learning on spark ?

2016-03-20 Thread James Hammerton
In the meantime there is also deeplearning4j which integrates with Spark (for both Java and Scala): http://deeplearning4j.org/ Regards, James On 17 March 2016 at 02:32, Ulanov, Alexander <alexander.ula...@hpe.com> wrote: > Hi Charles, > > > > There is an implementation of

Saving the DataFrame based RandomForestClassificationModels

2016-03-18 Thread James Hammerton
s.apache.org/jira/browse/SPARK-11888 My question is whether there's a work around given that these bugs are unresolved at least until 2.0.0. Regards, James

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
on reading the events in, this should work. Anyone know definitively if this is the case? Regards, James

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-09 Thread James Hammerton
Hi Ted, Finally got round to creating this: https://issues.apache.org/jira/browse/SPARK-13773 I hope you don't mind me selecting you as the shepherd for this ticket. Regards, James On 7 March 2016 at 17:50, James Hammerton <ja...@gluru.co> wrote: > Hi Ted, > > Thanks for g

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
to choose the project. Regards, James On 7 March 2016 at 13:09, Ted Yu <yuzhih...@gmail.com> wrote: > Have you tried clicking on Create button from an existing Spark JIRA ? > e.g. > https://issues.apache.org/jira/browse/SPARK-4352 > > Once you're logged in, you should be able

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-07 Thread James Hammerton
Infrastructure. There doesn't seem to be an option for me to raise an issue for Spark?! Regards, James On 4 March 2016 at 14:03, James Hammerton <ja...@gluru.co> wrote: > Sure thing, I'll see if I can isolate this. > > Regards. > > James > > On 4 March 2016 at 12:24,

Spark reduce serialization question

2016-03-04 Thread James Jia
, which is approximately 4 * 330 MB. I know I can set the driver's max result size, but I just want to confirm that this is expected behavior. Thanks! James Stage 0:==>(1 + 3) / 4]16/02/19 05:59:28 ERROR TaskSetManager: Total s

Re: DataFrame .filter only seems to work when .cache is called in local mode in 1.6.0

2016-03-04 Thread James Hammerton
Sure thing, I'll see if I can isolate this. Regards. James On 4 March 2016 at 12:24, Ted Yu <yuzhih...@gmail.com> wrote: > If you can reproduce the following with a unit test, I suggest you open a > JIRA. > > Thanks > > On Mar 4, 2016, at 4:01 AM, James Hammerton <ja

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread James Hammerton
ons in the DataFrame by using coalesce() before saving the data. Regards, James On 1 March 2016 at 21:01, SRK <swethakasire...@gmail.com> wrote: > Hi, > > How can I control the number of parquet files getting created under a > partition? I have my sqlContext queries to c

Re: Sample sql query using pyspark

2016-03-01 Thread James Barney
Maurin, I don't know the technical reason why but: try removing the 'limit 100' part of your query. I was trying to do something similar the other week and what I found is that each executor doesn't necessarily get the same 100 rows. Joins would fail or result with a bunch of nulls when keys

Re: How could I do this algorithm in Spark?

2016-02-24 Thread James Barney
Guillermo, I think you're after an associative algorithm where A is ultimately associated with D, correct? Jakob would be correct if that is a typo--a sort would be all that is necessary in that case. I believe you're looking for something else though, if I understand correctly. This seems like a

Count job stalling at shuffle stage on 3.4TB input (but only 5.3GB shuffle write)

2016-02-23 Thread James Hammerton
; TungstenExchange hashpartitioning(objectId#0) > TungstenAggregate(key=[objectId#0], functions=[], output=[objectId#0]) > Scan > CsvRelation(,Some(s3n://gluru-research/data/events.prod.2016-02-04/extractedEventsUncompressed),false, > > ,",null,#,PERMISSIVE,COMMONS,false,false,false,StructType(StructField(objectId,StringType,true), > StructField(eventName,StringType,true), > StructField(eventJson,StringType,true), > StructField(timestampNanos,StringType,true)),false,null)[objectId#0] > > Code Generation: true > > Regards, James

Re: Is this likely to cause any problems?

2016-02-19 Thread James Hammerton
://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR. Regards, James On 19 February 2016 at 14:25, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote: > With EMR supporting Spark, I don't see much reason to use the spark-ec2 > script unless it is important for you to be ab

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I have now... So far I think the issues I've had are not related to this, but I wanted to be sure in case it should be something that needs to be patched. I've had some jobs run successfully but this warning appears in the logs. Regards, James On 18 February 2016 at 12:23, Ted Yu <yuz

Re: Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
I'm fairly new to Spark. The documentation suggests using the spark-ec2 script to launch clusters in AWS, hence I used it. Would EMR offer any advantage? Regards, James On 18 February 2016 at 14:04, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > Just out of shee

Is this likely to cause any problems?

2016-02-18 Thread James Hammerton
. Could this contribute to any problems running the jobs? Regards, James

pyspark.DataFrame.dropDuplicates

2016-02-12 Thread James Barney
to appear? Or random? I would like to guarantee that the row with the longest list itemsInPocket is kept. How can I do that? Thanks, James
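dropDuplicates keeps an arbitrary row per key, so guaranteeing "the row with the longest itemsInPocket list wins" needs an explicit ranking. A sketch (in Scala rather than the thread's PySpark, and the grouping key "id" is an assumption):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each key by array length, longest first, and keep rank 1.
val w = Window.partitionBy($"id").orderBy(size($"itemsInPocket").desc)
val kept = df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
```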

Re: Extract all the values from describe

2016-02-08 Thread James Barney
Hi Arunkumar, From the scala documentation it's recommended to use the agg function for performing any actual statistics programmatically on your data. df.describe() is meant only for data exploration. See Aggregator here:

Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
Hi! My name is James, and I’m working on a question there doesn’t seem to be a lot of answers about online. I was hoping spark/hadoop gurus could shed some light on this. I have a data feed on NFS that looks like /foobar/.gz Currently I have a spark scala job that calls sparkContext.textFile
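Hadoop glob syntax in the path handles fixed nesting depths, and several depths can be unioned; a hedged sketch (directory layout is illustrative, based on the question's NFS feed):

```scala
// One "*" per directory level; Hadoop globs also support {a,b} and [0-9] classes.
val oneLevel = sc.textFile("/foobar/*/*.gz")

// Arbitrary-but-bounded depth: union several fixed-depth globs.
val multi = sc.union(
  sc.textFile("/foobar/*.gz"),
  sc.textFile("/foobar/*/*.gz"),
  sc.textFile("/foobar/*/*/*.gz")
)
```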

Re: Recursive nested wildcard directory walking in Spark

2015-12-09 Thread James Ding
/*.gz). Any thoughts or workarounds? I’m considering using bash globbing to match files recursively and feed hundreds of thousands of arguments to spark-submit. Reasons for/against? From: Ted Yu <yuzhih...@gmail.com> Date: Wednesday, December 9, 2015 at 3:50 PM To: James Ding <jd...@pal

Re: Setting executors per worker - Standalone

2015-09-29 Thread James Pirz
park-graphx-in-action > > > > > > On 29 Sep 2015, at 04:47, James Pirz <james.p...@gmail.com> wrote: > > Thanks for your reply. > > Setting it as > > --conf spark.executor.cores=1 > > when I start spark-shell (as an example application) indeed sets the

Re: Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
er worker since you > have 4 cores per worker > > > > On Tue, Sep 29, 2015 at 8:24 AM, James Pirz <james.p...@gmail.com> wrote: > >> Hi, >> >> I am using speak 1.5 (standalone mode) on a cluster with 10 nodes while >> each machine has 12GB of RAM and

Setting executors per worker - Standalone

2015-09-28 Thread James Pirz
Hi, I am using Spark 1.5 (standalone mode) on a cluster with 10 nodes while each machine has 12GB of RAM and 4 cores. On each machine I have one worker which is running one executor that grabs all 4 cores. I am interested to check the performance with "one worker but 4 executors per machine -

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
to reclaim this memory? Should those arrays be GC'ed when jobs finish? Any guidance greatly appreciated. Many thanks, James.

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
the issue. The equivalent code from Scala seems to work fine for me. Is anyone else seeing this problem? For us, the attached code fails every time on Spark 1.4.1 Thanks, James

RE: Feedback: Feature request

2015-08-28 Thread Murphy, James
This is great and much appreciated. Thank you. - Jim From: Manish Amde [mailto:manish...@gmail.com] Sent: Friday, August 28, 2015 9:20 AM To: Cody Koeninger Cc: Murphy, James; user@spark.apache.org; d...@spark.apache.org Subject: Re: Feedback: Feature request Sounds good. It's a request I have

Feedback: Feature request

2015-08-26 Thread Murphy, James
Hey all, In working with the DecisionTree classifier, I found it difficult to extract rules that could easily facilitate visualization with libraries like D3. So for example, using : print(model.toDebugString()), I get the following result = If (feature 0 = -35.0) If (feature 24 = 176.0)

Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes. Using Spark sql and Hive Context, I am trying to run a simple scan query on an existing Hive table (which is an external table consisting of rows in text files stored in HDFS - it is NOT parquet, ORC or any other richer

Re: worker and executor memory

2015-08-14 Thread James Pirz
are scheduled that way, as it is a map-only job and reading can happen in parallel. On Thu, Aug 13, 2015 at 9:10 PM, James Pirz james.p...@gmail.com wrote: Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple

worker and executor memory

2015-08-13 Thread James Pirz
Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables),

Re: SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
Hi, The issue only seems to happen when trying to access spark via the SparkSQL Thrift Server interface. Does anyone know a fix? james From: Wu, Walt Disney james.c...@disney.com Date: Friday, August 7, 2015 at 12:40 PM To: user@spark.apache.org

SparkSQL: remove jar added by add jar command from dependencies

2015-08-07 Thread Wu, James C.
the jar removed from the dependencies so that It is not blocking all my spark sql queries for all sessions. Thanks, James

SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
Hi, I got into a situation where a prior add jar command causes Spark SQL to stop working for all users. Does anyone know how to fix the issue? Regards, james From: Wu, Walt Disney james.c...@disney.com Date: Friday, August 7, 2015 at 10:29 AM To: user

[POWERED BY] Please add our organization

2015-07-24 Thread Baxter, James
quality analysis and data exploration. Regards James Baxter Technology and Innovation Analyst ISS Woodside Energy Ltd. Woodside Plaza 240 St Georges Terrace Perth WA 6000 Australia T: +61 8 9348 4218 F: +61 8 9348 6561 E: james.bax...@woodside.com.au NOTICE

Streaming: updating broadcast variables

2015-07-03 Thread James Cole
and DStreams? Thanks, James
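Broadcast variables are immutable, so the pattern usually suggested for this (an assumption here, not quoted from the thread) is a driver-side wrapper that re-broadcasts when its data is stale, refreshed once per batch from foreachRDD:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Re-broadcasts load() at most every ttlMs; called only on the driver.
class RefreshableBroadcast[T: scala.reflect.ClassTag](load: () => T, ttlMs: Long) {
  private var bc: Broadcast[T] = _
  private var loadedAt = 0L

  def get(sc: SparkContext): Broadcast[T] = synchronized {
    if (bc == null || System.currentTimeMillis() - loadedAt > ttlMs) {
      if (bc != null) bc.unpersist()          // drop the stale copy on executors
      bc = sc.broadcast(load())
      loadedAt = System.currentTimeMillis()
    }
    bc
  }
}

// Driver side, once per micro-batch (stream and key field are hypothetical):
dstream.foreachRDD { rdd =>
  val filters = refreshable.get(rdd.sparkContext)
  rdd.filter(r => filters.value.contains(r.key)).count()
}
```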

Re: Help optimising Spark SQL query

2015-06-30 Thread James Aley
to go sifting through. Turns out we're already writing our data as type/timestamp/parquet file, we just missed the date= naming convention - d'oh! At least that means a fairly simple rename script should get us out of trouble! Appreciate everyone's tips, thanks again! James. On 23 June 2015 at 17

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
to have made any remarkable difference in running time for the query. I'll hook up YourKit and see if we can figure out where the CPU time is going, then post back. On 22 June 2015 at 16:01, Yin Huai yh...@databricks.com wrote: Hi James, Maybe it's the DISTINCT causing the issue. I rewrote

Help optimising Spark SQL query

2015-06-22 Thread James Aley
similar issues. Many thanks, James.

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion then report back. On 22 June 2015 at 12:37, Matthew Johnson matt.john...@algomi.com wrote: Hi James, What version of Spark are you using? In Spark

Re: Optimisation advice for Avro-Parquet merge job

2015-06-12 Thread James Aley
Hey Kiran, Thanks very much for the response. I left for vacation before I could try this out, but I'll experiment once I get back and let you know how it goes. Thanks! James. On 8 June 2015 at 12:34, kiran lonikar loni...@gmail.com wrote: It turns out my assumption on load and unionAll

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
to communicate with Hive metastore. So your program need to instantiate a `org.apache.spark.sql.hive.HiveContext` instead. Cheng On 6/10/15 10:19 AM, James Pirz wrote: I am using Spark (standalone) to run queries (from a remote client) against data in tables that are already defined/loaded

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
to connect to hive, which should work even without spark. Best Ayan On Tue, Jun 9, 2015 at 10:42 AM, James Pirz james.p...@gmail.com wrote: Thanks for the help! I am actually trying Spark SQL to run queries against tables that I've defined in Hive. I follow theses steps: - I start

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
a query file with -f flag). Looking at the Spark SQL documentation, it seems that it is possible. Please correct me if I am wrong. On Mon, Jun 8, 2015 at 6:56 PM, Cheng Lian lian.cs@gmail.com wrote: On 6/9/15 8:42 AM, James Pirz wrote: Thanks for the help! I am actually trying Spark SQL to run

spark-submit does not use hive-site.xml

2015-06-09 Thread James Pirz
I am using Spark (standalone) to run queries (from a remote client) against data in tables that are already defined/loaded in Hive. I have started metastore service in Hive successfully, and by putting hive-site.xml, with proper metastore.uri, in $SPARK_HOME/conf directory, I tried to share its
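The resolution quoted in the reply above is to instantiate a HiveContext, which reads $SPARK_HOME/conf/hive-site.xml and talks to the metastore, where a plain SQLContext would not. A minimal sketch (app name and table are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-query"))
// HiveContext picks up hive-site.xml from the conf dir and can see
// tables already defined in the Hive metastore.
val hiveCtx = new HiveContext(sc)
hiveCtx.sql("SELECT * FROM my_table LIMIT 10").show()
```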

Re: Running SparkSql against Hive tables

2015-06-08 Thread James Pirz
that would be highly appreciated. Thnx On Sun, Jun 7, 2015 at 6:39 AM, Cheng Lian lian.cs@gmail.com wrote: On 6/6/15 9:06 AM, James Pirz wrote: I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 'Spark SQL' to run some SQL scripts, on the cluster. I realized
