Can I maintain backwards compatibility with Spark 2.4 while upgrading to Spark
3.2? Or is there any documentation of those changes, or can someone point me in
the right direction?
Regards
Ben
# python: write a DataFrame with a list-valued column to parquet
import pandas as pd
a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b'])
a.to_parquet('a.parquet')

# pyspark: read the file back (the error below appears when the result is displayed)
d2 = spark.read.parquet('a.parquet')

This returns the error:
An error was encountered: An error occurred while calling o277.showString. :
org.apache.spark.SparkException: Job
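As a first check (a minimal sketch, assuming pandas/pyarrow are available in the
same environment), it can help to confirm that the file is readable outside
Spark, which narrows the problem down to Spark's parquet reader:

# python: read the file back with pandas to verify it is valid parquet
import pandas as pd
df_check = pd.read_parquet('a.parquet')
print(df_check.dtypes)   # column 'b' should come back as an object column of lists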
rouped(100).toList.par.map(groupedParts =>
> spark.read.parquet(groupedParts: _*))
>
> val finalDF = dfs.seq.grouped(100).toList.par.map(dfgroup =>
> dfgroup.reduce(_ union _)).reduce(_ union _).coalesce(2000)
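For readers following along in PySpark, a rough equivalent of the grouped
read-and-union pattern quoted above might look like the sketch below (the paths,
group size, and thread count are hypothetical, and it assumes every file shares
one schema):

from functools import reduce
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import DataFrame

paths = ["s3://bucket/table/part-{:05d}".format(i) for i in range(1000)]  # hypothetical
groups = [paths[i:i + 100] for i in range(0, len(paths), 100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # each task reads one group of paths as a single DataFrame
    dfs = list(pool.map(lambda g: spark.read.parquet(*g), groups))

final_df = reduce(DataFrame.union, dfs).coalesce(2000)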
>
>
>
> *From: *Ben Kaylor
> *Date: *Tuesday, March 16, 2021 at 3:2
:
> P.S.: 3. If fast updates are required, one way would be capturing S3
> events & putting the paths/modifications dates/etc of the paths into
> DynamoDB/your DB of choice.
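As an illustration of that suggestion (a hedged sketch only; the table name
"s3_object_index" and the Lambda-style wiring are assumptions, not something
from the thread), capturing S3 event notifications and recording the modified
keys could look roughly like this with boto3:

import boto3

table = boto3.resource("dynamodb").Table("s3_object_index")  # hypothetical table

def handler(event, context):
    # invoked by an S3 event notification, e.g. through AWS Lambda
    for record in event.get("Records", []):
        s3 = record["s3"]
        table.put_item(Item={
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "event_time": record["eventTime"],
            "event_name": record["eventName"],
        })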
>
>
>
> *From:* Boris Litvak
> *Sent:* Tuesday, 16 March 2021 9:03
> *To:* Ben Kaylor ;
Not sure of the answer to this, but I am solving similar issues, so I'm looking
for additional feedback on how to do this.
My thought is that if it can't be done via Spark and S3 boto commands, then have
apps self-report those changes: instead of having just mappers discovering
the keys, you have services self
I can also recreate with the very latest master branch (3.1.0-SNAPSHOT) if I
compile it locally
Thanks for that. I have played with this a bit more after your feedback and
found:
I can only recreate the problem with Python 3.6+. If I switch between Python
2.7, Python 3.6 and Python 3.7, the problem occurs in the Python 3.6 and 3.7
cases but not in Python 2.7.
- I have used mini
Hi,
I am having an issue that looks like a potentially serious bug with Spark
2.4.3, as it impacts data accuracy. I have searched the Spark JIRA and
mailing lists as best I can and cannot find any reference to anyone else having
this issue. I am not sure if this would be suitable for raising as a bug i
I've noticed that DataSet.sqlContext is public in Scala but the equivalent
(DataFrame._sc) in PySpark is named as if it should be treated as private.
Is this intentional? If so, what's the rationale? If not, then it feels
like a bug and DataFrame should have some form of public access back to th
rver as root and run a job on it from a user account.
Thanks,
Ben
R_SERVICEACCOUNTNAME))
  .setConf(Constants.SPARK_DRIVER_IAM_ROLE_ARN_KEY,
    config.get[String]("k8s.submitter.spark.driver.iam.role.arn"))
  .setConf(Constants.SPARK_EXECUTOR_IAM_ROLE_ARN_KEY,
    config.get[String]("k8s.submitter.spark.executor.iam.role.arn"))
  .addAppA
Sounds like the same root cause as SPARK-14948 or SPARK-10925.
A workaround is to "clone" df3 like this:
val df3clone = df3.toDF(df.schema.fieldNames:_*)
Then use df3clone in place of df3 in the second join.
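For PySpark users hitting the same issue on a re-joined DataFrame, a hedged
sketch of the equivalent trick (df1, df3 and the join key "id" are placeholders
here) would be:

# "clone" df3 so the second join sees a fresh plan with the same column names
df3_clone = df3.toDF(*df3.columns)
result = df1.join(df3_clone, "id")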
On Wed, Jul 11, 2018 at 2:52 PM Nirav Patel wrote:
> I am trying to join df1 with
s
>>>>>>>>>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>>>>>>>>>
>>>>>>>>>> In the mean time you could try implementing your own Source
Hi Folks,
I have a Spark job reading a CSV file into a dataframe. I register that
dataframe as a tempTable, then I write that dataframe/tempTable to a Hive
external table (using parquet format for storage).
I'm using this kind of command:
hiveContext.sql("INSERT INTO TABLE t PARTITION(statPa
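The command above is truncated in the archive; a hedged reconstruction of the
general pattern follows (the table name t, the partition column stat_part, the
input path and the column names are all assumptions, not from the thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# dynamic partition inserts into Hive tables usually need these settings
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("tmp_input")

# the partition column goes last in the SELECT for dynamic partitioning
spark.sql("""
  INSERT INTO TABLE t PARTITION (stat_part)
  SELECT col_a, col_b, stat_part FROM tmp_input
""")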
in this case it
would fall back to regenerating from the parent RDD, rather than
providing the wrong answer.
Is this the expected behaviour? It seems a little difficult to work
with in practice.
Cheers,
Ben
n mappings_bc[column]:
key = unicode(x)
return Vectors.sparse(len(mappings_bc[column]), [mappings_bc[column][key]], [1.0])
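The snippet above is cut off at the top; a self-contained sketch of the same
broadcast-mapping idea (the column name "city", the vector size, and the
handling of unseen values are assumptions here, not from the thread):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

values = [r[0] for r in df.select("city").distinct().collect()]
mapping = {v: i for i, v in enumerate(values)}
mapping_bc = spark.sparkContext.broadcast(mapping)

def one_hot(value):
    m = mapping_bc.value
    size = len(m) + 1                 # reserve the last index for unseen values
    return Vectors.sparse(size, [m.get(value, size - 1)], [1.0])

df = df.withColumn("city_vec", udf(one_hot, VectorUDT())("city"))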
Ben
> On Aug 19, 2016, at 11:34 PM, Davies Liu wrote:
>
> The OOM happens in the driver, so you may also need more memory for the driver.
>
> On Fri, Aug 19, 2016 at
d to see how it compares.
>
> By the way, the feature size you select for the hasher should be a power of 2
> (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are
> evenly distributed (see the section on HashingTF under
> http://spark.apache.org/docs/latest/ml
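For concreteness, a minimal PySpark sketch of the hashing approach (the column
names are placeholders; 2**24 just follows the power-of-two advice above):

from pyspark.ml.feature import HashingTF
from pyspark.sql.functions import array, col

# HashingTF expects an array column, so wrap the single categorical value
df2 = df.withColumn("cat_arr", array(col("category").cast("string")))
hasher = HashingTF(inputCol="cat_arr", outputCol="features", numFeatures=2**24)
hashed = hasher.transform(df2)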
.
> On Aug 11, 2016, at 9:48 PM, Gourav Sengupta
> wrote:
>
> Hi Ben,
>
> and that will take care of skewed data?
>
> Gourav
>
> On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen wrote:
> When you read both ‘a’ and ‘b', c
logging.info("{} min: {} max: {}".format(c, min(mappings[c].values()), max(mappings[c].values())))  # some logging to confirm the indexes
logging.info("Missing value = {}".format(mappings[c]['missing']))
return max_index, mappings
I’d love to see the StringIndexe
When you read both ‘a’ and ‘b', can you try repartitioning both by column ‘id’?
If you .cache() and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.pa
, right? So I’m curious what to use
instead.
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:
> Hi,
>
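For reference, a minimal example of pyspark.sql.functions.udf (the column and
function names below are made up for illustration):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df = df.withColumn("name_upper", upper_udf("name"))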
-2 status:
After 40 minutes or so, with no activity in the application master, it dies.
Ben
> On Aug 4, 2016, at 12:14 PM, Nick Pentreath wrote:
>
> Hi Ben
>
> Perhaps with this size cardinality it is worth looking at feature hashing for
> your probl
ion? I only see registerFunction in the deprecated
sqlContext at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
As the ‘spark’ object unifies hiveContext and sqlContext, what is the new way
to go?
Ben
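If it helps: in Spark 2.x, UDF registration hangs off the unified 'spark'
session object. A minimal sketch (the function name is made up, and the exact
entry point moved around a little across early 2.x releases):

from pyspark.sql.types import IntegerType

spark.udf.register("str_len", lambda s: len(s), IntegerType())
spark.sql("SELECT str_len('hello')").show()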
at
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
at
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecu
Related question: what are good profiling tools other than watching the
application master while the code runs?
Are there things that can be logged during the run? If I have say 2 ways of
accomplishing the same thing, and I want to learn about the time/memory/general
resource blocking p
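One lightweight option (a sketch, with a made-up workload) is to tag each
variant with a job group so it is easy to spot in the application master UI,
and log wall-clock time around it:

import time

spark.sparkContext.setJobGroup("variant_a", "first way of computing the result")
start = time.time()
result_a = df.groupBy("key").count().collect()   # hypothetical workload
print("variant_a took {:.1f}s".format(time.time() - start))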
Did you check the executor logs to see whether the Kafka offsets were pulled in
evenly over the 4 executors?
I recall a similar situation with such uneven balancing from a kafka stream,
and ended up raising the amount of resources (RAM and cores). Then it nicely
balanced out. I don’t understand t
Hi,
I want to one hot encode a column containing 56 million distinct values. My
dataset is 800m rows + 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM Java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G
ate-statesnapshots-to-database-for-later-resume-of-spark-st>.
The context is to be able to pause and resume a streaming app while not losing
the state information. So I want to save and reload (initialize) the state
snapshot. Has any of you already been able to do this?
Thanks,
Ben
n VERY RELUCTANT to edit the spark jars
>
> I know this point has been discussed here before but I do not see a clean
> answer
>
--
Best regards,
Demi Ben-Ari <http://il.linkedin.com/in/demibenari>
Entrepreneur @ Stealth Mode Startup
Twitter: @demibenari <https://twitter.com/demibenari>
Hello,
I tried to use Spark SQL to analyse JSON data streams within a standalone
application.
Here is the code snippet that receives the streaming data:
final JavaReceiverInputDStream lines =
    streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]),
        StorageLevel.MEMORY_AND_DISK_SER_2());
Hi,
I installed a standalone Spark cluster with two workers. I developed a Java
application that uses the Maven dependency of Spark (same version as the
Spark cluster).
In my Spark jobs class I have only two methods, considered as two different
jobs:
the first one is the example of Spark word coun
Hi,
I forgot to mention that I'm using the 1.5.1 version.
Regards,
On Thu, Feb 18, 2016 at 4:20 PM, Mehdi Ben Haj Abbes
wrote:
> Hi folks,
>
> I have a DataFrame with, let's say, this schema:
> -dealId,
> -ptf,
> -ts
> from it I derive another dataframe (lets call i
he original columns I get what is expected as the count, which is 0.
The same goes for isin.
Any help will be more than appreciated.
Best regards,
--
Mehdi BEN HAJ ABBES
Terry
>
> On Fri, Jan 29, 2016 at 5:45 PM, Mehdi Ben Haj Abbes <
> mehdi.ab...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I have a streaming job running for more than 24 hours. It seems that
>> there is a limit on the number of the batches displayed in the Stre
Saturday. I will have the batches that ran in the previous 24 hours, and today
it was like only the previous 3 hours.
Any help will be very appreciated.
--
Mehdi BEN HAJ ABBES
I'm curious to see the feedback others will provide. My impression is the
only way to get Spark to give up resources while it is idle would be to use
the preemption feature of the scheduler you're using in YARN. When another
user comes along the scheduler would preempt one or more Spark executors
Oops - I meant while it is *busy* when I said while it is *idle*.
On Tue, Dec 15, 2015 at 11:35 AM Ben Roling wrote:
> I'm curious to see the feedback others will provide. My impression is the
> only way to get Spark to give up resources while it is idle would be to use
>
Hi,
After reading some documentation about Spark and Ignite,
I am wondering if shared RDDs from Ignite can be used to share data in
memory without any duplication between multiple Spark jobs.
Running on Mesos I can colocate them, but will this be enough to avoid
memory duplication or not?
I am also confused by Tachyon usage compared to Apache Ignite; they seem to
overlap at some points.
Thanks for your help
Regards
Ben
time
-- some graph problems (
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
This is not in any way an attack on Spark. It's an amazing tool that does
its job very well. I'm just curious where it starts breaking down. Let me
know if you have any experiences!
Thanks very much,
Ben
Hi Maximo —
This is a relatively naive answer, but I would consider structuring the RDD
into a DataFrame, then saving the 'splits' using something like
df.write.partitionBy('client').parquet(hdfs_path). You could then
read a DataFrame from each resulting parquet directory and do your
per-
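In current PySpark terms, a hedged sketch of that suggestion (the paths and the
'client' column value are placeholders):

# write one parquet directory per client
df.write.partitionBy("client").parquet("hdfs:///tmp/splits")

# later, read back a single client's slice for per-client processing
one_client = spark.read.parquet("hdfs:///tmp/splits/client=acme")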
Thanks very much Kevin.
I didn't recognize from the title (Maven issues with 1.5-RC) that this
was a 1.5 build problem.
Ben
On 8/30/15 10:28 PM, Kevin Jung [via Apache Spark User List] wrote:
> I expect it because the versions are not in the range defined in pom.xml.
> You sho
I have Spark 1.4 deployed on AWS EMR, but the SparkR read.df method cannot load
data from AWS S3.
1) "read.df" error message
read.df(sqlContext,"s3://some-bucket/some.json","json")
15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on
org.apache.spark.sql.api.r.SQLUtils failed
jav
--
Best regards,
Demi Ben-Ari <http://il.linkedin.com/in/demibenari>
Senior Software Engineer
Windward Ltd. <http://windward.eu/>
I would like to know whether foreachPartition will result in better
performance, due to a higher level of parallelism, compared to the foreach
method, considering the case in which I'm iterating through an RDD in order to
perform some sums into an accumulator variable.
Thank you,
Beniamino.
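For what it's worth, a small sketch comparing the two approaches (made-up data;
both produce the same sum, foreachPartition just updates the accumulator once
per partition instead of once per record):

rdd = spark.sparkContext.parallelize(range(1000), 8)

acc_each = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: acc_each.add(x))                    # one update per record

acc_part = spark.sparkContext.accumulator(0)
rdd.foreachPartition(lambda it: acc_part.add(sum(it)))    # one update per partition

print(acc_each.value, acc_part.value)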
Hi, everybody.
There are some cases in which I can obtain the same results by using the
mapPartitions and the foreach method.
For example, in a typical MapReduce approach one would perform a reduceByKey
immediately after a mapPartitions that transforms the original RDD into a
collection of tuples (ke
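As a tiny illustration of that MapReduce shape (made-up data and a made-up
per-record transformation):

data = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 5)], 4)

def per_partition(records):
    # transform each partition's records into (key, value) tuples
    for key, value in records:
        yield (key, value * 2)

result = data.mapPartitions(per_partition).reduceByKey(lambda x, y: x + y).collect()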
Happening right now
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
*Ask Me Anything about Apache Spark & big data*
Reddit AMA with Matei Zaharia
Friday, April 3 at 9AM PT/ 12PM ET
Details can be found here:
http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama
014/12/spark-configuration-mess-solved.html>
--
Enjoy,
Demi Ben-Ari
Senior Software Engineer
Windward LTD.
months now, and it's incredibly helpful. I hope this post helps
towards fixing the problem!
Thanks,
-Ben
P.S. This is the full initial command I used, in case this is isolated to
particular instance types or anything:
ec2/spark-ec2 -k ... -i ... -z us-east-1d -s 4 -t m3.2xlarge
--ebs-vol-si
so used df -h to see what drives were mounted, and I get the same drives
regardless of whether I specify the --ebs-vol-size parameter when I execute
the script...
Can you give me an exact command line that works for you? I will try and
execute it verbatim.
Thanks,
-Ben
please add
From: "Ben Horner [via Apache Spark User List]"
mailto:ml-node+s1001560n9934...@n3.nabble.com>>
Date: Wednesday, July 16, 2014 at 8:47 AM
To: Ben Horner
Subject: Re: Trouble with spark-ec2 script: --ebs-vol-size
Should I take it fr
Should I take it from the lack of replies that the --ebs-vol-size feature
doesn't work?
-Ben
Hello,
I'm using the spark-0.9.1-bin-hadoop1 distribution, and the ec2/spark-ec2
script within it to spin up a cluster. I tried running my processing just
using the default (ephemeral) HDFS configuration, but my job errored out,
saying that there was no space left. So now I'm trying to increase