Sources/V2 DatasourceV2 in Spark 3.*

2022-06-28 Thread Bigg Ben
backwards compatibility with Spark 2.4 while upgrading to Spark 3.2? Or is there any documentation of those changes, or can someone point me in the right direction? Regards Ben

How do I read parquet with python object

2022-05-09 Thread ben
# python:
import pandas as pd
a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b'])
a.to_parquet('a.parquet')
# pyspark:
d2 = spark.read.parquet('a.parquet')
will return error: An error was encountered: An error occurred while calling o277.showString. : org.apache.spark.SparkException: Job
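For anyone hitting the same thing, a minimal way to inspect what actually lands in that file is sketched below (this assumes pandas with pyarrow installed and an existing SparkSession named spark; it only diagnoses the schema, it is not a fix):

    # recreate the file from the snippet above
    import pandas as pd
    import pyarrow.parquet as pq

    pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b']).to_parquet('a.parquet')

    print(pq.read_schema('a.parquet'))    # how pyarrow serialized the list column
    d2 = spark.read.parquet('a.parquet')
    d2.printSchema()                      # how Spark maps it (ideally b: array<double>)
    d2.show(truncate=False)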

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
upedParts => spark.read.parquet(groupedParts: _*)) val finalDF = dfs.seq.grouped(100).toList.par.map(dfgroup => dfgroup.reduce(_ union _)).reduce(_ union _).coalesce(2000) From: Ben Kaylor Date: Tuesday, March 16, 2021 at 3:23 PM To: Bo
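Roughly the same grouped-read-and-union idea in PySpark, for reference (a sketch: grouped_paths is an assumed list of lists of parquet paths built from the S3 listing, and spark is an existing SparkSession):

    from functools import reduce

    # read each group of paths as one DataFrame, then union them all
    dfs = [spark.read.parquet(*paths) for paths in grouped_paths]
    final_df = reduce(lambda a, b: a.union(b), dfs).coalesce(2000)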

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
P.S.: 3. If fast updates are required, one way would be capturing S3 events & putting the paths/modification dates/etc of the paths into DynamoDB/your DB of choice. From: Boris Litvak Sent: Tuesday, 16 March 2021 9:03 To: Ben Kaylor ;

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Ben Kaylor
Not sure of the answer to this, but I am solving similar issues, so I'm looking for additional feedback on how to do this. My thought: if unable to do it via Spark and S3 boto commands, then have apps self-report those changes, where instead of having just mappers discovering the keys, you have services self

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-21 Thread Ben Smith
I can also recreate with the very latest master branch (3.1.0-SNAPSHOT) if I compile it locally

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-20 Thread Ben Smith
Thanks for that. I have played with this a bit more after your feedback and found: I can only recreate the problem with python 3.6+. If I change between python 2.7, python 3.6 and python 3.7 I find that the problem occurs in the python 3.6 and 3.7 case but not in the python 2.7. - I have used

Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-17 Thread Ben Smith
Hi, I am having an issue that looks like a potentially serious bug with Spark 2.4.3 as it impacts data accuracy. I have searched in the Spark Jira and mail lists as best I can and cannot find reference to anyone else having this issue. I am not sure if this would be suitable for raising as a bug

Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-12 Thread Ben Roling
I've noticed that DataSet.sqlContext is public in Scala but the equivalent (DataFrame._sc) in PySpark is named as if it should be treated as private. Is this intentional? If so, what's the rationale? If not, then it feels like a bug and DataFrame should have some form of public access back to

Start a standalone server as root and use it with user accounts

2020-01-28 Thread Ben Caine
s root and run a job on it from a user account. Thanks, Ben

K8S spark submit for spark 2.4

2019-05-05 Thread Ben Chukwumobi (CONT)
VICEACCOUNTNAME)) .setConf(Constants.SPARK_DRIVER_IAM_ROLE_ARN_KEY, config.get[String]("k8s.submitter.spark.driver.iam.role.arn")) .setConf(Constants.SPARK_EXECUTOR_IAM_ROLE_ARN_KEY, config.get[String]("k8s.submitter.spark.executor.iam.role.arn")) .addAppArgs(a

Re: Dataframe multiple joins with same dataframe not able to resolve correct join columns

2018-07-11 Thread Ben White
Sounds like the same root cause as SPARK-14948 or SPARK-10925. A workaround is to "clone" df3 like this: val df3clone = df3.toDF(df.schema.fieldNames:_*) Then use df3clone in place of df3 in the second join. On Wed, Jul 11, 2018 at 2:52 PM Nirav Patel wrote: > I am trying to joind df1 with
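A small runnable PySpark sketch of the same workaround (the thread's snippet is Scala; the tables and join key here are invented just to show the clone step):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("clone-workaround").getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v1"])
    df3 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "v3"])

    first = df1.join(df3, "id")

    # toDF() rebuilds df3's columns with fresh expression IDs, so the second
    # join no longer sees two identical column references from the same lineage
    df3_clone = df3.toDF(*df3.columns)
    second = first.join(df3_clone, first["id"] == df3_clone["id"])
    second.show()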

Re: [External] Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-04 Thread Ben Teeuwen
es.apache.org/jira/browse/SPARK-19031 In the mean time you could try implementing your own Source, but that is pretty low level and is not yet a stable

Parallel dynamic partitioning producing duplicated data

2016-11-30 Thread Mehdi Ben Haj Abbes
Hi Folks, I have a spark job reading a csv file into a dataframe. I register that dataframe as a tempTable, then I’m writing that dataframe/tempTable to a hive external table (using parquet format for storage). I’m using this kind of command: hiveContext.sql("INSERT INTO TABLE t
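For context, a sketch of the kind of dynamic-partition insert being described (the table and column names are invented, since the actual SQL is truncated above; hiveContext and the dataframe df are the ones from the message):

    # allow Hive to create partitions from the data itself
    hiveContext.setConf("hive.exec.dynamic.partition", "true")
    hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    df.registerTempTable("tmp_src")
    hiveContext.sql(
        "INSERT INTO TABLE target_table PARTITION (dt) "
        "SELECT col1, col2, dt FROM tmp_src")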

pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Ben Leslie
to regenerating from the parent RDD, rather than providing the wrong answer. Is this the expected behaviour? It seems a little difficult to work with in practice. Cheers, Ben
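A tiny sketch of the two storage levels being compared (toy data; with MEMORY_ONLY, evicted partitions are recomputed from the parent RDD, while MEMORY_AND_DISK spills them to disk instead):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[2]", "persist-levels")
    rdd = sc.parallelize(range(10**6)).map(lambda x: x * x)

    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill instead of recompute
    # rdd.persist(StorageLevel.MEMORY_ONLY)     # recompute evicted partitions
    print(rdd.count())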

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
n mappings_bc[column]: key = unicode(x) return Vectors.sparse(len(mappings_bc[column]), [mappings_bc[column][key]], [1.0]) Ben On Aug 19, 2016, at 11:34 PM, Davies Liu <dav...@databricks.com> wrote: The OOM happens in the driver, you may also need more memory for the driver.

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
mpares. By the way, the feature size you select for the hasher should be a power of 2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are evenly distributed (see the section on HashingTF under http://spark.apache.org/docs/latest/ml-features.html#t
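A minimal PySpark sketch of the hashing alternative suggested above (the column name and feature size are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import HashingTF

    spark = SparkSession.builder.master("local[2]").appName("hashing-example").getOrCreate()
    df = spark.createDataFrame([("user_1",), ("user_2",)], ["category"])

    # HashingTF expects an array column, so wrap the single value in an array
    df2 = df.withColumn("tokens", F.array("category"))
    hasher = HashingTF(inputCol="tokens", outputCol="features", numFeatures=2**24)
    hasher.transform(df2).show(truncate=False)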

Re: Spark join and large temp files

2016-08-11 Thread Ben Teeuwen
. On Aug 11, 2016, at 9:48 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi Ben, and that will take care of skewed data? Gourav On Thu, Aug 11, 2016 at 8:41 PM, Ben Teeuwen <bteeu...@gmail.com>

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
# some logging to confirm the indexes. logging.info("Missing value = {}".format(mappings[c]['missing'])) return max_index, mappings I’d love to see the StringIndexer + OneHotEncoder transformers cope with missing values during prediction; for now I’ll work with the ha

Re: Spark join and large temp files

2016-08-11 Thread Ben Teeuwen
When you read both 'a' and 'b', can you try repartitioning both by column 'id'? If you .cache() and .count() to force a shuffle, it'll push the records that will be joined to the same executors. So: a = spark.read.parquet('path_to_table_a').repartition('id').cache() a.count() b =
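Filled out a bit, the suggested pattern looks like this (a sketch; the paths and the 'id' join key are placeholders from the message, and the .count() calls exist only to force the shuffle up front):

    a = spark.read.parquet('path_to_table_a').repartition('id').cache()
    a.count()
    b = spark.read.parquet('path_to_table_b').repartition('id').cache()
    b.count()

    # both sides are now hash-partitioned on 'id', so the join shuffles less
    joined = a.join(b, 'id')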

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
o I’m curious what to use instead. On Aug 4, 2016, at 3:54 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: Have you looked at pyspark.sql.functions.udf and the associated examples? On Aug 4, 2016 (Thu) at 9:10 AM, Ben Teeuwen <bteeu...@gmail.com>

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
-2 status: ____ After 40 minutes or so, with no activity in the application master, it dies. Ben On Aug 4, 2016, at 12:14 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: Hi Ben, Perhaps with this size cardinality it is worth looking at featur

registering udf to use in spark.sql('select...

2016-08-04 Thread Ben Teeuwen
y see registerFunction in the deprecated sqlContext at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html. As the ‘spark’ object unifies hiveContext and sqlContext, what is the new way to go? Ben
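For reference, a minimal sketch of registering a UDF through the unified spark session in 2.0 (the function name and logic are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master("local[2]").appName("udf-register").getOrCreate()

    def str_len(s):
        return len(s) if s is not None else 0

    # makes the function callable from spark.sql(...) queries
    spark.udf.register("str_len", str_len, IntegerType())
    spark.sql("SELECT str_len('hello') AS n").show()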

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.sca

Re: how to debug spark app?

2016-08-04 Thread Ben Teeuwen
Related question: what are good profiling tools, other than watching the application master while the code runs? Are there things that can be logged during the run? If I have, say, 2 ways of accomplishing the same thing and I want to learn about the time/memory/general resource blocking

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows + 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G

Materializing mapWithState .stateSnapshot() after ssc.stop

2016-07-28 Thread Ben Teeuwen
ate-statesnapshots-to-database-for-later-resume-of-spark-st>. The context is to be able to pause and resume a streaming app without losing the state information. So I want to save and reload (initialize) the state snapshot. Has anyone already been able to do this? Thanks, Ben

Re: How do i get a spark instance to use my log4j properties

2016-04-15 Thread Demi Ben-Ari
I use maven and am VERY RELUCTANT to edit the spark jars. I know this point has been discussed here before but I do not see a clean answer. -- Best regards, Demi Ben-Ari <http://il.linkedin.com/in/demibenari> Entrepreneur @ Stealth Mode Startup Twitter: @demibenari <https://twitter.com/demibenari>

Analyzing json Data streams using sparkSQL in spark streaming returns java.lang.ClassNotFoundException

2016-03-08 Thread Nesrine BEN MUSTAPHA
Hello, I tried to use sparkSQL to analyse json data streams within a standalone application. Here is the code snippet that receives the streaming data: final JavaReceiverInputDStream lines = streamCtx.socketTextStream("localhost", Integer.parseInt(args[0]), StorageLevel.MEMORY_AND_DISK_SER_2());

local class incompatible: stream classdesc

2016-03-01 Thread Nesrine BEN MUSTAPHA
Hi, I installed a standalone spark cluster with two workers. I developed a Java application that uses the maven dependency of spark (same version as the spark cluster). In my Spark jobs class I have only two methods, considered as two different jobs: the first one is the example of spark word

Re: equalTo isin not working as expected with a constructed column with DataFrames

2016-02-18 Thread Mehdi Ben Haj Abbes
Hi, I forgot to mention that I'm using the 1.5.1 version. Regards, On Thu, Feb 18, 2016 at 4:20 PM, Mehdi Ben Haj Abbes <mehdi.ab...@gmail.com> wrote: Hi folks, I have a DataFrame with, let's say, this schema: dealId, ptf, ts; from it I derive another

Re: Number of batches in the Streaming Statics visualization screen

2016-01-29 Thread Mehdi Ben Haj Abbes
Regards, - Terry On Fri, Jan 29, 2016 at 5:45 PM, Mehdi Ben Haj Abbes <mehdi.ab...@gmail.com> wrote: Hi folks, I have a streaming job running for more than 24 hours. It seems that there is a limit on the number of the b

Number of batches in the Streaming Statics visualization screen

2016-01-29 Thread Mehdi Ben Haj Abbes
Saturday. I will have the batches that have run in the previous 24 hours, and today it was like only the previous 3 hours. Any help will be much appreciated. -- Mehdi BEN HAJ ABBES

Re: Spark on YARN multitenancy

2015-12-15 Thread Ben Roling
I'm curious to see the feedback others will provide. My impression is the only way to get Spark to give up resources while it is idle would be to use the preemption feature of the scheduler you're using in YARN. When another user comes along the scheduler would preempt one or more Spark

Re: Spark on YARN multitenancy

2015-12-15 Thread Ben Roling
Oops - I meant while it is *busy* when I said while it is *idle*. On Tue, Dec 15, 2015 at 11:35 AM Ben Roling <ben.rol...@gmail.com> wrote: I'm curious to see the feedback others will provide. My impression is the only way to get Spark to give up resources while it is idle wou
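For the idle case, dynamic allocation is the usual lever; a sketch of the relevant settings (values are illustrative, and the external shuffle service has to be running on the YARN NodeManagers for executors to be released safely):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("dynamic-allocation-example")
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "0")
            .set("spark.dynamicAllocation.maxExecutors", "50")
            .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
            .set("spark.shuffle.service.enabled", "true"))
    sc = SparkContext(conf=conf)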

spark shared RDD

2015-11-10 Thread Ben
Hi, After reading some documentations about spark and ignite, I am wondering if shared RDD from ignite can be used to share data in memory without any duplication between multiple spark jobs. Running on mesos I can collocate them, but will this be enough to avoid memory duplication or not? I am

spark shared RDD

2015-11-09 Thread Ben
also confused by Tachyon usage compared to apache ignite, which seems to be overlapping at some points. Thanks for your help. Regards Ben

Poor use cases for Spark

2015-10-21 Thread Ben Thompson
-- some graph problems ( http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html) This is not in any way an attack on Spark. It's an amazing tool that does its job very well. I'm just curious where it starts breaking down. Let me know if you have any experiences! Thanks very much, Ben

Re: Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Ben Tucker
Hi Maximo — This is a relatively naive answer, but I would consider structuring the RDD into a DataFrame, then saving the 'splits' using something like DataFrame.write.partitionBy('client').parquet(hdfs_path). You could then read a DataFrame from each resulting parquet directory and do your
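A runnable PySpark sketch of that suggestion (written against the current DataFrame writer API; the client column, values, and output path are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("per-client-splits").getOrCreate()
    df = spark.createDataFrame(
        [("c1", 1.0), ("c1", 2.0), ("c2", 3.0)], ["client", "feature"])

    # one sub-directory per distinct client value, e.g. .../client=c1/
    df.write.mode("overwrite").partitionBy("client").parquet("/tmp/client_splits")

    # read a single client's split back to train its classifier
    c1 = spark.read.parquet("/tmp/client_splits/client=c1")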

SparkR dataFrame read.df fails to read from aws s3

2015-07-08 Thread Ben Spark
I have Spark 1.4 deployed on AWS EMR, but the SparkR dataFrame read.df method cannot load data from AWS S3. 1) read.df error message: read.df(sqlContext, "s3://some-bucket/some.json", "json") 15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed

Re: Worker is KILLED for no reason

2015-06-24 Thread Demi Ben-Ari

foreach vs foreachPartitions

2015-05-21 Thread ben
I would like to know if foreachPartitions will result in better performance, due to a higher level of parallelism, compared to the foreach method, considering the case in which I'm flowing through an RDD in order to perform some sums into an accumulator variable. Thank you, Beniamino.
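A toy sketch of the two variants (note the method is foreachPartition; its usual win is amortizing per-partition setup cost rather than extra parallelism, since both run one task per partition):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "foreach-vs-foreachPartition")
    rdd = sc.parallelize(range(1000), numSlices=4)
    acc = sc.accumulator(0)

    # per-element: one function call per record
    rdd.foreach(lambda x: acc.add(x))

    # per-partition: one function call per partition, summing locally first
    def sum_partition(it):
        acc.add(sum(it))
    rdd.foreachPartition(sum_partition)

    print(acc.value)  # both passes contributed, so 2 * sum(range(1000))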

foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread ben
Hi, everybody. There are some cases in which I can obtain the same results by using the mapPartitions and the foreach method. For example, in a typical MapReduce approach one would perform a reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of tuple

Re: Matei Zaharia: Reddit Ask Me Anything

2015-04-03 Thread ben lorica
Happening right now https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/

Matei Zaharia: Reddit Ask Me Anything

2015-04-02 Thread ben lorica
*Ask Me Anything about Apache Spark big data* Reddit AMA with Matei Zaharia Friday, April 3 at 9AM PT/ 12PM ET Details can be found here: http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama

Passing Spark Configuration from Driver (Master) to all of the Slave nodes

2014-12-12 Thread Demi Ben-Ari
/12/spark-configuration-mess-solved.html -- Enjoy, Demi Ben-Ari Senior Software Engineer Windward LTD.

BUG in spark-ec2 script (--ebs-vol-size) and workaround...

2014-07-18 Thread Ben Horner
, and it's incredibly helpful. I hope this post helps towards fixing the problem! Thanks, -Ben P.S. This is the full initial command I used, in case this is isolated to particular instance types or anything: ec2/spark-ec2 -k ... -i ... -z us-east-1d -s 4 -t m3.2xlarge --ebs-vol-size=250 -m r3.2xlarge

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
Should I take it from the lack of replies that the --ebs-vol-size feature doesn't work? -Ben

Re: Trouble with spark-ec2 script: --ebs-vol-size

2014-07-16 Thread Ben Horner
please add From: Ben Horner [via Apache Spark User List] <ml-node+s1001560n9934...@n3.nabble.com> Date: Wednesday, July 16, 2014 at 8:47 AM To: Ben Horner <ben.hor...@atigeo.com> Subject: Re: Trouble with spark-ec2 script

Trouble with spark-ec2 script: --ebs-vol-size

2014-07-14 Thread Ben Horner
Hello, I'm using the spark-0.9.1-bin-hadoop1 distribution, and the ec2/spark-ec2 script within it to spin up a cluster. I tried running my processing just using the default (ephemeral) HDFS configuration, but my job errored out, saying that there was no space left. So now I'm trying to increase