Anyone interested in Remote Shuffle Service

2020-10-21 Thread bo yang
Hi Spark Users, Uber recently open sourced Remote Shuffle Service ( https://github.com/uber/RemoteShuffleService ). It works with open source Spark without any code changes and can store shuffle data on machines separate from the Spark executors. Anyone interested in trying it? Also we
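As a sketch of what "no code change" means in practice: the shuffle manager is swapped purely via configuration. The RssShuffleManager class name and the rss option keys below are assumptions based on the project README, so verify them against the repo before use.

    // Hedged Scala sketch: route shuffle data to a remote shuffle service
    // through configuration alone, leaving the job code untouched.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("rss-demo")
      // Replace the default sort shuffle manager (class name assumed from the README).
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
      // Registry settings for locating the shuffle servers (keys assumed from the README).
      .config("spark.shuffle.rss.serviceRegistry.type", "standalone")
      .config("spark.shuffle.rss.serviceRegistry.server", "rss-host:12222")
      .getOrCreate()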

listPartitionsByFilter Usage

2020-10-21 Thread navdeepniku
I'm trying to list partitions for a table with a filter using the function below: spark.sharedState.externalCatalog.listPartitionsByFilter(). The signature of listPartitionsByFilter in org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener is: def listPartitionsByFilter(
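For context, the full signature takes the database, the table, a sequence of catalyst Expression predicates, and a default time zone id. A minimal Scala sketch of a call (database, table, and partition column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
    import org.apache.spark.sql.types.StringType

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Catalyst predicate equivalent to: dt = '2020-10-21'
    val predicate = EqualTo(AttributeReference("dt", StringType)(), Literal("2020-10-21"))

    val parts = spark.sharedState.externalCatalog.listPartitionsByFilter(
      "mydb",                                          // hypothetical database
      "mytable",                                       // hypothetical table
      Seq(predicate),
      spark.sessionState.conf.sessionLocalTimeZone)    // default time zone id

    parts.foreach(p => println(p.spec))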

Memory exception issue when inferring schema from huge JSON

2020-10-21 Thread Syed Furqan Rizvi
Hi, I am reading JSON part files using Spark; there are 1.4 million records and the input size is close to 200 GB. While reading/inferring the schema (spark.read.json), it throws an out of memory exception. The job runs in the cluster, where I am providing 22g executor memory, 4
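Schema inference scans the input, which is expensive at 200 GB. Two standard ways to avoid the full pass, sketched below with placeholder field names and paths:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().getOrCreate()

    // Option 1: declare the schema up front so no inference pass is needed
    // (field names are placeholders for the real ones).
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("payload", StringType)))
    val dfExplicit = spark.read.schema(schema).json("/path/to/json/")

    // Option 2: infer the schema from a small sample instead of every record.
    val dfSampled = spark.read.option("samplingRatio", "0.001").json("/path/to/json/")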

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
We actually zipped the full conda environments during our build and shipped those. On Wed, 21 Oct 2020 at 20:25, Mich Talebzadeh wrote: > How about PySpark? What process can that go through to not depend on > external repo access in production?
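For reference, a hedged command-line sketch of the pack-and-ship pattern Wim describes (tool, file, and path names are illustrative; this assumes conda-pack and a YARN cluster):

    # Pack the conda environment into a relocatable archive.
    conda pack -f -o pyspark_conda_env.tar.gz

    # Ship it with the job and point executors at the unpacked Python.
    export PYSPARK_PYTHON=./environment/bin/python
    spark-submit \
      --master yarn \
      --archives pyspark_conda_env.tar.gz#environment \
      app.py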

Client APIs for Accessing Spark Data Frames Directly

2020-10-21 Thread Artemis User
Is there any way to access the DataFrame contents directly/interactively via some client access API? Some background info: 1. We have a Java client application that uses Spark launcher to submit a Spark job to a Spark master. 2. The default Spark launcher API has only a handle API that
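For readers unfamiliar with the handle API mentioned in point 2, a minimal Scala sketch (jar path, class, and master URL are hypothetical): the handle reports application lifecycle state but gives no direct access to DataFrame contents, which is the gap the poster is asking about.

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Launch a job and receive a handle that only exposes lifecycle state.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/job.jar")    // hypothetical artifact
      .setMainClass("com.example.MyJob")     // hypothetical main class
      .setMaster("spark://master:7077")
      .startApplication()

    handle.addListener(new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })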

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Mich Talebzadeh
How about PySpark? What process can that go through to not depend on external repo access in production?

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Sean Owen
Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy to resolve dependencies (and of course excluding 'provided' dependencies like Spark), and push that to production. That gives you a static artifact to run that does not depend on external repo access in production. On Wed,
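A minimal build.sbt sketch of that setup, assuming sbt-assembly is available (versions and the extra dependency are illustrative):

    name := "my-spark-job"
    scalaVersion := "2.12.10"

    // 'Provided' keeps Spark out of the uber-jar; the cluster supplies it at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided,
      "com.typesafe"      % "config"    % "1.4.0"   // example dependency that does get bundled
    )

    // With addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0") in project/plugins.sbt,
    // `sbt assembly` emits one static artifact that needs no repo access in production.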

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Wim Van Leuven
I like an artefact repo as the proper solution. The problem with environments that haven't yet fully embraced devops: artefact repos are considered development tools and are often not yet used to promote packages to production, air-gapped if necessary. -wim On Wed, 21 Oct 2020 at 19:00, Mich

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Mich Talebzadeh
Hi Wim, This is an issue DEV/OPS face all the time: no internet access from behind the company firewall. Nexus exists for this and manages dependencies, with typical load times in seconds. However, only authorised accounts can request it through

Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-21 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
I think maxOffsetsPerTrigger in the Spark + Kafka integration docs would meet your requirement. Sent from my iPhone > On 21 Oct 2020, at 12:36, KhajaAsmath Mohammed > wrote: > > Thanks. Do we have an option to limit the number of records? Like process only > 1 or the property we pass? This
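A minimal Scala sketch of that option on the Kafka source (broker and topic names are hypothetical); note the documented option name is maxOffsetsPerTrigger:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Cap each micro-batch at roughly 10,000 Kafka records.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .option("maxOffsetsPerTrigger", "10000")
      .load()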

Re: Spark 3.0 ArrayIndexOutOfBoundsException at RDDOperationScope.toJson

2020-10-21 Thread constantin
I also hit a very similar java.lang.ArrayIndexOutOfBoundsException: 28499 after migrating from 2.4.3 to Spark 3.0.1, when performing count, countApprox, or rdd operations on Spark datasets. For me, this solution worked: https://programmersought.com/article/35311239379/ Basically I added this
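The snippet cuts off before the actual change. One commonly cited workaround for this class of exception after a Spark 3 migration is pinning the transitive paranamer dependency to 2.8; this is an assumption about what the linked article describes, not a quote of the poster's fix.

    // build.sbt -- hedged guess at the class of fix referenced above:
    // paranamer 2.7 pulled in transitively can throw ArrayIndexOutOfBoundsException
    // during Jackson reflection; forcing 2.8 is a widely reported workaround.
    dependencyOverrides += "com.thoughtworks.paranamer" % "paranamer" % "2.8"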

Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-21 Thread KhajaAsmath Mohammed
Thanks. Do we have an option to limit the number of records? Like process only 1 or the property we pass? This way we can control the amount of data for the batches we need. Sent from my iPhone > On Oct 21, 2020, at 12:11 AM, lec ssmi wrote: > > > Structured streaming's bottom