We have JSONL files, each of which is compressed as a gz file. Is it possible
to make Spark Structured Streaming (SSS) handle such files? I appreciate any help!
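Something along these lines is what I am hoping will work (the schema and path
below are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("gz-jsonl").getOrCreate()

// Streaming file sources need an explicit schema; adjust to the real data.
val schema = new StructType().add("id", LongType).add("payload", StringType)

// Spark decompresses *.gz transparently via the Hadoop codecs, so the json
// source should be able to read gzipped JSONL files directly.
val stream = spark.readStream
  .schema(schema)
  .json("hdfs:///data/events/*.jsonl.gz")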
Hello everyone,
Is there a way to specify rack awareness in Spark? For example, if I want
to use aggregateByKey, is there a way to let Spark aggregate within the same
rack first, then aggregate between racks? I'm interested in this because I am
trying to figure out whether there is a way to deal
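For context, this is the kind of aggregation I mean. As far as I can tell,
aggregateByKey only combines per partition before the shuffle, with no
rack-level step in between (the values below are made up):

import org.apache.spark.SparkContext

def example(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
  // aggregateByKey already combines within each partition (map side) before
  // shuffling, so only partial aggregates cross the network, but there is no
  // public hook for an extra rack-level combine step.
  val sums = pairs.aggregateByKey(0L)(
    (acc, v) => acc + v, // seqOp: runs locally within each partition
    (a, b)   => a + b    // combOp: merges partials across partitions
  )
  sums.collect().foreach(println)
}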
Hi Li,
Thank you very much for your reply!
> Did you make the headless service that reflects the driver pod name?
I am not sure, but I used “app” as the selector in the headless service, which
is the same app name as the StatefulSet that creates the Spark driver pod.
For your reference, I
It is okay to collect the iterator. That will not break Spark. However,
collecting it requires memory in the executor, so you may cause OOMs if a
group has a LOT of new data.
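For example, something like this is safe as long as each group fits in memory
(the types are made up for illustration):

import org.apache.spark.sql.streaming.GroupState

case class Event(user: String, value: Long) // hypothetical input type
case class RunningTotal(sum: Long)          // hypothetical state type

// Materializing the iterator is fine for correctness, but it buffers the
// whole group in executor memory, so a very large group can OOM.
def func(key: String,
         events: Iterator[Event],
         state: GroupState[RunningTotal]): Iterator[(String, Long)] = {
  val buffered = events.toSeq // <- the "collect"
  val total = state.getOption.map(_.sum).getOrElse(0L) + buffered.map(_.value).sum
  state.update(RunningTotal(total))
  Iterator((key, total))
}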
On Wed, Oct 31, 2018 at 3:44 AM Antonio Murgia -
antonio.murg...@studio.unibo.it wrote:
> Hi all,
>
> I'm currently
Hi Yuqi,
Yes, we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's
client mode to launch PySpark. In client mode your driver runs on the same pod
as your kernel.
I am planning to write a blog post on this at some future date. Did you
make the headless service
Spark version 2.2.0
Hive version 1.1.0
There are a lot of small files.
Spark code:
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true",
val logs = spark.read.schema(schema)
  .orc("hdfs://test/date=201810")
  .filter("date > 20181003")
Hive:
"spark.sql.orc.enabled": "true",
How large are they? A lot of (small) files will cause significant delays in
progress - try to merge as many as possible into one file, e.g. along the
lines of the sketch below.
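A rough compaction sketch (paths and partition count are placeholders):

// Read the many small ORC files and rewrite them as a few larger ones.
val df = spark.read.orc("hdfs://test/date=201810")
df.coalesce(8) // pick a partition count that suits the total data size
  .write.mode("overwrite")
  .orc("hdfs://test/date=201810_compacted")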
Can you please share the full source code in Hive and Spark, as well as the
versions you are using?
> Am 31.10.2018 um 18:23 schrieb gpatcham :
>
>
>
> When
When reading a large number of ORC files from HDFS under a directory, Spark
doesn't launch any tasks for some amount of time, and I don't see any tasks
running during that time. I'm using the command below to read the ORC files,
plus the spark.sql configs shown.
What is Spark doing under the hood when spark.read.orc is
Hi Li,
Thank you for your reply.
Do you mean running the Jupyter client on a k8s cluster to use Spark 2.4?
Actually, I am also trying to set up JupyterHub on k8s to use Spark; that's why
I would like to know how to run Spark client mode on a k8s cluster. If there is
any related documentation on how to
Yuqi,
Your error seems unrelated to the headless service config you need to enable.
For that, you need to create a headless service that matches your driver pod
name exactly in order for the Spark 2.4 RC to work in client mode. We have had
this running for a while now using a Jupyter kernel as
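Roughly, the client-mode wiring looks like this (every name below is
illustrative; adjust to your cluster):

import org.apache.spark.sql.SparkSession

// In client mode on k8s the executors connect back to the driver, so
// spark.driver.host must resolve to the driver pod - e.g. via the DNS
// name of the headless service that selects it.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.default.svc:443")
  .config("spark.driver.host", "spark-driver-svc.default.svc.cluster.local")
  .config("spark.driver.port", "29413")
  .config("spark.kubernetes.container.image", "my-spark:2.4")
  .getOrCreate()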
Thanks very much. Has anyone here run this on PySpark?
Greetings,
There are libraries for deep neural nets that can be used with Spark. DL4J is
one, and it's as simple as changing a constructor and the Maven dependency.
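From memory, the Spark wrapper looks roughly like this - please double-check
against the DL4J docs; the numbers are placeholders:

import org.apache.spark.SparkContext
import org.deeplearning4j.nn.conf.MultiLayerConfiguration
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster

// Wrap an existing DL4J network configuration for training on Spark;
// `conf` is built exactly as in any non-Spark DL4J example.
def toSparkNet(sc: SparkContext, conf: MultiLayerConfiguration): SparkDl4jMultiLayer = {
  val tm = new ParameterAveragingTrainingMaster.Builder(32) // examples per RDD element
    .batchSizePerWorker(32)
    .build()
  new SparkDl4jMultiLayer(sc, conf, tm)
}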
BR
MK
Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and
Hi Gourav,
Thank you for your reply.
I haven't tried Glue or EMR, but I guess it's integrating Kubernetes on AWS
instances?
I could set up the k8s cluster on AWS, but my problem is that I don't know how
to run spark-shell on Kubernetes…
Since Spark only supports client mode on k8s from version 2.4, which
Are there any libraries in Spark to support deep neural networks?
Hi Jorn,
Thanks for the help. I switched to using Apache Parquet 1.8.3 and now Spark
successfully loads the parquet file.
Do you have any hint for the other part of my question? What is the correct
way to reproduce this schema:
message Document {
  required int64 DocId;
  optional group Links {
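For what it's worth, the visible part maps to a Spark StructType along these
lines (the Links fields are cut off above, so they are omitted):

import org.apache.spark.sql.types._

// Mirrors only the visible part of the Parquet schema; fill in the
// Links fields from the full definition.
val linksType = new StructType()
val documentSchema = new StructType()
  .add("DocId", LongType, nullable = false) // required int64 DocId
  .add("Links", linksType, nullable = true) // optional group Links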
I would try with the same version as Spark uses first. I don't have the
changelog of Parquet in my head (but you can find it on the Internet), but the
version mismatch could be the cause of your issues.
> Am 31.10.2018 um 12:26 schrieb lchorbadjiev :
>
> Hi Jorn,
>
> I am using Apache Spark 2.3.1.
>
> For
Hi Jorn,
I am using Apache Spark 2.3.1.
For creating the Parquet file I used Apache Parquet (parquet-mr) 1.10.
This does not match the version of Parquet used in Apache Spark 2.3.1, and if
you think that this could be the problem, I could try Apache Parquet version
1.8.3.
I created a
Hi all,
I'm currently developing a Spark Structured Streaming job and I'm performing
flatMapGroupsWithState.
I'm concerned about the laziness of the Iterator[V] that is passed to my custom
function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]).
Is it ok to collect that iterator (with
Hello all,
I have this peculiar problem where quote " characters are added to the
beginning and end of my string values.
I get data using Structured Streaming from an Azure Event Hub using a Scala
Notebook in Azure Databricks.
The DataFrame schema received contains a property of type Map named
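As a stopgap I am considering stripping the literal quotes, along these lines
(the DataFrame and column name are illustrative):

import org.apache.spark.sql.functions.{col, regexp_replace}

// Remove a literal leading and/or trailing double quote from the value.
val cleaned = df.withColumn("value", regexp_replace(col("value"), "^\"|\"$", ""))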
Hi Yuqi,
Just curious, can you share the spark-submit script and what you are passing
as the --master argument?
Thanks & Regards
Biplob Biswas
On Wed, Oct 31, 2018 at 10:34 AM Gourav Sengupta
wrote:
> Just out of curiosity, why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
Just out of curiosity, why would you not use Glue (which is Spark on
kubernetes) or EMR?
Regards,
Gourav Sengupta
On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi wrote:
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb you, but I have a problem
> regarding using the Spark 2.4 client