Spark Structured Streaming handles compressed files

2018-10-31 Thread Lian Jiang
We have jsonl files, each of which is compressed as a gz file. Is it possible to make SSS handle such files? Appreciate any help!
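
Spark's file sources decompress gzip transparently through the Hadoop codecs, so the JSON streaming source can read .json.gz files directly; note that gzip is not splittable, so each file is read by a single task. A minimal sketch, with the schema fields and input path as hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("gz-jsonl-stream").getOrCreate()

// Streaming file sources require an explicit schema; these field names
// are hypothetical placeholders for the real jsonl structure.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)
))

// Hadoop's compression codecs decompress .gz transparently, so the json
// source reads the files directly; the input path is an assumption.
val stream = spark.readStream
  .schema(schema)
  .json("hdfs:///data/incoming/*.json.gz")

stream.writeStream
  .format("console")
  .start()
  .awaitTermination()
```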

Rack Awareness in Spark

2018-10-31 Thread RuiyangChen
Hello everyone, Is there a way to specify rack awareness in Spark? For example, if I want to use aggregateByKey, is there a way to let Spark aggregate within the same rack first, then aggregate between racks? I'm interested in this because I am trying to figure out whether there is a way to deal
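
As far as the standard APIs go, Spark's rack awareness only affects task scheduling locality (RACK_LOCAL placement against storage such as HDFS); there is no user-facing hook to make a shuffle aggregate per rack. That said, aggregateByKey already combines values locally within each partition before anything crosses the network, which captures much of the intended saving. A minimal sketch with hypothetical data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("agg-sketch"))

// Hypothetical (key, value) pairs; aggregateByKey runs seqOp locally
// within each partition before the shuffle, which is the closest
// built-in analogue to "aggregate near the data first".
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val sums = pairs.aggregateByKey(0)(
  (acc, v) => acc + v, // seqOp: partial aggregation per partition
  (a, b) => a + b      // combOp: merge partial sums after the shuffle
)

sums.collect().foreach(println)
```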

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li, Thank you very much for your reply! > Did you make the headless service that reflects the driver pod name? I am not sure, but I used “app” as the selector in the headless service, which is the same app name as for the StatefulSet that will create the spark driver pod. For your reference, I

Re: Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

2018-10-31 Thread Tathagata Das
It is okay to collect the iterator. That will not break Spark. However, collecting it requires memory in the executor, so you may cause OOMs if a group has a LOT of new data.
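
Following on from that answer, a minimal sketch of materializing the iterator inside the state function; the event and output types are hypothetical, and toSeq is what pulls the whole group's new rows into executor memory at once, which is exactly where the OOM risk comes from:

```scala
import org.apache.spark.sql.streaming.GroupState

// Hypothetical event and output types for illustration.
case class Event(user: String, value: Long)
case class Output(user: String, total: Long)

// Matches the (K, Iterator[V], GroupState[S]) => Iterator[U] shape
// expected by flatMapGroupsWithState.
def updateGroup(
    user: String,
    events: Iterator[Event],
    state: GroupState[Long]): Iterator[Output] = {
  val batch = events.toSeq // materializes the group's new rows in memory
  val total = state.getOption.getOrElse(0L) + batch.map(_.value).sum
  state.update(total)
  Iterator(Output(user, total))
}
```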

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Hi Yuqi, Yes, we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's client mode to launch pyspark. In client mode your driver runs on the same pod where your kernel runs. I am planning to write a blog post on this at some future date. Did you make the headless service

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
Spark version 2.2.0, Hive version 1.1.0. There are a lot of small files. Spark code: "spark.sql.orc.enabled": "true", "spark.sql.orc.filterPushdown": "true", val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date > 20181003") Hive: "spark.sql.orc.enabled": "true",

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause a significant delay in making progress; try to merge as many as possible into one file. Can you please share the full source code in Hive and Spark, as well as the versions you are using?
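
One hedged sketch of such a merge, compacting the small ORC files into fewer, larger ones; the paths and the target partition count of 16 are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("orc-compaction").getOrCreate()

// Read the many small files, coalesce them into fewer partitions, and
// write the result out; writing to a new directory avoids clobbering
// the input while it is being read.
spark.read.orc("hdfs://test/date=201810")
  .repartition(16)
  .write
  .mode("overwrite")
  .orc("hdfs://test/compacted/date=201810")
```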

Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
When reading a large number of orc files from HDFS under a directory, spark doesn't launch any tasks for some amount of time, and I don't see any tasks running during that time. I'm using the command below to read orc, with the spark.sql configs. What is spark doing under the hood when spark.read.orc is

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li, Thank you for your reply. Do you mean running the Jupyter client on a k8s cluster to use spark 2.4? Actually I am also trying to set up JupyterHub on k8s to use spark, which is why I would like to know how to run spark client mode on a k8s cluster. If there is any related documentation on how to

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Yuqi, Your error seems unrelated to the headless service config you need to enable. For the headless service, you need to create one that matches your driver pod name exactly in order for the spark 2.4 RC to work under client mode. We have had this running for a while now using Jupyter kernel as
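
A hedged sketch of the client-mode session settings this implies: in client mode the executors must reach the driver pod through a stable DNS name, which is what the headless service provides. Every name below (master URL, service, port, image) is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Client mode against a k8s master; spark.driver.host must resolve to
// the driver pod, here via a headless service's cluster DNS name.
val spark = SparkSession.builder
  .master("k8s://https://kubernetes.default.svc")
  .appName("client-mode-sketch")
  .config("spark.driver.host", "spark-driver-svc.default.svc.cluster.local")
  .config("spark.driver.port", "29413")
  .config("spark.kubernetes.container.image", "myrepo/spark:2.4.0")
  .config("spark.executor.instances", "2")
  .getOrCreate()
```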

Re: I want run deep neural network on Spark

2018-10-31 Thread hager
Many thanks! Has anyone here run it on pyspark?

Re: I want run deep neural network on Spark

2018-10-31 Thread Kunkel, Michael C.
Greetings, There are libraries for deep neural nets that can be used with spark. DL4J is one, and it's as simple as changing a constructor and the maven dependency. BR, MK (Michael C. Kunkel, Forschungszentrum Jülich)
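
As a rough illustration of the constructor swap being described, a sketch of DL4J's Spark wrapper, assuming a MultiLayerConfiguration and an RDD[DataSet] already exist; the builder values are placeholders, not recommendations:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.deeplearning4j.nn.conf.MultiLayerConfiguration
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
import org.nd4j.linalg.dataset.DataSet

// Wrap an existing network configuration in DL4J's Spark trainer and
// fit it on a distributed dataset; building `conf` and `data` is elided.
def trainOnSpark(
    sc: SparkContext,
    conf: MultiLayerConfiguration,
    data: RDD[DataSet]): SparkDl4jMultiLayer = {
  val tm = new ParameterAveragingTrainingMaster.Builder(32) // examples per DataSet object
    .averagingFrequency(5) // average parameters every 5 minibatches
    .build()
  val net = new SparkDl4jMultiLayer(sc, conf, tm)
  net.fit(data)
  net
}
```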

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Gourav, Thank you for your reply. I haven't tried Glue or EMR, but I guess it's integrating kubernetes on aws instances? I could set up the k8s cluster on AWS, but my problem is I don't know how to run spark-shell on kubernetes… Since spark only supports client mode on k8s from version 2.4, which

I want run deep neural network on Spark

2018-10-31 Thread hager
Are there any libraries in spark to support deep neural networks?

Re: dremel paper example schema

2018-10-31 Thread lchorbadjiev
Hi Jorn, Thanks for the help. I switched to using Apache Parquet 1.8.3 and now Spark successfully loads the parquet file. Do you have any hints for the other part of my question? What is the correct way to reproduce this schema: message Document { required int64 DocId; optional group Links {
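
For reference, the quoted schema is the Document example from the Dremel paper. A hedged sketch of one way to express it as a Spark StructType, reconstructed from the paper rather than from the truncated message; repeated fields become ArrayType and optional groups become nullable fields:

```scala
import org.apache.spark.sql.types._

// Dremel's Document schema mapped onto Spark SQL types.
val linksType = StructType(Seq(
  StructField("Backward", ArrayType(LongType, containsNull = false)),
  StructField("Forward", ArrayType(LongType, containsNull = false))
))

val languageType = StructType(Seq(
  StructField("Code", StringType, nullable = false),
  StructField("Country", StringType)
))

val nameType = StructType(Seq(
  StructField("Language", ArrayType(languageType, containsNull = false)),
  StructField("Url", StringType)
))

val documentSchema = StructType(Seq(
  StructField("DocId", LongType, nullable = false),
  StructField("Links", linksType),
  StructField("Name", ArrayType(nameType, containsNull = false))
))
```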

Re: dremel paper example schema

2018-10-31 Thread Jörn Franke
I would try with the same version as Spark uses first. I don't have the changelog of parquet in my head (but you can find it on the Internet), but it could be the cause of your issues.

Re: dremel paper example schema

2018-10-31 Thread lchorbadjiev
Hi Jorn, I am using Apache Spark 2.3.1. For creating the parquet file I used Apache Parquet (parquet-mr) 1.10. This does not match the version of parquet used in Apache Spark 2.3.1, and if you think this could be the problem, I could try Apache Parquet version 1.8.3. I created a

Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

2018-10-31 Thread Antonio Murgia - antonio.murg...@studio.unibo.it
Hi all, I'm currently developing a Spark Structured Streaming job and I'm performing flatMapGroupsWithState. I'm concerned about the laziness of the Iterator[V] that is passed to my custom function (func: (K, Iterator[V], GroupState[S]) => Iterator[U]). Is it ok to collect that iterator (with

Event Hubs properties kvp-value adds " to strings

2018-10-31 Thread Magnus Nilsson
Hello all, I have this peculiar problem where quote (") characters are added to the beginning and end of my string values. I get data using Structured Streaming from an Azure Event Hub using a Scala notebook in Azure Databricks. The DataFrame schema received contains a property of type Map named
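
A hedged sketch of stripping the literal surrounding quotes once the data arrives, assuming the map column is named properties; the key name is a placeholder:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Remove a leading and trailing literal quote character from one entry
// of the `properties` map column; the key name is a placeholder.
def stripQuotes(df: DataFrame, key: String): DataFrame =
  df.withColumn(
    key,
    regexp_replace(col("properties").getItem(key), "^\"|\"$", "")
  )
```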

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Biplob Biswas
Hi Yuqi, Just curious, can you share the spark-submit script and what you are passing as the --master argument? Thanks & Regards, Biplob Biswas

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Gourav Sengupta
Just out of curiosity, why would you not use Glue (which is Spark on kubernetes) or EMR? Regards, Gourav Sengupta On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi wrote: > Hello guys, > > I am Yuqi from Teradata Tokyo. Sorry to disturb, but I have some problems > regarding using spark 2.4 client