read parallel processing spark-cassandra

2018-02-13 Thread sujeet jog
Folks, I have a time series table with each record having 350 columns. The primary key is ((date, bucket), objectid, timestamp). The objective is to read one day's worth of data, which comes to around 12k partitions; each partition has around 25 MB of data. I see only 1 task active during the read
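A minimal sketch of one way to spread such a read across many tasks with the spark-cassandra-connector: enumerate the day's (date, bucket) partition keys and join against the table instead of scanning it. Keyspace, table, host, and the bucket range below are assumptions.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallel-read-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
  .set("spark.cassandra.input.split.sizeInMB", "64")   // smaller splits => more read tasks
val sc = new SparkContext(conf)

// Hypothetical bucket range; adjust to however buckets are assigned.
val day = "2018-02-13"
val keys = sc.parallelize((0 until 100).map(b => (day, b)), 48) // 48 partitions => up to 48 parallel reads
val dayRdd = keys
  .joinWithCassandraTable("my_keyspace", "timeseries") // hypothetical names
  .on(SomeColumns("date", "bucket"))
```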

not able to read git info from Scala Test Suite

2018-02-13 Thread karan alang
Hello - I'm writing a Scala unit test for my Spark project which checks the git information, and somehow it is not working from the unit test. Added in pom.xml -- pl.project13.maven : git-commit-id-plugin : 2.2.4
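For reference, a minimal check along these lines, assuming the plugin is configured with generateGitPropertiesFile=true so a git.properties file lands on the test classpath (the resource path and the git.commit.id key are the plugin's defaults):

```scala
import java.util.Properties

// Load the git.properties that git-commit-id-plugin generates at build time.
val stream = getClass.getResourceAsStream("/git.properties")
assert(stream != null, "git.properties not found on the test classpath")
val props = new Properties()
props.load(stream)
stream.close()
assert(props.getProperty("git.commit.id") != null) // default key for the commit hash
```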

[Spark GraphX pregel] default value for EdgeDirection not consistent between programming guide and API documentation

2018-02-13 Thread Ramon Bejar Torres
Hi, I just wanted to point out that in the API doc page for the pregel operator (GraphX API for Spark 2.2.1):
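For context, the operator's signature as it appears in the 2.2.x GraphX API (quoted from memory, so verify against the docs); the reported inconsistency concerns the default value of activeDirection:

```scala
// Graph.pregel in the 2.2.x API; the programming guide's listing shows a
// different default for activeDirection than the API documentation below.
def pregel[A: ClassTag](
    initialMsg: A,
    maxIterations: Int = Int.MaxValue,
    activeDirection: EdgeDirection = EdgeDirection.Either)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A): Graph[VD, ED]
```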

Inefficient state management in stream to stream join in 2.3

2018-02-13 Thread Yogesh Mahajan
In 2.3, stream-to-stream joins (both inner and outer) are implemented using the symmetric hash join (SHJ) algorithm, which is a good choice, and I am sure you compared it with other families of algorithms such as XJoin and non-blocking sort-based algorithms like progressive merge join (PMJ
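For readers following along, a minimal 2.3-style stream-stream inner join in the spirit of the programming guide's example (the rate sources and column names are stand-ins); the watermarks plus the time-range predicate are what let the SHJ evict old state rather than buffer both sides forever:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("shj-sketch").master("local[*]").getOrCreate()

// The rate source stands in for real inputs.
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")
val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 minutes")

// Watermark + time bound => the join state for each side can be cleaned up.
val joined = impressions.join(clicks, expr(
  "clickAdId = impressionAdId AND " +
  "clickTime >= impressionTime AND " +
  "clickTime <= impressionTime + interval 1 hour"))
```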

Why python cluster mode is not supported in standalone cluster?

2018-02-13 Thread Ashwin Sai Shankar
Hi Spark users! I noticed that Spark doesn't allow Python apps to run in cluster mode on a Spark standalone cluster. Does anyone know the reason? I checked JIRA but couldn't find anything relevant. Thanks, Ashwin

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-02-13 Thread Yogesh Mahajan
I had a similar issue, and I think that's where the Structured Streaming design falls short. It seems Question #2 in your email is a viable workaround for you. In my case, I have a custom Sink backed by an efficient in-memory column store suited for fast ingestion. I have a Kafka stream coming from
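For concreteness, the 2.x-era Sink contract the poster describes implementing looks roughly like the sketch below. Note it lives in an internal package, so it can change between releases; the column-store ingestion itself is a placeholder.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class ColumnStoreSink extends Sink {
  // Called once per micro-batch; batchId is monotonically increasing, which
  // a sink can use to make re-delivered batches idempotent.
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.collect().foreach { row =>
      // ingest `row` into the in-memory column store (application-specific)
    }
  }
}
```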

Re: org.apache.kafka.clients.consumer.OffsetOutOfRangeException

2018-02-13 Thread dcam
Hi Mina, I believe this is different for Structured Streaming from Kafka, specifically. I'm assuming you are using Structured Streaming based on the name of the dependency ("spark-streaming-kafka"). There is a note in the docs here:
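For anyone hitting the same thing, these are the documented Kafka source options in play (broker and topic below are placeholders); failOnDataLoss governs whether the query dies when requested offsets have already been aged out of the topic:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-sketch").master("local[*]").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false") // don't kill the query when offsets are gone
  .load()
```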

Spark 2.2.1 EMR 5.11.1 Encrypted S3 bucket overwriting parquet file

2018-02-13 Thread Stephen Robinson
Hi All, I am using the latest version of EMR to overwrite Parquet files to an S3 bucket encrypted with a KMS key. I am seeing the attached error whenever I overwrite a Parquet file. For example, the code below produces the attached error and stack trace:
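Without the stack trace it is hard to be definitive, but for the s3a connector the standard SSE-KMS settings look like the sketch below (bucket, path, and key ARN are placeholders; EMR's own EMRFS s3:// URIs are configured through its security configuration instead):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("s3-kms-sketch").getOrCreate()

val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
hc.set("fs.s3a.server-side-encryption.key", "<your-kms-key-arn>") // placeholder

spark.range(10).write.mode("overwrite").parquet("s3a://my-bucket/table/") // placeholder path
```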

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-02-13 Thread dcam
Hi Priyank, I have a similar structure, although I am reading from Kafka and sinking to multiple MySQL tables. My input stream has multiple message types, and each is headed for a different MySQL table. I've looked for a solution for a few months and have only come up with two alternatives: 1.
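A sketch of what one such alternative amounts to, under assumptions: the stream is already parsed into columns including a messageType discriminator, and MySqlWriter is a hypothetical ForeachWriter that would own a JDBC connection per partition. Each query re-reads and filters the source independently:

```scala
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}

val spark = SparkSession.builder.appName("fanout-sketch").getOrCreate()
import spark.implicits._

// Hypothetical writer: one instance per target MySQL table.
class MySqlWriter(table: String) extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true // open a JDBC connection here
  def process(row: Row): Unit = ()                           // INSERT the row into `table`
  def close(errorOrNull: Throwable): Unit = ()               // close the connection
}

val parsed: DataFrame = ??? // stands for the Kafka stream parsed into columns

// One streaming query per destination table, each with its own filter.
parsed.filter($"messageType" === "order")
  .writeStream.foreach(new MySqlWriter("orders")).start()
parsed.filter($"messageType" === "user")
  .writeStream.foreach(new MySqlWriter("users")).start()
```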

Retrieve batch metadata via the spark monitoring api

2018-02-13 Thread Hendrik Dev
I use Spark 2.2.1 with streaming, and when I open the Spark Streaming UI I can see input metadata for each of my batches. In my case I stream from Kafka, and in the metadata section I find useful information about my topic, partitions, and offsets. Assume the URL for this batch looks like
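Presumably a URL along these lines; the monitoring REST API exposes batch-level endpoints for streaming applications (host, port, and application id below are placeholders):

```scala
import scala.io.Source

val appId = "app-20180213120000-0000" // placeholder application id
val url = s"http://driver-host:4040/api/v1/applications/$appId/streaming/batches"
println(Source.fromURL(url).mkString) // JSON description of recent batches
```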

Run Multiple Spark jobs. Reduce Execution time.

2018-02-13 Thread akshay naidu
Hello, I'm trying to run multiple Spark jobs on a cluster running on YARN. The master is a 24 GB server with 6 slaves of 12 GB each. The fairscheduler.xml settings are: schedulingMode FAIR, weight 10, minShare 2. I am running 8 jobs simultaneously; the jobs run in parallel, but not all of them: at any one time only 7 of them run simultaneously
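A sketch of the pool wiring those settings imply, assuming a pool named "production" defined in the allocation file (the pool name and file path here are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-pools-sketch")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // placeholder path
val sc = new SparkContext(conf)

// Jobs submitted from this thread land in the named pool.
sc.setLocalProperty("spark.scheduler.pool", "production") // placeholder pool name
```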

Re: can udaf's return complex types?

2018-02-13 Thread Matteo Cossu
Hello, yes, they can certainly return complex types. For example, the functions collect_list and collect_set return an ArrayType. On 10 February 2018 at 14:28, kant kodali wrote: > Hi All, > > Can UDAFs return complex types? Like, say, a Map with the key as an Integer and > the value
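A minimal sketch of such a UDAF via the 2.x UserDefinedAggregateFunction API (the class name and counting logic are illustrative): it returns a Map[Int, Long] of per-value counts.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CountByKey extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("key", IntegerType)
  def bufferSchema: StructType = new StructType().add("counts", MapType(IntegerType, LongType))
  def dataType: DataType = MapType(IntegerType, LongType) // complex return type
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[Int, Long]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val m = buffer.getMap[Int, Long](0).toMap
      val k = input.getInt(0)
      buffer(0) = m + (k -> (m.getOrElse(k, 0L) + 1L))
    }

  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    val m1 = b1.getMap[Int, Long](0).toMap
    val m2 = b2.getMap[Int, Long](0)
    b1(0) = m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0L) + v)) }
  }

  def evaluate(buffer: Row): Any = buffer.getMap[Int, Long](0)
}
```

Applied directly, e.g. df.agg(new CountByKey()(col("key"))), the result column carries the MapType.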

[Spark-Listener] [How-to] Listen only to specific events

2018-02-13 Thread Naved Alam
I have a Spark application which creates multiple sessions. Each of these sessions can run jobs in parallel. I want to log some details about the execution of these jobs, but want to tag them with the session they were called from. I tried creating a listener from within each session
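One sketch that avoids per-session listeners, under assumptions (the property key is made up): set a local property on each session's thread and read it back from the job-start event in a single shared listener.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

class SessionTaggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Local properties set on the submitting thread travel with the job.
    val session = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("custom.sessionId")))
      .getOrElse("unknown")
    println(s"job ${jobStart.jobId} started by session $session")
  }
}

// In each session's thread, before running jobs:
//   sc.setLocalProperty("custom.sessionId", "session-42")
// Register once: sc.addSparkListener(new SessionTaggingListener)
```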