Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Srinivas V
> > >>>>>> [1] > > >>>>>> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread ZHANG Wei
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2 > >>>>

Re: Using Spark Accumulators with Structured Streaming

2020-05-28 Thread Srinivas V
to work? >>>>>> >>>>>> --- >>>>>> Cheers, >>>>>> -z >>>>>> [1] >>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2 >>>>>> [2

Re: Using Spark Accumulators with Structured Streaming

2020-05-27 Thread Something Something
>>>> [1] >>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.AccumulatorV2 >>>>> [2] >>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator >>>>&g

Re: Using Spark Accumulators with Structured Streaming

2020-05-27 Thread Srinivas V
index.html#org.apache.spark.util.AccumulatorV2 >>>> [2] >>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator >>>> [3] >>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator >>>> >>>> _

Re: Using Spark Accumulators with Structured Streaming

2020-05-26 Thread Something Something
e.org/docs/latest/api/scala/index.html#org.apache.spark.util.CollectionAccumulator >>> [3] >>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator >>> >>> >>> From: Something Somethi

Re: Using Spark Accumulators with Structured Streaming

2020-05-25 Thread Srinivas V
;> [3] >> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator >> >> ____ >> From: Something Something >> Sent: Saturday, May 16, 2020 0:38 >> To: spark-user >> Subject: Re: Using Spark Accumulators with Structured Streami

Re: Using Spark Accumulators with Structured Streaming

2020-05-25 Thread Something Something
__ > From: Something Something > Sent: Saturday, May 16, 2020 0:38 > To: spark-user > Subject: Re: Using Spark Accumulators with Structured Streaming > > Can someone from Spark Development team tell me if this functionality is > supported and tested? I've spent a l

Re: ETL Using Spark

2020-05-24 Thread vijay.bvp
Hi Avadhut Narayan Joshi, The use case is achievable using Spark. Connection to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server. For production workloads, important points to consider: >> what are the QoS requirements for your case? at

Re: ETL Using Spark

2020-05-21 Thread Mich Talebzadeh
ny monetary damages arising from such loss, damage or destruction. On Thu, 21 May 2020 at 16:15, Avadhut Narayan Joshi wrote: > Hello Team > > > > I am working on ETL using Spark . > > > >- I am fetching streaming data from Confluent Kafka >- Wanted to do

ETL Using Spark

2020-05-21 Thread Avadhut Narayan Joshi
Hello Team I am working on ETL using Spark . * I am fetching streaming data from Confluent Kafka * Wanted to do aggregations by combining streaming data with Data from SQL Server For achieving above use case 1. Can I fetch data from SQL Server into Spark based on where
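A minimal sketch of that pattern, assuming the spark-sql-kafka package is on the classpath and that the broker addresses, JDBC details and column names below are placeholders: read the Kafka topic as a stream, load the SQL Server table as a static DataFrame over JDBC, then do a stream-static join before aggregating.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("KafkaSqlServerEtl").getOrCreate()
  import spark.implicits._

  // Streaming source: Confluent Kafka topic (servers/topic are placeholders)
  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(key AS STRING) AS orderId", "CAST(value AS STRING) AS payload")

  // Static lookup: SQL Server table read over JDBC (URL, table, credentials are placeholders)
  val customers = spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
    .option("dbtable", "dbo.customers")
    .option("user", "spark_user")
    .option("password", "***")
    .load()

  // Stream-static join followed by an aggregation
  val agg = events
    .join(customers, events("orderId") === customers("order_id"))
    .groupBy($"region")
    .agg(count("*").as("events"))

  agg.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination()

The static JDBC side is re-evaluated each micro-batch unless it is cached, so whether to cache it depends on how often the SQL Server table changes.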

Re: Using Spark Accumulators with Structured Streaming

2020-05-15 Thread ZHANG Wei
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.util.LongAccumulator From: Something Something Sent: Saturday, May 16, 2020 0:38 To: spark-user Subject: Re: Using Spark Accumulators with Structured Streaming Can someone from Spark Develo

Re: Using Spark Accumulators with Structured Streaming

2020-05-15 Thread Something Something
Can someone from Spark Development team tell me if this functionality is supported and tested? I've spent a lot of time on this but can't get it to work. Just to add more context, we've our own Accumulator class that extends from AccumulatorV2. In this class we keep track of one or more accumulator
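For reference, a minimal sketch of what such a custom class can look like: a map-of-counters accumulator extending AccumulatorV2, where the class name and keys are made up for illustration.

  import org.apache.spark.util.AccumulatorV2
  import scala.collection.mutable

  // Accumulates a count per arbitrary string key (e.g. per event type)
  class MapCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
    private val counts = mutable.Map[String, Long]().withDefaultValue(0L)

    override def isZero: Boolean = counts.isEmpty
    override def copy(): MapCountAccumulator = {
      val acc = new MapCountAccumulator
      counts.foreach { case (k, v) => acc.counts(k) = v }
      acc
    }
    override def reset(): Unit = counts.clear()
    override def add(key: String): Unit = counts(key) += 1L
    override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
      other.value.foreach { case (k, v) => counts(k) += v }
    override def value: Map[String, Long] = counts.toMap
  }

The instance has to be registered on the driver before the query starts, e.g. spark.sparkContext.register(acc, "eventCounts"), and only the driver-side reference reflects the merged totals; the copies shipped to executors are reset per task.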

Using Spark Accumulators with Structured Streaming

2020-05-14 Thread Something Something
In my structured streaming job I am updating Spark Accumulators in the updateAcrossEvents method but they are always 0 when I try to print them in my StreamingListener. Here's the code: .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())( updateAcrossEvents ) The accumula
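A minimal sketch of that wiring, using a built-in LongAccumulator instead of a custom class; the event and state types, the socket source, and all field names here are made up for illustration.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

  case class Event(userId: String, payload: String)
  case class SessionState(count: Long)

  val spark = SparkSession.builder().appName("acc-sketch").getOrCreate()
  import spark.implicits._

  // Registered on the driver; executors ship partial counts back with each completed task
  val eventCounter = spark.sparkContext.longAccumulator("eventCounter")

  def updateAcrossEvents(userId: String,
                         events: Iterator[Event],
                         state: GroupState[SessionState]): SessionState = {
    val seen = events.size
    eventCounter.add(seen)            // accumulator updated on the executor side
    val next = SessionState(state.getOption.map(_.count).getOrElse(0L) + seen)
    state.update(next)
    next
  }

  val input = spark.readStream
    .format("socket").option("host", "localhost").option("port", 9999).load()
    .as[String]
    .map(line => Event(line.takeWhile(_ != ','), line))

  input.groupByKey(_.userId)
    .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)
    .writeStream
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination()

Reading eventCounter.value works on the driver, for example from a StreamingQueryListener onQueryProgress callback, because the listener runs on the driver where the merged value lives.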

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-09 Thread Edgardo Szrajber
Nube Technologies  On Thu, May 7, 2020 at 10:26 AM Aakash Basu wrote: Hi, I've described the problem in Stack Overflow with a lot of detailing, can you kindly check and help if possible? https://stackoverflow.com/q/61643910/5536733 I'd be absolutely fine if someone solves it using

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-09 Thread Aakash Basu
kash Basu > wrote: > > Hi, > > I've described the problem in Stack Overflow with a lot of detailing, can > you kindly check and help if possible? > > https://stackoverflow.com/q/61643910/5536733 > > I'd be absolutely fine if someone solves it using Spark SQL APIs rather > than plain spark SQL query. > > Thanks, > Aakash. > >

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-08 Thread Edgardo Szrajber
akash Basu wrote: Hi, I've described the problem in Stack Overflow with a lot of detailing, can you kindly check and help if possible? https://stackoverflow.com/q/61643910/5536733 I'd be absolutely fine if someone solves it using Spark SQL APIs rather than plain spark SQL query. Thanks,Aakash.

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-07 Thread Aakash Basu
0:26 AM Aakash Basu > wrote: > >> Hi, >> >> I've described the problem in Stack Overflow with a lot of detailing, can >> you kindly check and help if possible? >> >> https://stackoverflow.com/q/61643910/5536733 >> >> I'd be absolutely fine if someone solves it using Spark SQL APIs rather >> than plain spark SQL query. >> >> Thanks, >> Aakash. >> >

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-06 Thread Sonal Goyal
at 10:26 AM Aakash Basu wrote: > Hi, > > I've described the problem in Stack Overflow with a lot of detailing, can > you kindly check and help if possible? > > https://stackoverflow.com/q/61643910/5536733 > > I'd be absolutely fine if someone solves it using Spar

How to populate all possible combination values in columns using Spark SQL

2020-05-06 Thread Aakash Basu
Hi, I've described the problem in Stack Overflow with a lot of detailing, can you kindly check and help if possible? https://stackoverflow.com/q/61643910/5536733 I'd be absolutely fine if someone solves it using Spark SQL APIs rather than plain spark SQL query. Thanks, Aakash.

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Roland Johann
I want to add that the major hadoop distributions also offer additional encryption possibilities (for example Ranger from Hortonworks) Roland Johann Software Developer/Data Engineer phenetic GmbH Lütticher Straße 10, 50674 Köln, Germany Mobil: +49 172 365 26 46 Mail: roland.joh...@phenetic.io W

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Roland Johann
Hi all, instead of handling encryption explicit at application level, I suggest that you investigate into the topic „encryption at rest“, for example encryption at HDFS level https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Sachit Murarka
Hi Deepak, Thanks for reply. Yes. That is the option I am considering now because even apache camel needs data in local. I might need to copy data from hdfs to local if I want apache camel ( to get rid of shell). Thanks Sachit On Mon, 26 Aug 2019, 21:11 Deepak Sharma, wrote: > Hi Schit > PGP

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Deepak Sharma
Hi Schit PGP Encrypt is something that is not inbuilt with spark. I would suggest writing a shell script that would do pgp encrypt and use it in spark scala program , which would run from driver. Thanks Deepak On Mon, Aug 26, 2019 at 8:10 PM Sachit Murarka wrote: > Hi All, > > I want to encrypt
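A minimal sketch of that shell-out approach from the driver, assuming gpg is installed on the driver host and that the paths and recipient below are placeholders.

  import scala.sys.process._

  // Copy the file out of HDFS, encrypt it locally, and push the result back.
  val local     = "/tmp/part-0000.csv"
  val encrypted = local + ".gpg"

  val getRc = Seq("hdfs", "dfs", "-get", "/data/in/part-0000.csv", local).!
  val gpgRc = Seq("gpg", "--batch", "--yes", "--recipient", "ops@example.com",
                  "--output", encrypted, "--encrypt", local).!
  val putRc = Seq("hdfs", "dfs", "-put", "-f", encrypted, "/data/out/").!

  require(getRc == 0 && gpgRc == 0 && putRc == 0, "one of the shell steps failed")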

PGP Encrypt using spark Scala

2019-08-26 Thread Sachit Murarka
Hi All, I want to encrypt my files available at HDFS location using PGP Encryption How can I do it in spark. I saw Apache Camel but it seems camel is used when source files are in Local location rather than HDFS. Kind Regards, Sachit Murarka

Re: Call Oracle Sequence using Spark

2019-08-16 Thread Nicolas Paris
> I have to call Oracle sequence using spark. You might use JDBC and write your own lib in Scala. I did such a thing for Postgres (https://framagit.org/parisni/spark-etl/tree/master/spark-postgres), see sqlExecWithResultSet. On Thu, Aug 15, 2019 at 10:58:11PM +0530, rajat kumar wrote: >
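A minimal sketch of that JDBC route, assuming the Oracle driver is on the classpath and that the connection details and the MY_SEQ sequence name are placeholders.

  import java.sql.DriverManager

  val conn = DriverManager.getConnection(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")
  try {
    // Ask Oracle for the next value of the sequence
    val rs = conn.createStatement().executeQuery("SELECT MY_SEQ.NEXTVAL FROM dual")
    rs.next()
    val nextId = rs.getLong(1)
    println(s"next sequence value: $nextId")
  } finally {
    conn.close()
  }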

Call Oracle Sequence using Spark

2019-08-15 Thread rajat kumar
Hi All, I have to call Oracle sequence using spark. Can you pls tell what is the way to do that? Thanks Rajat

[Pyspark 2.4] Large number of row groups in parquet files created using spark

2019-07-24 Thread Rishi Shah
application using spark-submit? df = spark.read.parquet(INPUT_PATH) df.coalesce(1).write.parquet(OUT_PATH) I did try --conf spark.parquet.block.size & spark.dfs.blocksize, but that makes no difference. -- Regards, Rishi Shah

Re: Looking for a developer to help us with a small ETL project using Spark and Kubernetes

2019-07-18 Thread Sebastian Piu
or a developer to help us with a small ETL project using > Spark and Kubernetes. Here are some of the requirements: > > 1. We need a REST API to run and schedule jobs. We would prefer this done > in Node.js but can be done using Java. The REST API will not be available > to the public

Looking for a developer to help us with a small ETL project using Spark and Kubernetes

2019-07-18 Thread Information Technologies
Hello, We are looking for a developer to help us with a small ETL project using Spark and Kubernetes. Here are some of the requirements: 1. We need a REST API to run and schedule jobs. We would prefer this done in Node.js but can be done using Java. The REST API will not be available to the

IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] while using spark-sql-2.4.1v to read data from oracle

2019-05-08 Thread Shyam P
Hi, I have an Oracle table in which the column schema is: DATA_DATE DATE, with values like 31-MAR-02. I am trying to retrieve data from Oracle using spark-sql-2.4.1. I tried to set the JdbcOptions as below : .option("lowerBound", "2002-03-31 00:00:00"); .option

Request for a working example of using Pregel API in GraphX using Spark Scala

2019-05-05 Thread Basavaraj
Hello All I am a beginner in Spark, trying to use GraphX for an iterative processing by connecting to Kafka Stream Processing Looking for any git reference to real application example, in Scala Please revert with any reference to it, or if someone is trying to build, I could join them Regar

Error while using spark-avro module in pyspark 2.4

2019-05-01 Thread kanchan tewary
Hi All, Greetings! I am facing an error while trying to write my dataframe into avro format, using spark-avro package ( https://spark.apache.org/docs/latest/sql-data-sources-avro.html#deploying). I have added the package while running spark-submit as follows. Do I need to add any additional

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
or that the >> current offset no longer exists on the kafka topic? Moreover, that doesn't >> explain the fact that the spark logs that it is on one offset for that >> partition (5553330) - and then immediately errors out trying to read the >> old offset (4544296) that no l

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
.mechanism", "PLAIN"); > kakaConsumerProperties.put("sasl.jaas.config", "security.protocol"); > kakaConsumerProperties.put("security.protocol", ""); > > and I'm using LocationStrategies.PreferConsistent() > > Thanks > > On

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Austin Weaver
ffset (4544296) that no longer exists? @Akshay - I am using Spark Streaming (D-streams) Here is a snippet of the kafka consumer configuration I am using (redacted some fields) - kakaConsumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, ""); kakaConsume

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Akshay Bhardwaj
Hi Austin, Are you using Spark Streaming or Structured Streaming? For better understanding, could you also provide sample code/config params for your spark-kafka connector for the said streaming job? Akshay Bhardwaj +91-97111-33849 On Mon, Apr 29, 2019 at 10:34 PM Austin Weaver wrote

Issue with offset management using Spark on Dataproc

2019-04-29 Thread Austin Weaver
Hey guys, relatively new Spark Dev here and i'm seeing some kafka offset issues and was wondering if you guys could help me out. I am currently running a spark job on Dataproc and am getting errors trying to re-join a group and read data from a kafka topic. I have done some digging and am not sure

Creating Hive Persistent view using Spark Sql defaults to Sequence File Format

2019-03-19 Thread arun rajesh
Hi All , I am using spark 2.2 in EMR cluster. I have a hive table in ORC format and I need to create a persistent view on top of this hive table. I am using spark sql to create the view. By default spark sql creates the view with LazySerde. How can I change the inputformat to use ORC ? PFA

Using Spark as an ETL tool for moving data from Hive tables to BigQuery

2019-01-03 Thread Mich Talebzadeh
Hi, To move data from Hive to Google BigQuery, one needs to create a staging table in Hive in a storage format that can be read in BigQuery. Both the AVRO and ORC file formats in Hive work, but the files cannot be compressed. In addition, to handle both data types and Double types, best to convert the

Re:Re: running updates using SPARK

2018-12-23 Thread 大啊
Hi Gourav Sengupta, Thank you for providing the information about Databricks. At 2018-12-21 16:55:52, "Gourav Sengupta" wrote: Hi Jiaan, Spark does support UPDATES but in the version that Databricks has. The question to the community was asking when are they going to support it. Regards, Gourav

Re: running updates using SPARK

2018-12-21 Thread Gourav Sengupta
Hi Jiaan, Spark does support UPDATES but in the version that Databricks has. The question to the community was asking when are they going to support it. Regards, Gourav On Fri, 21 Dec 2018, 03:36 Jiaan Geng I think Spark is a Calculation engine design for OLAP or Ad-hoc.Spark is > not > a tradi

Re:running updates using SPARK

2018-12-20 Thread 大啊
I think Spark is a calculation engine designed for OLAP or ad-hoc queries. Spark is not a traditional relational database; UPDATE needs mandatory constraints like transactions and locks. At 2018-12-21 06:05:54, "Gourav Sengupta" wrote: Hi, Is there any time soon that SPARK will support UPDATES?

Re: running updates using SPARK

2018-12-20 Thread Jiaan Geng
I think Spark is a calculation engine designed for OLAP or ad-hoc queries. Spark is not a traditional relational database; UPDATE needs mandatory constraints like transactions and locks. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

running updates using SPARK

2018-12-20 Thread Gourav Sengupta
Hi, Is there any time soon that SPARK will support UPDATES? Databricks does provide Delta which supports UPDATE, but I think that the open source SPARK does not have the UPDATE option. HIVE has been supporting UPDATES for a very very long time now, and I was thinking when would that become avail

Using spark and mesos container with host_path volume

2018-12-03 Thread Antoine DUBOIS
Hello, I'm trying to mount a local ceph volume to my mesos container. My cephfs is mounted on all agent at /ceph I'm using spark 2.4 with hadoop 3.11 and I'm not using Docker to deploy spark. The only option I could find to mount a volume though is the following (which is also

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread Michael Shtelma
Hi, you can use this project in order to read Avro using Spark Structured Streaming: https://github.com/AbsaOSS/ABRiS Spark 2.4 has also built in support for Avro, so you can use from_avro function in Spark 2.4. Best, Michael On Sat, Nov 3, 2018 at 4:34 AM Divya Narayan wrote: > Hi, &g
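A minimal sketch of the built-in Spark 2.4 route, assuming the spark-avro module is on the classpath, an existing SparkSession named spark, and that the schema, brokers and topic below are placeholders.

  import org.apache.spark.sql.avro.from_avro
  import org.apache.spark.sql.functions.col

  // Avro writer schema as a JSON string (placeholder)
  val jsonFormatSchema =
    """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"},{"name":"amount","type":"double"}]}"""

  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()

  // Decode the Kafka value bytes into typed columns
  val decoded = raw
    .select(from_avro(col("value"), jsonFormatSchema).as("event"))
    .select("event.id", "event.amount")

Note that records produced through the Confluent schema registry carry an extra wire-format header (magic byte plus schema id) that plain from_avro does not strip; handling that framing is what the ABRiS project mentioned above is for.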

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread chandan prakash
o data to kafka topic using schema registry and now I want > to use spark streaming to read that data and do some computation in real > time. Can some one please give a sample code for doing that . I couldn't > find any working code online. I am using spark version 2.2.0 and > spark

Read Avro Data using Spark Streaming

2018-11-02 Thread Divya Narayan
Hi, I produced avro data to kafka topic using schema registry and now I want to use spark streaming to read that data and do some computation in real time. Can some one please give a sample code for doing that . I couldn't find any working code online. I am using spark version 2.2.0 and

Re: Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread kathleen li
You can use a Spark SQL window function, something like df.createOrReplaceTempView("dfv") Select count(eventid) over (partition by start_time, end_time order by start_time) from dfv Sent from my iPhone > On Sep 26, 2018, at 11:32 AM, Debajyoti Roy wrote: > > The problem statement and an appro
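If the plain window query is not enough, a hedged alternative sketch (a swapped-in technique: explode each event into +1/-1 boundary rows and take a running sum ordered by time), assuming an existing DataFrame named events with eventid, start_time and end_time columns.

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // One row per boundary: +1 when an event starts, -1 when it ends
  val boundaries = events.select(col("start_time").as("ts"), lit(1).as("delta"))
    .union(events.select(col("end_time").as("ts"), lit(-1).as("delta")))

  // Running sum of the deltas = number of events open at each instant
  val w = Window.orderBy(col("ts")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
  val concurrency = boundaries.withColumn("simultaneous", sum(col("delta")).over(w))

An unpartitioned window pulls everything into a single partition, so this shape only suits moderate volumes.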

Given events with start and end times, how to count the number of simultaneous events using Spark?

2018-09-26 Thread Debajyoti Roy
The problem statement and an approach to solve it using windows is described here: https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e Looking for more elegant/performant solutions, if they exist. TIA !

How to recursively aggregate Treelike(hierarchical) data using Spark?

2018-09-25 Thread newroyker
The problem statement and an approach to solve it recursively is described here: https://stackoverflow.com/questions/52508872/how-to-recursively-aggregate-treelikehierarchical-data-using-spark Looking for more elegant/performant solutions, if they exist. TIA ! -- Sent from: http://apache-spark

Handling Very Large volume(500TB) data using spark

2018-08-25 Thread Great Info
Hi All, I have large volume of data nearly 500TB(from 2016-2018-till date), I have to do some ETL on that data. This data is there in the AWS S3, so I planning to use AWS EMR setup to process this data but I am not sure what should be the config I should select . 1. Do I need to process monthly o

Using Spark Streaming for analyzing changing data

2018-07-30 Thread oripwk
We have a use case where there's a stream of events while every event has an ID and its current state with a timestamp: … 111,ready,1532949947 111,offline,1532949955 111,ongoing,1532949955 111,offline,1532949973 333,offline,1532949981 333,ongoing,1532949987 … We want to ask questions about the

modeling timestamp in Avro messages (read using Spark Structured Streaming)

2018-07-29 Thread karan alang
I've a question regarding modeling a timestamp column in Avro messages. The options are - ISO 8601 "String" (UTC time) - "int" 32-bit signed UNIX epoch time - Long (modeled as a logical datatype - timestamp in the schema) what would be the best way to model the timestamp? fyi. we are using Apache Spark(Stru

Run STA/LTA python function using spark streaming: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

2018-07-09 Thread zakhavan
Hello, I'm trying to run the sta/lta python code which I got it from obspy website using spark streaming and plot the events but I keep getting the following error! "java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute" H

Re: Not able to overwrite cassandra table using Spark

2018-06-27 Thread Siva Samraj
You can try with this, it will work val finaldf = merchantdf.write. format("org.apache.spark.sql.cassandra") .mode(SaveMode.Overwrite) .option("confirm.truncate", true) .options(Map("table" -> "tablename", "keyspace" -> "keyspace")) .save() On Wed 27 Jun, 2018,

Not able to overwrite cassandra table using Spark

2018-06-27 Thread Abhijeet Kumar
Hello Team, I’m creating a dataframe out of cassandra table using datastax spark connector. After making some modification into the dataframe, I’m trying to put that dataframe back to the Cassandra table by overwriting the old content. For that the piece of code is: modifiedList.write.format("

Re: load hbase data using spark

2018-06-20 Thread vaquar khan
Why do you need a tool? You can directly connect to HBase using Spark. Regards, Vaquar khan On Jun 18, 2018 4:37 PM, "Lian Jiang" wrote: Hi, I am considering tools to load hbase data using spark. One choice is https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to be o

load hbase data using spark

2018-06-18 Thread Lian Jiang
Hi, I am considering tools to load hbase data using spark. One choice is https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to be out-of-date (e.g. "This version of 1.0.0 requires Spark 1.4.0."). Which tool should I use for this purpose? Thanks for any hint.

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-06-06 Thread spark receiver
> However, this may not be perfect depending on your use case. Can you please > provide more details/examples? Do you aim at a generic hive to Oracle import > tool using Spark? Sqoop would not be an alternative? > > On 20. Mar 2018, at 03:45, Gurusamy Thirupathy <mailto:thi

How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
Hi, We see below exception when using Spark Kafka streaming 0.10 on a normal Kafka topic. Not sure why offset missing in zk, but since Spark streaming override the offset reset policy to none in the code. I can not set the reset policy to latest(I don't really care data loss now). Is ther

Re: [Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-24 Thread karthikjay
My data looks like this: { "ts2" : "2018/05/01 00:02:50.041", "serviceGroupId" : "123", "userId" : "avv-0", "stream" : "", "lastUserActivity" : "00:02:50", "lastUserActivityCount" : "0" } { "ts2" : "2018/05/01 00:09:02.079", "serviceGroupId" : "123", "userId" : "avv-0", "strea

Write data from Hbase using Spark Failing with NPE

2018-05-23 Thread Alchemist
I am using Spark to write data to HBase. I can read data just fine, but the write is failing with the following exception. I found a similar issue that got resolved by adding *site.xml and the HBase JARs, but it is not working for me. JavaPairRDD tablePuts = hBaseRDD.mapToPair(new PairFunction

[Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-22 Thread karthikjay
I am doing the following aggregation on the data val channelChangesAgg = tunerDataJsonDF .withWatermark("ts2", "10 seconds") .groupBy(window(col("ts2"),"10 seconds"), col("env"), col("servicegroupid")) .agg(count("linetransactionid") as "count1

Getting Data From Hbase using Spark is Extremely Slow

2018-05-17 Thread SparkUser6
I have written four lines of simple spark program to process data in Phoenix table: queryString = getQueryFullString( );// Get data from Phoenix table select col from table JavaPairRDD phRDD = jsc.newAPIHadoopRDD( configuration, Ph

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Marco Mistroni
Hi, sorted. I just replaced s3 with s3a. I think I recall similar issues in the past with AWS libraries. Thanks anyway for getting back. Kr On Wed, May 2, 2018, 4:57 PM Paul Tremblay wrote: > I would like to see the full error. However, S3 can give misleading > messages if you don't have the corr
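In other words, the same write but with the s3a scheme spelled out (bucket and key are placeholders):

  dataFrame.coalesce(1)
    .write.format("com.databricks.spark.csv")
    .save("s3a://my-bucket/exports/report.csv")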

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Paul Tremblay
I would like to see the full error. However, S3 can give misleading messages if you don't have the correct permissions. On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote: > HI all > i am using the following code for persisting data into S3 (aws keys are > already stored in the environment vari

Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-04-24 Thread Marco Mistroni
HI all i am using the following code for persisting data into S3 (aws keys are already stored in the environment variables) dataFrame.coalesce(1).write.format("com.databricks.spark.csv").save(fileName) However, i keep on receiving an exception that the file does not exist here's what comes fro

[How To] Using Spark Session in internal called classes

2018-04-23 Thread Aakash Basu
Hi, I have created my own Model Tuner class which I want to use to tune models and return a Model object if the user expects. This Model Tuner is in a file which I would ideally import into another file and call the class and use it. Outer file {from where I'd be calling the Model Tuner): I am us

Re: How to bulk insert using spark streaming job

2018-04-19 Thread scorpio
You need to insert per partition per batch. Normally database drivers meant for spark have bulk update feature built in. They take a RDD and do a bulk insert per partition. In case db driver you are using doesn't provide this feature, you can aggregate records per partition and then send out to db
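A minimal sketch of that per-partition batching with plain JDBC, assuming a DStream[(String, Int)] named stream of already-aggregated records and that the connection URL, table and columns are placeholders.

  import java.sql.DriverManager

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection and one batched statement per partition per micro-batch
      val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/app", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO metrics(name, value) VALUES (?, ?)")
      try {
        records.foreach { case (name, value) =>
          stmt.setString(1, name)
          stmt.setInt(2, value)
          stmt.addBatch()
        }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }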

Re: How to bulk insert using spark streaming job

2018-04-19 Thread ayan guha
: > How to bulk insert using spark streaming job > > Sent from my iPhone > -- Best Regards, Ayan Guha

How to bulk insert using spark streaming job

2018-04-19 Thread amit kumar singh
How to bulk insert using spark streaming job Sent from my iPhone

Re: Accessing Hive Database (On Hadoop) using Spark

2018-04-15 Thread Nicolas Paris
Hi, Sounds like your configuration files are not filled in correctly. What does spark.sql("SHOW DATABASES").show(); output? If you only see the default database, the investigation here should help: https://stackoverflow.com/questions/47257680/unable-to-get-existing-hive-tables-from-hivecontext-u
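For comparison, a minimal sketch of a session that should be able to see the Hive metastore, assuming hive-site.xml is on the classpath (for example in $SPARK_HOME/conf) and that mydb.mytable is a placeholder.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hive-check")
    .enableHiveSupport()   // without this, Spark falls back to its own in-memory catalog
    .getOrCreate()

  spark.sql("SHOW DATABASES").show()
  spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show()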

Accessing Hive Database (On Hadoop) using Spark

2018-04-15 Thread Rishikesh Gawade
dataset").show(); After this i built the project using Maven as follows: mvn clean package -DskipTests and a JAR was generated. After this, I tried running the project via spark-submit CLI using : spark-submit --class com.adbms.SpamFilter --master yarn ~/IdeaProjects/mlproject/target/mlproje

Merge query using spark sql

2018-04-02 Thread Deepak Sharma
I am using Spark to run a merge query in Postgres SQL. The way it's being done now is to save the data to be merged in Postgres as temp tables, then run the merge queries in Postgres using a Java SQL connection and statement. So basically this query runs in postgres. The queries are insert into source

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Gurusamy Thirupathy
im at a generic hive to > Oracle import tool using Spark? Sqoop would not be an alternative? > > On 20. Mar 2018, at 03:45, Gurusamy Thirupathy > wrote: > > Hi guha, > > Thanks for your quick response, option a and b are in our table already. > For option b, again the sa

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-20 Thread Jörn Franke
Oracle import tool using Spark? Sqoop would not be an alternative? > On 20. Mar 2018, at 03:45, Gurusamy Thirupathy wrote: > > Hi guha, > > Thanks for your quick response, option a and b are in our table already. For > option b, again the same problem, we don't kn

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-19 Thread Gurusamy Thirupathy
Hi guha, Thanks for your quick response, option a and b are in our table already. For option b, again the same problem, we don't know which column is date. Thanks, -G On Sun, Mar 18, 2018 at 9:36 PM, Deepak Sharma wrote: > The other approach would to write to temp table and then merge the dat

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread ayan guha
Hi, The issue is not with Spark in this case, it is with Oracle. If you do not know which columns to apply the date-related conversion rule to, then you have a problem. You should try either a) Define some config file where you can define the table name, date column name and date format @ source so that you can
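A minimal sketch of option a), a per-table mapping of date columns and source formats applied before the JDBC write, assuming an existing SparkSession named spark and that the table names, column names, formats and Oracle URL are placeholders.

  import org.apache.spark.sql.functions.to_date

  // Mapping that would normally come from a config file: table -> (dateColumn, sourceFormat)
  val dateRules = Map("sales_fact" -> ("txn_date", "yyyy-MM-dd"))

  val (dateCol, fmt) = dateRules("sales_fact")
  val hiveDf    = spark.table("warehouse.sales_fact")
  val withDates = hiveDf.withColumn(dateCol, to_date(hiveDf(dateCol), fmt))

  withDates.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
    .option("dbtable", "SALES_FACT")
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save()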

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would to write to temp table and then merge the data. But this may be expensive solution. Thanks Deepak On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy wrote: > Hi, > > I am trying to read data from Hive as DataFrame, then trying to write the > DF into the Oracle data base. I

Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Gurusamy Thirupathy
Hi, I am trying to read data from Hive as DataFrame, then trying to write the DF into the Oracle data base. In this case, the date field/column in hive is with Type Varchar(20) but the corresponding column type in Oracle is Date. While reading from hive , the hive table names are dynamically decid

DataFrameWriter in pyspark ignoring hdfs attributes (using spark-2.2.1-bin-hadoop2.7)?

2018-03-10 Thread Chuan-Heng Hsiao
hi all, I am using spark-2.2.1-bin-hadoop2.7 with stand-alone mode. (python version: 3.5.2 from ubuntu 16.04) I intended to have DataFrame write to hdfs with customized block-size but failed. However, the corresponding rdd can successfully write with the customized block-size. Could you help me

Re: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread naresh Goud
entiate records of one type of entity from other type of >entities. > > > > -Beejal > > > > *From:* naresh Goud [mailto:nareshgoud.du...@gmail.com] > *Sent:* Friday, February 23, 2018 8:56 AM > *To:* Vibhakar, Beejal > *Subject:* Re: Consuming Data in Parallel u

RE: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread Vibhakar, Beejal
records of one type of entity from other type of entities. -Beejal From: naresh Goud [mailto:nareshgoud.du...@gmail.com] Sent: Friday, February 23, 2018 8:56 AM To: Vibhakar, Beejal Subject: Re: Consuming Data in Parallel using Spark Streaming You will have the same behavior both in local and hadoop

Consuming Data in Parallel using Spark Streaming

2018-02-21 Thread Vibhakar, Beejal
I am trying to process data from 3 different Kafka topics using 3 InputDStream with a single StreamingContext. I am currently testing this under Sandbox where I see data processed from one Kafka topic followed by other. Question#1: I want to understand that when I run this program in Hadoop clu

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-26 Thread Rick Moritz
Hi, We solved this the ugly way, when parsing external column definitions: private def columnTypeToFieldType(columnType: String): DataType = { columnType match { case "IntegerType" => IntegerType case "StringType" => StringType case "DateType" => DateType case "FloatType" => Flo
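A self-contained version of that mapping, covering only a handful of primitive types; the fallback to StringType at the end is an assumption and can be replaced with an error.

  import org.apache.spark.sql.types._

  def columnTypeToFieldType(columnType: String): DataType = columnType match {
    case "IntegerType"   => IntegerType
    case "LongType"      => LongType
    case "DoubleType"    => DoubleType
    case "FloatType"     => FloatType
    case "BooleanType"   => BooleanType
    case "DateType"      => DateType
    case "TimestampType" => TimestampType
    case "StringType"    => StringType
    case _               => StringType   // assumption: default rather than fail on unknown names
  }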

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread Kurt Fehlhauer
Can you share your code and a sample of your data? Without seeing it, I can't give a definitive answer, but I can offer some hints. If you have a column of strings, you should be able to create a new column cast to Integer. This can be accomplished two ways: df.withColumn("newColumn", df.curre

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
It seems like its hard to construct a DataType given its String literal representation. dataframe.types() return column names and its corresponding Types. for example say I have an integer column named "sum" doing dataframe.dtypes() would return "sum" and "IntegerType" but this string representat

how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
Hi All, I have a datatype "IntegerType" represented as a String and now I want to create DataType object out of that. I couldn't find in the DataType or DataTypes api on how to do that? Thanks!

How to use schema from one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0?

2017-12-23 Thread kant kodali
Hi All, How to use value (schema) of one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0? I have the following *source data frame* that I create from reading messages from Kafka col1: string col2: json string col1

How to kill a query job when using spark thrift-server?

2017-11-27 Thread 张万新
Hi, I intend to use spark thrift-server as a service to support concurrent sql queries. But in our situation we need a way to kill arbitrary query job, is there an api to use here?

Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-24 Thread Chris Luby
I have an external catalog that has additional information on my Parquet files that I want to match up with the parsed filters from the plan to prune the list of files included in the scan. I’m looking at doing this using the Spark 2.2.0 SparkSession extensions similar to the built in partition
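A minimal sketch of the wiring such an injection usually takes; the rule below is a no-op placeholder standing in for the real file-pruning logic.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // Placeholder rule: real logic would match pushed-down filters against the external catalog
  case class PruneFilesFromCatalog(session: SparkSession) extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  val spark = SparkSession.builder()
    .appName("extensions-demo")
    .withExtensions { ext =>
      ext.injectOptimizerRule(session => PruneFilesFromCatalog(session))
    }
    .getOrCreate()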

Re: Checkpoints not cleaned using Spark streaming + watermarking + kafka

2017-09-22 Thread MathieuP
The expected setting to clean these files is : - spark.sql.streaming.minBatchesToRetain More info on structured streaming settings : https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-sql-streaming-properties.adoc -- Sent from: http://apache-spark-user-list.10
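For reference, the retention can be lowered on the session before the query starts (the value 10 below is only an example):

  spark.conf.set("spark.sql.streaming.minBatchesToRetain", 10)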

Checkpoints not cleaned using Spark streaming + watermarking + kafka

2017-09-21 Thread MathieuP
using spark-submit on a single master and writes on the local hard drive. It runs fine until the number of checkpoints files in "state" directory totally fills the disk. It is due to the fact that there is no more inode available (not a space issue ; but tens of thousands inodes are con

UnpicklingError while using spark streaming

2017-07-13 Thread lovemoon
spark2.1.1 & python2.7.11 I want to union another rdd in Dstream.transform() like below: sc = SparkContext() ssc = StreamingContext(sc, 1) init_rdd = sc.textFile('file:///home/zht/PycharmProjects/test/text_file.txt') lines = ssc.socketTextStream('localhost', ) lin

Re: Using Spark as a simulator

2017-07-07 Thread Steve Loughran
On 7 Jul 2017, at 08:37, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote: I only want to simulate a very huge "network" with even millions of parallel time-synchronized actors (state machines). There is also communication between actors via some (key-value pairs) database. I also want th

VS: VS: Using Spark as a simulator

2017-07-07 Thread Esa Heikkinen
4 To: Esa Heikkinen Cc: Mahesh Sawaiker; user@spark.apache.org Subject: Re: VS: Using Spark as a simulator Spark dropped Akka some time ago... I think the main issue he will face is a library for simulating the state machines (randomly), storing a huge amount of files (HDFS is probably

Re: VS: Using Spark as a simulator

2017-07-07 Thread Jörn Franke
Sent: 21 June 2017 14:45 > To: Esa Heikkinen; Jörn Franke > Cc: user@spark.apache.org > Subject: RE: Using Spark as a simulator > > Spark can help you to create one large file if needed, but hdfs itself will > provide abstraction over such things, so it’s a trivia

VS: Using Spark as a simulator

2017-07-06 Thread Esa Heikkinen
Spark was originally built on it (Akka). Esa From: Mahesh Sawaiker Sent: 21 June 2017 14:45 To: Esa Heikkinen; Jörn Franke Cc: user@spark.apache.org Subject: RE: Using Spark as a simulator Spark can help you to create one large file
