Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try to shuffle to S3 using the Cloud Shuffle Storage Plugin for S3 (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - the performance of the new plugin is sufficient for many Spark jobs (it also works on EMR). Then you can use S3 lifecycle
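A minimal sketch of the lifecycle idea, assuming boto3 and a placeholder bucket/prefix; the actual prefix depends on where the shuffle plugin is configured to write:

    import boto3

    # Expire temporary shuffle objects after one day so the bucket does not grow unbounded.
    # Bucket name and prefix are placeholders - adjust to the configured shuffle storage path.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-shuffle-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-spark-shuffle",
                    "Filter": {"Prefix": "spark-shuffle/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 1},
                }
            ]
        },
    )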

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread Jörn Franke
It could be simpler and faster to use tagging of resources for billing: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags-billing.html That could also include other resources (eg S3). > On 12.12.2023 at 04:47, Jack Wells wrote: > > Hello Spark experts - I’m running

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
It is just a goal… however I would not tune the number of regions or region size yet. Simply specify the GC algorithm and max heap size. Try to tune other options only if there is a need, only one at a time (otherwise it is difficult to determine cause/effect), and have a performance testing framework in

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
If you do tests with newer Java versions you can also try: - UseNUMA: -XX:+UseNUMA. See https://openjdk.org/jeps/345 You can also assess the new Java GC algorithms: - -XX:+UseShenandoahGC - works with terabyte-sized heaps - more memory efficient than ZGC with heaps <32 GB. See also:
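A minimal sketch of how such flags could be passed to the executor JVMs from PySpark; the flag shown is just the Shenandoah example above and assumes a JVM build that ships that collector:

    from pyspark.sql import SparkSession

    # Pass the GC choice and an explicit max heap; driver-side JVM options are usually
    # set on spark-submit instead, since the driver JVM may already be running.
    spark = (
        SparkSession.builder
        .appName("gc-experiment")
        .config("spark.executor.memory", "8g")
        .config("spark.executor.extraJavaOptions", "-XX:+UseShenandoahGC")
        .getOrCreate()
    )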

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS. What you could try (I have not tested it in your scenario): - use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html - host the zip file on an https server and use that url (I

Re: automatically/dinamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross-account permission to the Glue job role? Why the detour via AssumeRole? AssumeRole can make sense if you use an AWS IAM user and STS authentication, but this would make no sense within AWS for cross-account access, as attaching the permissions to the Glue job role is

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
Identity federation may ease this compared to a secret store. On 01.10.2023 at 08:27, Jon Rodríguez Aranguren wrote: Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
management headaches and also allows a lot more flexibility on access control and option to allow access to multiple S3 buckets in the same pod. We have implemented this across Azure, Google and AWS. Azure does require some extra work to make it work. On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static IAM (S3) credentials. It is an outdated, insecure method - even AWS recommends against using this for anything (cf eg https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html). It is almost a guarantee to get your data stolen and your account manipulated. If

Re: Log4j2 upgrade

2022-01-12 Thread Jörn Franke
You cannot simply replace it - log4j2 has a slightly different API than log4j. The Spark source code needs to be changed in a couple of places. > On 12.01.2022 at 20:53, Amit Sharma wrote: > > Hello, everyone. I am replacing log4j with log4j2 in my spark streaming > application. When i

Re: hive table with large column data size

2022-01-09 Thread Jörn Franke
It is not a good practice to do this. Just store a reference to the binary data stored on HDFS. > On 09.01.2022 at 15:34, weoccc wrote: > > Hi, > > I want to store binary data (such as images) into hive table but the binary > data column might be much larger than other columns per row.

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Jörn Franke
Is it in any case appropriate to use log4j 1.x, which is not maintained anymore and has other security vulnerabilities which won’t be fixed anymore? > On 13.12.2021 at 06:06, Sean Owen wrote: > > Check the CVE - the log4j vulnerability appears to affect log4j 2, not 1.x. > There was

Re: Naming files while saving a Dataframe

2021-07-18 Thread Jörn Franke
Spark heavily depends on Hadoop for writing files. You can try to set the Hadoop property mapreduce.output.basename: https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration-- > On 18.07.2021 at 01:15, Eric Beabes wrote: > > Mich - You're
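A short sketch of setting that Hadoop property from PySpark; the `_jsc` handle is the internal Java SparkContext, and the resulting file names still depend on the output committer in use, so verify on your setup:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ask the Hadoop output format to use "mydata" instead of "part" as the file prefix.
    spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.output.basename", "mydata")

    spark.range(10).write.mode("overwrite").csv("/tmp/basename-test")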

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
It really depends on what language your data scientists speak. I don’t think it makes sense for ad hoc data science things to impose a language on them, but let them choose. For more complex AI engineering things you can, though, apply different standards and criteria. And then it really depends on

Re: Merging Parquet Files

2020-08-31 Thread Jörn Franke
Why only one file? I would go more for files of a specific size, eg data split into 1 GB files. The reason is also that if you need to transfer it (eg to other clouds etc) - having a large file of several terabytes is bad. It depends on your use case but you might also look at partitions etc. >

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Jörn Franke
Is the directory available on all nodes? > On 26.08.2020 at 22:08, kuassi.men...@oracle.com wrote: > > Mich, > > All looks fine. > Perhaps some special chars in username or password? > >> it is recommended not to use such characters like '@', '.' in your password. > Best, Kuassi > On

Re: Are there some pitfalls in my spark structured streaming code which causes slow response after several hours running?

2020-07-18 Thread Jörn Franke
It depends a bit on the data as well, but have you investigated in the Spark UI which executor/task becomes slow? Could it also be the database from which you load data? > On 18.07.2020 at 17:00, Yong Yuan wrote: > > The spark job has the correct functions and logic. However, after several

Re: Mocking pyspark read writes

2020-07-07 Thread Jörn Franke
Write to a local temp directory via file://? > On 07.07.2020 at 20:07, Dark Crusader wrote: > > Hi everyone, > > I have a function which reads and writes a parquet file from HDFS. When I'm > writing a unit test for this function, I want to mock this read & write. > > How do you achieve
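A minimal sketch of that suggestion for a unit test; `process_parquet` is a hypothetical function under test that takes input and output paths:

    import tempfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()

    with tempfile.TemporaryDirectory() as tmp:
        in_path = f"file://{tmp}/input"
        out_path = f"file://{tmp}/output"

        spark.range(5).write.parquet(in_path)        # arrange: local test fixture instead of HDFS
        # process_parquet(spark, in_path, out_path)  # act: call the function under test
        # spark.read.parquet(out_path).show()        # assert: inspect the output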

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Jörn Franke
By doing a select on the df? > On 25.06.2020 at 14:52, Tzahi File wrote: > > Hi, > > I'm using pyspark to write df to s3, using the following command: > "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". > > Is there any way to get the partitions
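A sketch of that idea, assuming the `df` and `s3_output` from the question and the standard key=value directory layout that partitionBy produces:

    partition_rows = df.select("day", "hour", "country").distinct().collect()
    partition_paths = [
        f"{s3_output}/day={r['day']}/hour={r['hour']}/country={r['country']}"
        for r in partition_rows
    ]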

Re: Reading TB of JSON file

2020-06-19 Thread Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline. > On 19.06.2020 at 14:37, Chetan Khatri wrote: > > All transactions in JSON, It is not a single array. > >> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner >> wrote: >> It's an interesting problem. What is the
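A short sketch of the difference, with placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # JSON Lines (one object per line): splittable, reads in parallel, no need to hold
    # the whole document in memory.
    df = spark.read.json("hdfs:///data/transactions.jsonl")

    # A single huge JSON array/object would instead need multiLine=True and is far
    # harder to process in parallel.
    df_multi = spark.read.option("multiLine", True).json("hdfs:///data/transactions.json")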

Re: Reading TB of JSON file

2020-06-18 Thread Jörn Franke
Depends on the data types you use. Do you have it in JSON Lines format? Then the amount of memory plays much less of a role. Otherwise, if it is one large object or array, I would not recommend it. > On 18.06.2020 at 15:12, Chetan Khatri wrote: > > Hi Spark Users, > > I have a 50GB of JSON

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Jörn Franke
e hdfs is doing a better job at this. > Does this make sense? > > I would also like to add that we built an extra layer on S3 which might be > adding to even slower times. > > Thanks for your help. > >> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, wrote: >> Have y

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Jörn Franke
Have you looked in the Spark UI why this is the case? S3 reading can take more time - it also depends which S3 URL scheme you are using: s3a vs s3n vs s3. It could help after some calculation to persist in-memory or on HDFS. You can also initially load from S3, store on HDFS and work from there.
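A sketch of the two options mentioned, with placeholder paths:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Option 1: read once via the s3a connector and persist before repeated computations.
    df = spark.read.parquet("s3a://my-bucket/input/")
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # action to materialize the cache

    # Option 2: stage the data on HDFS once and run the actual workload from there.
    df.write.mode("overwrite").parquet("hdfs:///staging/input/")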

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-23 Thread Jörn Franke
>> Disclaimer: Use it at your own risk. Any and all responsibility for any >> loss, damage or destruction of data or any other property which may arise >> from relying on this email's technical content is explicitly disclaimed. The >> author will in no case be liable for any mo

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Jörn Franke
Is there a reason why different Scala versions (it seems at least 2.10/2.11) are mixed? This never works. Do you by accident include a dependency with an old Scala version? Ie the HBase datasource maybe? > On 17.02.2020 at 22:15, Mich Talebzadeh wrote: > > Thanks Muthu, > > I am

Re: Does explode lead to more usage of memory

2020-01-18 Thread Jörn Franke
Why not two tables and then you can join them? This would be the standard way. It depends what your full use case is, what volumes/orders you expect on average, and what aggregations and filters look like. The example below states that you do a select all on the table. > On 19.01.2020 at 01:50

Re: GraphX performance feedback

2019-11-25 Thread Jörn Franke
I think it depends what you want to do. Interactive big data graph analytics are probably better off in JanusGraph or similar. Batch processing (once-off) can still be fine in GraphX - though you have to carefully design the process. > On 25.11.2019 at 20:04, mahzad kalantari wrote: > > Hi

Re: Spark Cluster over yarn cluster monitoring

2019-10-27 Thread Jörn Franke
Use YARN queues: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html > On 27.10.2019 at 06:41, Chetan Khatri wrote: > > Could someone please help me to understand better.. > >> On Thu, Oct 17, 2019 at 7:41 PM Chetan Khatri >> wrote: >> Hi Users, >>
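Once queues are defined in the scheduler configuration, a Spark-on-YARN job can be pinned to one of them; a sketch with a placeholder queue name:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("queued-job")
        .config("spark.yarn.queue", "analytics")  # placeholder queue defined in the YARN scheduler
        .getOrCreate()
    )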

Re: Conflicting PySpark Storage Level Defaults?

2019-09-16 Thread Jörn Franke
I don’t know your full source code, but you may be missing an action so that it is indeed persisted. > On 16.09.2019 at 02:07, grp wrote: > > Hi There Spark Users, > > Curious what is going on here. Not sure if possible bug or missing > something. Extra eyes are much appreciated. > > Spark
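A minimal sketch of that point - persist() only marks the DataFrame, and nothing shows up under Storage in the UI until an action runs (`df` is a placeholder):

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_AND_DISK)  # lazy: nothing is cached yet
    df.count()                                # action: data is materialized and the effective
                                              # storage level becomes visible in the UI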

Re: Control Sqoop job from Spark job

2019-09-03 Thread Jörn Franke
I would not say that. The only “issue” with Spark is that you need to build some functionality on top which is available in Sqoop out of the box, especially for import processes and if you need to define a lot of them. > On 03.09.2019 at 09:30, Shyam P wrote: > > Hi Mich, > Lot of people

Re: Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Jörn Franke
1) This is not a use case, but a technical solution. Hence nobody can tell you if it makes sense or not. 2) Do an upsert in Cassandra. However, keep in mind that the application submitting to the Kafka topic and the one consuming from the Kafka topic need to ensure that they process messages in

Re: Any advice how to do this usecase in spark sql ?

2019-08-13 Thread Jörn Franke
Have you tried to join both datasets, filter accordingly and then write the full dataset to your filesystem? Alternatively, work with a NoSQL database that you update by key (eg it sounds like a key/value store could be useful for you). However, it could also be that you need to do more depending on

Re: Spark scala/Hive scenario

2019-08-07 Thread Jörn Franke
You can use the map datatype on the Hive table for the columns that are uncertain: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes However, maybe you can share more concrete details, because there could also be other solutions. > On

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code? > On 07.08.2019 at 08:50, Rishikesh Gawade wrote: > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if i use parquet files the result would be same, because after all > sparkSQL

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Jörn Franke
I would remove all the GC tuning and add it later once you have found the underlying root cause. Usually more GC means you need to provide more memory, because something has changed (your application, Spark version etc.). We don’t have your full code to give exact advice, but you may want to rethink

Re: Spark SaveMode

2019-07-19 Thread Jörn Franke
This is not an issue of Spark, but the underlying database. The primary key constraint has a purpose and ignoring it would defeat that purpose. Then to handle your use case, you would need to make multiple decisions that may imply you don’t want to simply insert if not exist. Maybe you want to

Re: [Pyspark 2.3+] Timeseries with Spark

2019-06-13 Thread Jörn Franke
Time series can mean a lot of different things and algorithms. Can you describe more what you mean by time series use case, ie what is the input, what would you like to do with the input and what is the output? > On 14.06.2019 at 06:01, Rishi Shah wrote: > > Hi All, > > I have a time series

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread Jörn Franke
Depending on what accuracy is needed, HyperLogLogs can be an interesting alternative: https://en.m.wikipedia.org/wiki/HyperLogLog > On 09.06.2019 at 15:59, big data wrote: > > From my opinion, Bitmap is the best solution for active users calculation. > Other solution almost bases on
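In Spark terms the HyperLogLog route is available directly as approx_count_distinct; a sketch with a placeholder DataFrame `events` and placeholder column names:

    from pyspark.sql import functions as F

    # HLL-based approximate distinct count: a small relative error (rsd) in exchange for
    # far less memory than an exact countDistinct.
    active_users = (
        events.groupBy("day")
              .agg(F.approx_count_distinct("user_id", rsd=0.01).alias("active_users"))
    )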

Re: writing into oracle database is very slow

2019-04-18 Thread Jörn Franke
What is the size of the data? How much time does it need on HDFS and how much on Oracle? How many partitions do you have on the Oracle side? > On 06.04.2019 at 16:59, Lian Jiang wrote: > > Hi, > > My spark job writes into oracle db using: > df.coalesce(10).write.format("jdbc").option("url", url)
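For reference, the write parallelism on the Oracle side comes from the number of DataFrame partitions at write time; a hedged sketch with placeholder connection details and `df` from the question:

    (df.repartition(10)                      # 10 partitions -> up to 10 concurrent JDBC connections
       .write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
       .option("dbtable", "target_table")
       .option("user", "scott")
       .option("password", "***")
       .option("batchsize", "10000")         # rows per JDBC batch round trip
       .mode("append")
       .save())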

Re: Spark SQL API taking longer time than DF API.

2019-03-31 Thread Jörn Franke
Is the select taking longer, or the saving to a file? You seem to save to a file only in the second case. > On 29.03.2019 at 15:10, neeraj bhadani wrote: > > Hi Team, > I am executing same spark code using the Spark SQL API and DataFrame API, > however, Spark SQL is taking longer than

Re: Spark does not load all classes in fat jar

2019-03-18 Thread Jörn Franke
Fat jar with shading as the application, not as an additional jar package. > On 18.03.2019 at 14:08, Jörn Franke wrote: > > Maybe that class is already loaded as part of a core library of Spark? > > Do you have concrete class names? > > In doubt create a fat jar and s

Re: Spark does not load all classes in fat jar

2019-03-18 Thread Jörn Franke
Maybe that class is already loaded as part of a core library of Spark? Do you have concrete class names? If in doubt, create a fat jar and shade the dependencies in question. > On 18.03.2019 at 12:34, Federico D'Ambrosio wrote: > > Hello everyone, > > We're having a serious issue, where we get

Re: Masking username in Spark with regexp_replace and reverse functions

2019-03-17 Thread Jörn Franke
For the approach below you have to check for collisions, ie different names leading to the same masked value. You could hash it. However, in order to avoid that someone can just try different hashes, you need to include a different random factor in each name. However, the anonymization problem is bigger,

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Jörn Franke
yarn application -kill applicationid? > On 10.02.2019 at 13:30, Serega Sheypak wrote: > > Hi there! > I have weird issue that appears only when tasks fail at specific stage. I > would like to imitate failure on my own. > The plan is to run problematic app and then kill entire executor or

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-22 Thread Jörn Franke
You can try with YARN node labels: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html Then you can whitelist nodes. > On 19.01.2019 at 00:20, Serega Sheypak wrote: > > Hi, is there any possibility to tell Scheduler to blacklist specific nodes in > advance?

Re: cache table vs. parquet table performance

2019-01-16 Thread Jörn Franke
I believe the in-memory solution misses the storage indexes that Parquet/ORC have. The in-memory solution is more suitable if you iterate over the whole set of data frequently. > On 15.01.2019 at 19:20, Tomas Bartalos wrote: > > Hello, > > I'm using spark-thrift server and I'm searching

Re: spark application takes significant some time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
the jobs has completed > quite some time ago and the output directory is also updated at that time. > Thanks, > Akshay > > >> On Tue, Dec 25, 2018 at 5:30 PM Jörn Franke wrote: >> Do you have a lot of small files? Do you use S3 or similar? It could be that >>

Re: spark application takes significant some time to succeed even after all jobs are completed

2018-12-25 Thread Jörn Franke
Do you have a lot of small files? Do you use S3 or similar? It could be that Spark does some IO-related tasks. > On 25.12.2018 at 12:51, Akshay Mendole wrote: > > Hi, > As you can see in the picture below, the application last job finished > at around 13:45 and I could see the output

Re: [SPARK SQL] Difference between 'Hive on spark' and Spark SQL

2018-12-20 Thread Jörn Franke
If you already have a lot of queries then it makes sense to look at Hive (in a recent version) + Tez + LLAP with all tables in ORC format, partitioned and sorted on filter columns. That would be the easiest way and can improve performance significantly. If you want to use Spark, eg because you

Re: Spark Scala reading from Google Cloud BigQuery table throws error

2018-12-18 Thread Jörn Franke
Maybe the Guava version in your Spark lib folder is not compatible (if your Spark version has a Guava library)? In this case I propose to create a fat/uber jar, potentially with a shaded Guava dependency. > On 18.12.2018 at 11:26, Mich Talebzadeh wrote: > > Hi, > > I am writing a small test

Re: Zookeeper and Spark deployment for standby master

2018-11-26 Thread Jörn Franke
I guess it is the usual thing - if the non-ZooKeeper processes take too much memory, disk space etc, it will negatively affect ZooKeeper and thus your whole running cluster. You will have to make a risk assessment for your specific architectural setting as to whether this is acceptable. > On 26.11.2018

Re: streaming pdf

2018-11-19 Thread Jörn Franke
And you have to write your own input format, but this is not so complicated (probably recommended anyway for the PDF case). > On 20.11.2018 at 08:06, Jörn Franke wrote: > > Well, I am not so sure about the use cases, but what about using > StreamingContext.fileStr

Re: streaming pdf

2018-11-19 Thread Jörn Franke
-scala.reflect.ClassTag-scala.reflect.ClassTag-scala.reflect.ClassTag- > On 19.11.2018 at 09:22, Nicolas Paris wrote: > >> On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote: >> Why does it have to be a stream? >> > > Right now I manage the pipelines as spark batch process

Re: streaming pdf

2018-11-18 Thread Jörn Franke
Why does it have to be a stream? > On 18.11.2018 at 23:29, Nicolas Paris wrote: > > Hi > > I have pdf to load into spark with at least > format. I have considered some options: > > - spark streaming does not provide a native file stream for binary with > variable size (binaryRecordStream

Re: writing to local files on a worker

2018-11-11 Thread Jörn Franke
Can you use JNI to call the C++ functionality directly from Java? Or you could wrap this into an MR step outside Spark and use Hadoop Streaming (it allows you to use shell scripts as mapper and reducer)? You can also write temporary files for each partition and execute the software within a map

Re: [Spark SQL] INSERT OVERWRITE to a hive partitioned table (pointing to s3) from spark is too slow.

2018-11-04 Thread Jörn Franke
Can you share some relevant source code? > On 05.11.2018 at 07:58, ehbhaskar wrote: > > I have a pyspark job that inserts data into hive partitioned table using > `Insert Overwrite` statement. > > Spark job loads data quickly (in 15 mins) to temp directory (~/.hive-***) in > S3. But, it's

Re: How to avoid long-running jobs blocking short-running jobs

2018-11-03 Thread Jörn Franke
Hi, What does your Spark deployment architecture look like? Standalone? YARN? Mesos? Kubernetes? Those have resource managers (not middleware) that allow you to implement scenarios such as the one you want to achieve. In any case you can try the FairScheduler of any of those solutions. Best regards > On

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
A lot of small files is very inefficient in itself, and predicate pushdown will not help you much there unless you merge them into one large file (one large file can be processed much more efficiently). How did you validate that predicate pushdown did not work in Hive? Your Hive version is also

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause significant delay in processing - try to merge as much as possible into one file. Can you please share the full source code in Hive and Spark as well as the versions you are using? > On 31.10.2018 at 18:23, gpatcham wrote: > > > > When

Re: dremel paper example schema

2018-10-31 Thread Jörn Franke
I would try with the same version as Spark uses first. I don’t have the changelog of Parquet in my head (but you can find it on the Internet), but it could be the cause of your issues. > On 31.10.2018 at 12:26, lchorbadjiev wrote: > > Hi Jorn, > > I am using Apache Spark 2.3.1. > > For

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-30 Thread Jörn Franke
Older versions of Spark indeed had lower performance for Python and R due to a needed conversion between JVM datatypes and Python/R datatypes. This changed in Spark 2.2, I think, with the integration of Apache Arrow. However, what you do after the conversion in those languages can still be

Re: dremel paper example schema

2018-10-30 Thread Jörn Franke
Are you using the same Parquet version as Spark uses? Are you using a recent version of Spark? Why don’t you create the file in Spark? > On 30.10.2018 at 07:34, lchorbadjiev wrote: > > Hi Gourav, > > the question in fact is are there any the limitations of Apache Spark > support for Parquet

Re: Is spark not good for ingesting into updatable databases?

2018-10-27 Thread Jörn Franke
Do you have some code that you can share? Maybe it is something in your code that unintentionally duplicates it? Maybe your source (eg the application putting it on Kafka?) duplicates them already? Once-and-only-once processing needs to be done end to end. > On 27.10.2018 at 02:10,

Re: Triggering sql on Was S3 via Apache Spark

2018-10-23 Thread Jörn Franke
Why not directly access the S3 file from Spark? You need to configure the IAM roles so that the machine running the S3 code is allowed to access the bucket. > On 24.10.2018 at 06:40, Divya Gehlot wrote: > > Hi Omer, > Here are couple of the solutions which you can implement for your use
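A sketch of querying the S3 object directly, assuming the instance/role has been granted access to the bucket and using placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the file straight from S3 via the s3a connector, then run SQL on it.
    df = spark.read.option("header", True).csv("s3a://my-bucket/input/data.csv")
    df.createOrReplaceTempView("input_data")
    spark.sql("SELECT country, count(*) FROM input_data GROUP BY country").show()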

Re: Process Million Binary Files

2018-10-11 Thread Jörn Franke
I believe your use case can be better covered with your own data source reading PDF files. On big data platforms in general you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Jörn Franke
Generally please avoid System.out.println, but use a logger - even for examples. People may take these examples from here and put them in their production code. > On 09.10.2018 at 15:39, Shubham Chaurasia wrote: > > Alright, so it is a big project which uses a SQL store underneath. > I extracted

Re: Use SparkContext in Web Application

2018-10-04 Thread Jörn Franke
Depending on your model size you can store it as PFA or PMML and run the prediction in Java. For larger models you will need a custom solution, potentially using a Spark Thrift Server / Spark Job Server / Livy and a cache to store predictions that have already been calculated (eg based on previous

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Jörn Franke
Looks like a firewall issue. > On 03.10.2018 at 09:34, Aakash Basu wrote: > > The stacktrace is below - > >> --- >> Py4JJavaError Traceback (most recent call last) >> in () >> > 1 df = >>

Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Jörn Franke
You can create your own data source doing exactly this. Why is the file name important if the file content is the same? > On 24. Sep 2018, at 13:53, Soheil Pourbafrani wrote: > > Hi, My text data are in the form of text file. In the processing logic, I > need to know each word is from which

Re: Use Shared Variable in PySpark Executors

2018-09-22 Thread Jörn Franke
Do you want to calculate it and share it once with all other executors? Then a broadcast variable may be interesting for you. > On 22. Sep 2018, at 16:33, Soheil Pourbafrani wrote: > > Hi, I want to do some processing with PySpark and save the results in a > variable of type tuple that should
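A minimal sketch of the broadcast suggestion - compute the value once on the driver and let every executor read the shared copy:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lookup = sc.broadcast(("a", "b", "c"))          # read-only copy shipped to all executors

    result = (
        sc.parallelize(range(4))
          .map(lambda i: (i, lookup.value[i % 3]))  # executors access it via .value
          .collect()
    )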

Re: Time-Series Forecasting

2018-09-19 Thread Jörn Franke
What functionality do you need ? Ie which methods? > On 19. Sep 2018, at 18:01, Mina Aslani wrote: > > Hi, > I have a question for you. Do we have any Time-Series Forecasting library in > Spark? > > Best regards, > Mina -

Re: Drawing Big Data tech diagrams using Pen Tablets

2018-09-12 Thread Jörn Franke
You can try cloud services such as draw.io or similar. > On 12. Sep 2018, at 20:31, Mich Talebzadeh wrote: > > Hi Gourav, > > I have an IPAD that my son uses it and not me (for games). I don't see much > value in spending $$$ on Surface. Then I had montblanc augmented paper that > kinf of

Re: How to parallelize zip file processing?

2018-08-10 Thread Jörn Franke
Does the zip file contain only one file? I fear in this case you can only have one core. By the way, do you mean gzip? In that case you cannot decompress it in parallel... How is the zip file created? Can’t you create several ones? > On 10. Aug 2018, at 22:54, mytramesh wrote: > > I know,

Re: Spark Sparser library

2018-08-10 Thread Jörn Franke
You need to include the library in your dependencies. Furthermore, the * at the end does not make sense. > On 10. Aug 2018, at 07:48, umargeek wrote: > > Hi Team, > > Please let me know the spark Sparser library to use while submitting the > spark application to use below mentioned format, >

Re: Broadcast variable size limit?

2018-08-05 Thread Jörn Franke
I think if you need more, then you should anyway think about something different than a broadcast variable ... > On 5. Aug 2018, at 16:51, klrmowse wrote: > > is it currently still ~2GB (Integer.MAX_VALUE) ?? > > or am i misinformed, since that's what google-search and scouring this > mailing

Re: Do GraphFrames support streaming?

2018-07-14 Thread Jörn Franke
stion now would be can it be done in streaming fashion? Are you > talking about the union of two streaming dataframes and then constructing a > graphframe (also during streaming) ? > >> On Sat, Jul 14, 2018 at 8:07 AM, Jörn Franke wrote: >> For your use case one might

Re: Do GraphFrames support streaming?

2018-07-14 Thread Jörn Franke
nything else at this point but of course, it's > great to have. > > If we were to do this myself should I extend the GraphFrame? any suggestions? > > >> On Sun, Apr 29, 2018 at 3:24 AM, Jörn Franke wrote: >> What is the use case you are trying to solve? >

Re: Inferring Data driven Spark parameters

2018-07-03 Thread Jörn Franke
Don’t do this in your job. Create different jobs for the different types of work and orchestrate them using Oozie or similar. > On 3. Jul 2018, at 09:34, Aakash Basu wrote: > > Hi, > > Cluster - 5 node (1 Driver and 4 workers) > Driver Config: 16 cores, 32 GB RAM > Worker Config: 8 cores, 16 GB

Re: Dataframe reader does not read microseconds, but TimestampType supports microseconds

2018-07-02 Thread Jörn Franke
How do you read the files? Do you have some source code? It could be related to the JSON data source. What Spark version do you use? > On 2. Jul 2018, at 09:03, Colin Williams > wrote: > > I'm confused as to why Sparks Dataframe reader does not support reading json > or similar with

Re: How to validate orc vectorization is working within spark application?

2018-06-19 Thread Jörn Franke
Full code? What is the expected performance and what is the actual one? What is the use case? > On 20. Jun 2018, at 05:33, umargeek wrote: > > Hi Folks, > > I would just require few pointers on the above query w.r.t vectorization > looking forward for support from the community. > > Thanks, > Umar > > > > --

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
If it is in kB then Spark will always schedule it to one node. As soon as it gets bigger you will see usage of more nodes. Hence, increase your testing dataset. > On 11. Jun 2018, at 12:22, Aakash Basu wrote: > > Jorn - The code is a series of feature engineering and model tuning >

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-11 Thread Jörn Franke
What is your code? Maybe it does an operation which is bound to a single host, or your data volume is too small for multiple hosts. > On 11. Jun 2018, at 11:13, Aakash Basu wrote: > > Hi, > > I have submitted a job on 4 node cluster, where I see, most of the operations > happening at

Re: Spark / Scala code not recognising the path?

2018-06-09 Thread Jörn Franke
Why don’t you write the final name from the start? Ie save the file under the name it should have. > On 9. Jun 2018, at 09:44, Abhijeet Kumar wrote: > > I need to rename the file. I can write a separate program for this, I think. > > Thanks, > Abhijeet Kumar >> On 09-Jun-2018,

Re: Spark / Scala code not recognising the path?

2018-06-09 Thread Jörn Franke
ease tell the estimated time. So, that my program will wait for > that time period. > > Thanks, > Abhijeet Kumar >> On 09-Jun-2018, at 12:01 PM, Jörn Franke wrote: >> >> You need some time until the information of the file creation is propagated. >> >>>

Re: Spark / Scala code not recognising the path?

2018-06-09 Thread Jörn Franke
It takes some time until the information about the file creation is propagated. > On 9. Jun 2018, at 08:07, Abhijeet Kumar wrote: > > I'm modifying a CSV file which is inside HDFS and finally putting it back to > HDFS in Spark. > val fs=FileSystem.get(spark.sparkContext.hadoopConfiguration) >

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
get out of scope and their memory can be > released. > > Also, assuming that the variables are not daisy-chained/inter-related as that > too will not make it easy. > > > From: Jay > Date: Monday, June 4, 2018 at 9:41 PM > To: Shuporno Choudhury > Cc: "Jör

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
how it will affect whatever I am already doing? > Do you mean running a different spark-submit for each different dataset when > you say 'an independent python program for each process '? > >> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] >> wrote:

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
Why don’t you modularize your code and write an independent Python program for each process that is submitted via Spark? Not sure though if Spark local makes sense. If you don’t have a cluster then a normal Python program can be much better. > On 4. Jun 2018, at 21:37, Shuporno Choudhury >

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
s across multiple nodes. > > Thanks & Regards, > Neha Jain > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Monday, June 4, 2018 10:48 AM > To: Sing, Jasbir > Cc: user@spark.apache.org; Patel, Payal ; Jain, > Neha T. > Subject: [External] Re: Sor

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
rks first item in the partition as true other items in that partition as > false. > If my sorting order is disturbed, the flag is wrongly set. > > Please suggest what else could be done to fix this very basic scenario of > sorting in Spark across multiple partitions across multiple

Re: Sorting in Spark on multiple partitions

2018-06-03 Thread Jörn Franke
You partition by userid, why do you then sort again by userid in the partition? Can you try to remove userid from the sort? How do you check if the sort is correct or not? What is the underlying objective of the sort? Do you have more information on schema and data? > On 4. Jun 2018, at
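A sketch of the suggestion - after repartitioning by userid every row of a user is already in one partition, so the within-partition sort only needs the remaining keys (`df` and the time column are placeholders):

    sorted_df = (
        df.repartition("userid")
          .sortWithinPartitions("event_time")  # no need to sort by userid again inside the partition
    )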

Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Jörn Franke
Can your database receive the writes concurrently? Ie do you make sure that each executor writes into a different partition on the database side? > On 25. May 2018, at 16:42, Yong Zhang wrote: > > Spark version 2.2.0 > > > We are trying to write a DataFrame to remote

Re: Time series data

2018-05-24 Thread Jörn Franke
There is not one answer to this. It really depends what kind of time series analysis you do with the data and what time series database you are using. Then it also depends what ETL you need to do. You seem to also need to join data - is it with existing data of the same type or do you join

Re:

2018-05-16 Thread Jörn Franke
How many rows do you have in total? > On 16. May 2018, at 11:36, Davide Brambilla > wrote: > > Hi all, >we have a dataframe with 1000 partitions and we need to write the > dataframe into a MySQL using this command: > > df.coalesce(20) >

Re: [Java] impact of java 10 on spark dev

2018-05-16 Thread Jörn Franke
The first thing would be that Scala supports them. Then for other things someone might need to redesign the Spark source code to leverage modules - this could be a rather handy feature to have a small but very well designed core (core, ml, graph etc) around which others write useful modules. > On

Re: spark sql StackOverflow

2018-05-15 Thread Jörn Franke
3000 filters do not look reasonable. This is very difficult to test and verify, as well as impossible to maintain. Could it be that your filters are another table that you should join with? The example is a little bit too artificial to understand the underlying business case. Can you
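A sketch of the join idea, with placeholder names for `spark`, `df` and the key columns; broadcasting the small filter table keeps the join cheap:

    from pyspark.sql import functions as F

    # Put the filter values into their own DataFrame instead of OR-ing thousands of predicates.
    filter_df = spark.createDataFrame([("k1",), ("k2",), ("k3",)], ["filter_key"])

    result = df.join(F.broadcast(filter_df), df["key"] == filter_df["filter_key"], "inner")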

Re: Measure performance time in some spark transformations.

2018-05-13 Thread Jörn Franke
Can’t you find this in the Spark UI or timeline server? > On 13. May 2018, at 00:31, Guillermo Ortiz Fernández > wrote: > > I want to measure how long it takes some different transformations in Spark > as map, joinWithCassandraTable and so on. Which one is the

Re: ordered ingestion not guaranteed

2018-05-11 Thread Jörn Franke
What DB do you have? You have some options, such as: 1) use a key-value store (they can be accessed very efficiently) to see if there has been a newer key already processed - if yes then ignore the value, if no then insert it into the database; 2) redesign the key to include the timestamp and find out the

Re: A naive ML question

2018-04-29 Thread Jörn Franke
>>> STARTED, PENDING, CANCELLED, COMPLETED, SETTLED etc... >>> >>> Thanks, >>> kant >>> >>>> On Sat, Apr 28, 2018 at 4:11 AM, Jörn Franke <jornfra...@gmail.com> wrote: >>>> What do you mean by “how it evolved over time” ? A tran

Re: Do GraphFrames support streaming?

2018-04-29 Thread Jörn Franke
What is the use case you are trying to solve? You want to load graph data from a streaming window into separate graphs - possible, but it probably requires a lot of memory. You want to update an existing graph with new streaming data and then fully rerun an algorithm -> look at JanusGraph. You want
