Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Jörn Franke
Is it in any case appropriate to use log4j 1.x, which is not maintained anymore and has other security vulnerabilities that won’t be fixed anymore? > On 13.12.2021 at 06:06, Sean Owen wrote: > > > Check the CVE - the log4j vulnerability appears to affect log4j 2, not 1.x. > There was

Re: Hive external table not working in sparkSQL when subdirectories are present

2019-08-07 Thread Jörn Franke
Do you use the HiveContext in Spark? Do you configure the same options there? Can you share some code? > On 07.08.2019 at 08:50, Rishikesh Gawade wrote: > > Hi. > I am using Spark 2.3.2 and Hive 3.1.0. > Even if I use parquet files the result would be the same, because after all > sparkSQL
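For reference, a sketch of the workaround often suggested for this situation (untested here; the table name is a placeholder) - telling the Hadoop input layer to descend into subdirectories:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: enable recursive listing for Hive external tables
// whose partitions contain subdirectories. "my_external_table" is made up.
val spark = SparkSession.builder()
  .appName("hive-subdirectories")
  .enableHiveSupport()
  .getOrCreate()

spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.sparkContext.hadoopConfiguration
  .set("hive.mapred.supports.subdirectories", "true")

spark.sql("SELECT COUNT(*) FROM my_external_table").show()
```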

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Jörn Franke
I would remove all the GC tuning and add it back later once you have found the underlying root cause. Usually more GC means you need to provide more memory, because something has changed (your application, Spark version etc.). We don’t have your full code to give exact advice, but you may want to rethink
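To illustrate the advice, a baseline sketch with no GC tuning at all (values are placeholders; in practice these settings usually go on the spark-submit command line rather than in code):

```scala
import org.apache.spark.SparkConf

// Baseline without GC tuning: default GC, no -XX flags, more memory headroom.
val conf = new SparkConf()
  .set("spark.executor.memory", "8g")           // raise heap instead of tuning GC
  .set("spark.executor.memoryOverhead", "1g")   // off-heap headroom
  .set("spark.executor.extraJavaOptions", "")   // drop custom GC flags for the baseline
```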

Re: Custom datasource: when acquire and release a lock?

2019-05-26 Thread Jörn Franke
What does your data source structure look like? Can’t you release it at the end of the buildScan method? What technology is used in the transactional data endpoint? > On 24.05.2019 at 15:36, Abhishek Somani wrote: > > Hi experts, > > I am trying to create a custom Spark Datasource(v1) to
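As an illustration of the suggestion, a minimal DataSource v1 sketch that releases the lock at the end of buildScan; acquireLock, releaseLock and readRowsWhileLocked are stand-ins for whatever the transactional endpoint actually provides:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch of a v1 relation that scopes the lock to buildScan.
class LockedRelation(override val sqlContext: SQLContext,
                     override val schema: StructType) extends BaseRelation with TableScan {

  override def buildScan(): RDD[Row] = {
    acquireLock()
    try {
      // read (or plan) the data while the lock is held
      readRowsWhileLocked()
    } finally {
      releaseLock() // released at the end of buildScan, as suggested above
    }
  }

  // placeholders for the transactional endpoint's actual API
  private def acquireLock(): Unit = ???
  private def releaseLock(): Unit = ???
  private def readRowsWhileLocked(): RDD[Row] = ???
}
```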

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Jörn Franke
Also on AWS and probably some more cloud providers > On 19.03.2019 at 19:45, Steve Loughran wrote: > > > you might want to look at the work on FPGA resources; again it should just be > a resource available by a scheduler. Key thing is probably just to keep the > docs generic > >

Re: [External] Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Jörn Franke
almost exactly 100ms to > process 1 result (as seen by the consecutive TID’s below) or any logging I > may be able to turn on to narrow the search. > > There are no errors or warnings in the logs. > > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: Mon

Re: [Spark RPC] Help how to debug sudden performance issue

2019-03-11 Thread Jörn Franke
Well, it is a little bit difficult to say, because a lot of things are mixed up here. What function is calculated? Does it need a lot of memory? Could it be that you run out of memory, some spillover happens and you have a lot of IO to disk which is blocking? Related to that could be 1

Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-09 Thread Jörn Franke
Maybe it is better to introduce a new datatype that supports negative scale; otherwise the migration and testing efforts for organizations running Spark applications become too large. Of course the current decimal will be kept as it is. > On 07.01.2019 at 15:08, Marco Gaido wrote: > > In

Re: Self join

2018-12-11 Thread Jörn Franke
I don’t know your exact underlying business problem, but maybe a graph solution, such as Spark GraphX, better meets your requirements. Usually self-joins are done to address some kind of graph problem (even if you would not describe it as such), and for these kinds of problems a graph solution is much more
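For illustration, a small GraphX sketch of a typical self-join use case ("who reports to whom") expressed as a graph instead; the data is invented:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-instead-of-self-join").getOrCreate()
val sc = spark.sparkContext

// vertices: (id, name); edges: employee -> manager
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(2L, 1L, "reportsTo"), Edge(3L, 2L, "reportsTo")))

val graph = Graph(vertices, edges)

// who reports (directly) to whom - what the self-join would have computed
graph.triplets
  .map(t => s"${t.srcAttr} reports to ${t.dstAttr}")
  .collect()
  .foreach(println)
```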

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Jörn Franke
not re-apply pushed filters. If the data source lies, many things can > go wrong... > >> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote: >> Well even if it has to apply it again, if pushdown is activated then it will >> be much less cost for Spark to see if the filter h

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Jörn Franke
hat). > > Is there any other option I am not considering? > > Best regards, > Alessandro > > On Sat, 8 Dec 2018, 12:32 Jörn Franke wrote: >> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily >> ending up in Spark ( because i

Re: Pushdown in DataSourceV2 question

2018-12-08 Thread Jörn Franke
BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (because it would cause unnecessary overhead). In the datasource v2 API you need to implement SupportsPushDownFilters > On 08.12.2018 at 10:50, Noritaka Sekiyama wrote: > > Hi, > > I'm a support
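A sketch of that interface against the Spark 2.4-era DataSource V2 API; the filter capability shown (IsNotNull only) is made up for the example, and a real source would match its own predicate support:

```scala
import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// The reader reports which filters it accepted; Spark re-applies the ones
// returned (unsupported) from pushFilters().
class JsonLikeReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def readSchema(): StructType = schema

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    // keep the filters we can evaluate at the source, hand the rest back
    val (supported, unsupported) = filters.partition {
      case _: IsNotNull => true   // made-up capability for the sketch
      case _            => false
    }
    pushed = supported
    unsupported
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList() // a real source would plan its partitions here
}
```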

Re: Pushdown in DataSourceV2 question

2018-12-08 Thread Jörn Franke
It was already available before DataSourceV2, but I think it might have been an internal/semi-official API (e.g. JSON has been an internal datasource for some time now). The filters were provided to the datasource, but you will never know if the datasource has indeed leveraged them or if for other

Re: Spark Utf 8 encoding

2018-11-10 Thread Jörn Franke
Is the original file indeed UTF-8? Especially Windows environments tend to mess up the files (e.g. Java on Windows does not use UTF-8 by default). However, the software that processed the data earlier could also have modified it. > On 10.11.2018 at 02:17, lsn24 wrote: > > Hello, > > Per the
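One way to test that hypothesis (path and charsets are placeholders; windows-1252 is a common Windows default) is to read the file twice with explicit encodings and compare suspicious rows:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("encoding-check").getOrCreate()

val asUtf8   = spark.read.option("encoding", "UTF-8").csv("/path/to/file.csv")
val asCp1252 = spark.read.option("encoding", "windows-1252").csv("/path/to/file.csv")

// compare the same rows under both decodings
asUtf8.show(5, truncate = false)
asCp1252.show(5, truncate = false)
```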

Re: Coalesce behaviour

2018-10-15 Thread Jörn Franke
This is not fully correct. If you have fewer files then you need to move some data to other nodes, because not all the data is there for writing (this is even the case for the same node, but then it is easier from a network perspective). Hence, a shuffle is needed. > On 15.10.2018 at 05:04
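For illustration, the distinction in code (paths are placeholders): coalesce merges existing partitions without a full shuffle, while repartition always shuffles to rebalance the data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()
val df = spark.read.parquet("/path/to/input")

df.coalesce(10).write.parquet("/path/to/out-coalesced")     // narrow dependency, may skew
df.repartition(10).write.parquet("/path/to/out-shuffled")   // full shuffle, balanced output
```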

Re: Remove Flume support in 3.0.0?

2018-10-10 Thread Jörn Franke
I think it makes sense to remove it. If it is not too much effort, and the architecture of the Flume source is not considered too strange, one could extract it as a separate project and put it on GitHub in a dedicated, non-supported repository. This would enable distributors and other companies

Re: Support for Second level of concurrency

2018-09-25 Thread Jörn Franke
What is the ultimate goal of this algorithm? There could already be algorithms within Spark that can do this. You could also put a message on Kafka (or another broker) and have Spark applications listen to it to trigger further computation, as sketched below. This would also be more controlled and can be done
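A sketch of the broker idea (Spark 2.4+ for foreachBatch; topic and servers are placeholders, and the spark-sql-kafka-0-10 package is assumed on the classpath): one application publishes a "done" message, a second Spark application subscribes and runs its own work per message.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("kafka-trigger-listener").getOrCreate()

val triggers = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "computation-triggers")         // placeholder topic
  .load()

val query = triggers.selectExpr("CAST(value AS STRING) AS msg").writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // kick off the follow-up computation per received message
    batch.collect().foreach(row => println(s"trigger received: ${row.getString(0)}"))
  }
  .start()

query.awaitTermination()
```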

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Jörn Franke
Can’t you remove the dependency on the Databricks CSV data source? Spark has had it integrated for some versions now, so it is not needed. > On 31. Aug 2018, at 05:52, Srabasti Banerjee > wrote: > > Hi, > > I am trying to run below code to read file as a dataframe onto a Stream (for > Spark
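For reference, the built-in CSV source used directly (schema and path are placeholders; streaming sources need an explicit schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("builtin-csv").getOrCreate()

val schema = new StructType().add("id", IntegerType).add("name", StringType)

val stream = spark.readStream
  .schema(schema)            // required for file-based streaming sources
  .option("header", "true")
  .format("csv")             // built-in source, not "com.databricks.spark.csv"
  .load("/path/to/dir")
```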

Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Jörn Franke
hanging issues with MPI-based > programs). > > >> On Mon, May 7, 2018 at 10:05 PM Jörn Franke <jornfra...@gmail.com> wrote: >> Hadoop / Yarn 3.1 added GPU scheduling. 3.2 is planned to add FPGA >> scheduling, so it might be worth to have the last point generic that

Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Jörn Franke
Hadoop / YARN 3.1 added GPU scheduling; 3.2 is planned to add FPGA scheduling. So it might be worth making the last point generic, so that not only the Spark scheduler but all supported schedulers can use GPUs. For the other 2 points I just wonder if it makes sense to address this in the ml

Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Jörn Franke
we might go for an external >> library that most likely have to be reimplemented twice in Python… >> Or there might be a way to force our lib execution in the same JVM as Spark >> uses. To be seen… Again the most elegant way would be the datasource. >> >> Cheers, >>

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
Some notes on the internal API - it used to change with each release, which was quite annoying because other data sources (Avro, HadoopOffice etc.) had to follow up on this. In the end it is an internal API and thus is not guaranteed to be stable. If you want to have something stable you have to

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
Spark at some point in time used an internal API, which is not the data source API, for the formats shipped with Spark (e.g. Parquet). You can look at how this is implemented for Parquet and co. in the Spark source code. Maybe this is the issue you are facing? Have you tried to put your

Re: Best way to Hive to Spark migration

2018-04-05 Thread Jörn Franke
And the usual hint when migrating - do not only migrate but also optimize the ETL process design - this brings the most benefits > On 5. Apr 2018, at 08:18, Jörn Franke <jornfra...@gmail.com> wrote: > > Ok this is not much detail, but you are probably best off if y

Re: Best way to Hive to Spark migration

2018-04-05 Thread Jörn Franke
such as a cost-based optimizer. > On 5. Apr 2018, at 08:02, Pralabh Kumar <pralabhku...@gmail.com> wrote: > > Hi > > I have a lot of ETL jobs (complex ones); since they are SLA critical, I am > planning to migrate them to Spark. > >> On Thu, Apr 5, 2018

Re: Best way to Hive to Spark migration

2018-04-04 Thread Jörn Franke
You need to provide more context on what you currently do in Hive and what you expect from the migration. > On 5. Apr 2018, at 05:43, Pralabh Kumar wrote: > > Hi Spark group > > What's the best way to migrate Hive to Spark > > 1) Use HiveContext of Spark > 2) Use

Re: Spark scala development in Sbt vs Maven

2018-03-05 Thread Jörn Franke
I think most of the Scala development in Spark happens with sbt - in the open source world. However, you can do it with Gradle and Maven as well. It depends on your organization etc. what your standard is. Some things might be more cumbersome to reach in non-sbt Scala scenarios, but this is
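For reference, a minimal sbt build for a Spark application (names and versions are illustrative only):

```scala
// build.sbt - minimal sketch; name and versions are placeholders
name := "my-spark-app"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided" // provided by the cluster at runtime
)
```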

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Jörn Franke
ned in sorted order on some column) could make a big >>> difference. Probably the simplest argument for a lot of time being spent >>> sorting (in some use cases) is the fact it's still one of the standard >>> benchmarks. >>> >>> On Mon, Dec

Re: Spark Data Frame. PreSorded partitions

2017-12-04 Thread Jörn Franke
I do not think that the data source API exposes such a thing. You can however propose it for inclusion in data source API v2. However, there are some caveats, because sorted can mean two different things (weak vs. strict order). Then, is really a lot of time lost because of sorting? The best

Re: SparkSQL not support CharType

2017-11-23 Thread Jörn Franke
Or ByteType depending on the use case > On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier > wrote: > > You need to use a StringType. The CharType and VarCharType are there to > ensure compatibility with Hive and ORC; they should not be used anywhere

Re: is there a way for removing hadoop from spark

2017-11-12 Thread Jörn Franke
running somewhere. > > >> On Sun, 12 Nov 2017 at 17:17 Jörn Franke <jornfra...@gmail.com> wrote: >> Why do you even mind? >> >> > On 11. Nov 2017, at 18:42, Cristian Lorenzetto >> > <cristian.lorenze...@gmail.com> wrote: >> >

Re: is there a way for removing hadoop from spark

2017-11-12 Thread Jörn Franke
Why do you even mind? > On 11. Nov 2017, at 18:42, Cristian Lorenzetto > wrote: > > Considering the case I needn't HDFS, is there a way for removing completely > Hadoop from Spark? > Is YARN the unique dependency in Spark? > is there no java or scala (jdk

Re: Task failures and other problems

2017-11-09 Thread Jörn Franke
Maybe contact Oracle support? Have you maybe accidentally configured some firewall rules? Routing issues? Maybe only on one of the nodes... > On 9. Nov 2017, at 20:04, Jan-Hendrik Zab wrote: > > > Hello! > > This might not be the perfect list for the issue, but I tried

Re: Joining 3 tables with 17 billions records

2017-11-02 Thread Jörn Franke
version 2.0.1 with MapR distribution. > > Writing every table to parquet and reading it could be very much time > consuming, currently entire job could take ~8 hours on 8 node of 100 Gig ram > 20 core cluster, not only used utilized by me but by larger team. > > Thanks > >

Re: Joining 3 tables with 17 billions records

2017-11-02 Thread Jörn Franke
Hi, do you have a more detailed log/error message? Also, can you please provide details on the tables (no. of rows, columns, size etc.)? Is this just a one-time thing or something regular? If it is a one-time thing then I would tend more towards putting each table in HDFS (Parquet or ORC) and
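A sketch of that staging approach (table and path names are invented): land each table in HDFS as Parquet once, then join the columnar copies.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// land each table once as Parquet
spark.table("db.table1").write.mode("overwrite").parquet("hdfs:///staging/table1")
spark.table("db.table2").write.mode("overwrite").parquet("hdfs:///staging/table2")

// then join the columnar copies instead of the original tables
val a = spark.read.parquet("hdfs:///staging/table1")
val b = spark.read.parquet("hdfs:///staging/table2")
a.join(b, Seq("join_key")).write.parquet("hdfs:///staging/joined")
```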

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jörn Franke
Scala 2.12 is not yet supported in Spark - this means no JDK 9 either: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220 If you look at the Oracle support then JDK 9 is anyway only supported for 6 months. JDK 8 is LTS (5 years), JDK 18.3 will be only 6 months and JDK 18.9 is

Re: Spark-XML maintenance

2017-10-26 Thread Jörn Franke
I would address Databricks with this issue - it is their repository > On 26. Oct 2017, at 18:43, comtef wrote: > > I've used Spark for a couple of years and I found a way to contribute to the > cause :). > I've found a blocker in the Spark XML extension >

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Jörn Franke
Try sparkSession.conf().set, as sketched below. > On 28. Jul 2017, at 12:19, Chetan Khatri wrote: > > Hey Dev/ User, > > I am working with Spark 2.0.1 and with dynamic partitioning with Hive facing > below issue: > > org.apache.hadoop.hive.ql.metadata.HiveException: > Number of
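For illustration, the settings this kind of HiveException usually points at, set the suggested way (values are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch for Spark 2.0.x with Hive support; values illustrative.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("hive.exec.max.dynamic.partitions", "2000") // raise if the insert creates more partitions
```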

Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Jörn Franke
I do think this is the right way: you will have to do testing with test data, verifying that the expected output of the calculation is the actual output. Even if the logical plan is correct, your calculation might not be. E.g. there can be bugs in Spark, in the UI, or (what happens very often) in the client

Re: spark sql versus interactive hive versus hive

2017-02-11 Thread Jörn Franke
I think this is a rather simplistic view. All these tools do the computation in-memory in the end. For certain types of computation and usage patterns it makes sense to keep the data in memory. For example, most of the machine learning approaches require including the same data in several iterative
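A sketch of that iterative-access pattern (path invented; the dataset is assumed to already carry the usual label/features columns): cache the training data once so each iteration reads from memory.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-for-iterations").getOrCreate()

// assumed to contain "label" and "features" columns
val training = spark.read.parquet("/path/to/training").cache()

// each of the (up to) 100 iterations re-reads `training` from memory
val model = new LogisticRegression().setMaxIter(100).fit(training)
```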

Re: Maximum limit for akka.frame.size be greater than 500 MB ?

2017-01-29 Thread Jörn Franke
Which Spark version are you using? What are you trying to do exactly, and what is the input data? As far as I know, Akka has been dropped in recent Spark versions. > On 30 Jan 2017, at 00:44, aravasai wrote: > > I have a spark job running on 2 terabytes of data which
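For recent Spark versions, where Akka is gone, the analogous knob is spark.rpc.message.maxSize (in MB); a sketch, though a 500 MB message is usually a sign the job should avoid shipping such large results rather than raise the limit:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raise the RPC message limit (in MB, value illustrative).
val spark = SparkSession.builder()
  .config("spark.rpc.message.maxSize", "256")
  .getOrCreate()
```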

Re: MLlib mission and goals

2017-01-24 Thread Jörn Franke
I also agree with Joseph and Sean. With respect to spark-packages, I think the issue is that you have to add it manually, although it basically fetches the package from Maven Central (or a custom upload). From an organizational perspective there are other issues, e.g. you have to download it

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Jörn Franke
Hi, what about YARN or Mesos used in combination with Spark? They also have cgroups. Or a Kubernetes etc. deployment. > On 15 Dec 2016, at 17:37, Hegner, Travis wrote: > > Hello Spark Devs, > > > I have finally completed a mostly working proof of concept. I do not want

Re: Dynamic Graph Handling

2016-10-24 Thread Jörn Franke
Maybe TitanDB?! It uses HBase to store graphs and Solr (on HDFS) to index graphs. I am not 100% sure it supports it, but probably. It can also integrate Spark, but only for analytics on a given graph. Otherwise you need to go for a dedicated graph system. > On 24 Oct 2016, at 16:41, Marco

Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should also take into account that Spark has different options to represent data in-memory, such as Java serialized objects, Kryo serialized, Tungsten (columnar, optionally compressed) etc. The Tungsten variant depends heavily on the underlying data, and on sorting, especially if compressed. Then,
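For example, the serializer choice among those options is a one-line configuration (illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Switch RDD serialization from Java serialized objects to Kryo.
val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```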

Re: Bitmap Indexing to increase OLAP query performance

2016-06-29 Thread Jörn Franke
Is it the traditional bitmap indexing? I would not recommend it for big data. You could use bloom filters and min/max indexes in-memory, which look to be more appropriate. However, if you want to use bitmap indexes then you would have to do it as you say. However, bitmap indexes may consume a

Re: Question about Bloom Filter in Spark 2.0

2016-06-22 Thread Jörn Franke
You should see it at both levels: there is one bloom filter for ORC data and one for data in-memory. It is already a good step towards an integration of format and in-memory representation for columnar data. > On 22 Jun 2016, at 14:01, BaiRan wrote: > > After building
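Both levels, sketched (column and path names invented; whether the write option is honored depends on the ORC implementation in use):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.parquet("/path/to/input")

// level 1: a bloom filter baked into the ORC files at write time
df.write
  .option("orc.bloom.filter.columns", "user_id")
  .orc("/path/to/out")

// level 2: a bloom filter computed over in-memory data (Spark 2.0+)
val bf = df.stat.bloomFilter("user_id", 1000000L, 0.03)
println(bf.mightContain(42))
```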

Re: Structured Streaming partition logic with respect to storage and fileformat

2016-06-21 Thread Jörn Franke
Based on the underlying Hadoop FileFormat. This one does it mostly based on block size. You can change this, though - see the sketch below. > On 21 Jun 2016, at 12:19, Sachin Aggarwal wrote: > > > when we use readStream to read data as a Stream, how does Spark decide the no. of > RDDs and
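The knobs alluded to, sketched (values illustrative; exact splitting depends on the format):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// target split size for file-based sources (bytes)
spark.conf.set("spark.sql.files.maxPartitionBytes", (128 * 1024 * 1024).toString)

// a file stream can additionally cap how many files enter each micro-batch
val schema = new StructType().add("id", LongType)
val stream = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "10")
  .json("/path/to/dir")
```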

Re: Spark performance comparison for research

2016-02-29 Thread Jörn Franke
I am not sure what you are comparing here. You would need to provide additional details, such as the algorithms and functionality supported by your framework. For instance, Spark has built-in fault tolerance and is a generic framework, which is an advantage with respect to development and operations, but

Re: Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Jörn Franke
How did you configure the YARN queues? Which scheduler? Preemption? > On 19 Feb 2016, at 06:51, Prabhu Joseph wrote: > > Hi All, > > When running concurrent Spark jobs on YARN (Spark-1.5.2) which share a > single Spark Context, the jobs take more time to complete

Re: 回复: Spark 1.6.0 + Hive + HBase

2016-01-28 Thread Jörn Franke
Probably a newer Hive version makes a lot of sense here - at least 1.2.1. What storage format are you using? I think the old Hive version had a bug where it always scanned all partitions unless you limited it in the ON clause of the query to a certain partition (e.g. on date=20201119) > On 28 Jan
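The workaround described, sketched as SQL (table, column and partition values invented):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("""
  SELECT h.*, b.*
  FROM hive_table h
  JOIN hbase_table b
    ON h.key = b.key AND h.date = 20160128  -- partition restricted in the ON clause
""")
```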

Re: OLAP query using spark dataframe with cassandra

2015-11-08 Thread Jörn Franke
Is there any distributor supporting these software components in combination? If not, and your core business is not software, then you may want to look for something else, because it might not make sense to build up internal know-how in all of these areas. In any case - it all depends highly on

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Jörn Franke
Would it be possible to use views to address some of your requirements? Alternatively it might be better to parse it yourself. There are open source libraries for this, if you really need a complete SQL parser. Do you want to do it on subqueries? > On 05 Nov 2015, at 23:34, Yana Kadiyska

Re: SIGTERM 15 Issue : Spark Streaming for ingesting huge text files using custom Receiver

2015-09-12 Thread Jörn Franke
I am not sure what you are trying to achieve here. Have you thought about using Flume? Additionally, maybe something like rsync? On Sat, 12 Sep 2015 at 0:02, Varadhan, Jawahar wrote: > Hi all, > I have coded a custom receiver which receives kafka messages.

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Jörn Franke
Well, what do you do in case of failure? I think one should use a professional ingestion tool that ideally does not need to reload everything in case of failure, and that verifies the file has been transferred correctly via checksums. I am not sure if Flume supports FTP, but SSH/SCP should be