Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Georg Heiler
Would you plan to keep the existing indexing mechanism then? https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index For me, even when trying to use the distributed version, it always resulted in various window functions being

Re: pip/conda distribution headless mode

2020-08-30 Thread Georg Heiler
Many thanks. Best, Georg On Mon, Aug 31, 2020 at 01:12, Xiao Li wrote: > Hi, Georg, > > This is being tracked by https://issues.apache.org/jira/browse/SPARK-32017 You > can leave comments in the JIRA. > > Thanks, > > Xiao > > On Sun, Aug 30, 2020 at 3:06 PM

pip/conda distribution headless mode

2020-08-30 Thread Georg Heiler
Hi, I want to use pyspark as distributed via conda in headless mode. It looks like the hadoop binaries are bundled (= pip distributes a default version) https://stackoverflow.com/questions/63661404/bootstrap-spark-itself-on-yarn. I want to ask if it would be possible to A) distribute the

custom FileStreamSource which reads from one partition onwards

2019-09-20 Thread Georg Heiler
Hi, to the best of my knowledge, the existing FileStreamSource reads all the files in a directory (hive table). However, I need to be able to specify an initial partition it should start from (i.e. like a Kafka offset/initial warmed-up state) and then only read data which is semantically (i.e. using a
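A hedged workaround sketch until such a source exists: stream the whole directory but prune everything before the warm-up partition with a predicate on the partition column (column name and path are assumptions); note that Spark still lists the older files, it just drops their rows.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("warm-start").getOrCreate()
    import spark.implicits._

    // file stream sources require an explicit schema
    val schema = StructType(Seq(
      StructField("value", StringType),
      StructField("date", StringType)))   // assumed partition column

    val stream = spark.readStream
      .schema(schema)
      .parquet("/data/events")            // hypothetical table path
      .where($"date" >= "2019-09-01")     // start from this partition onwards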

Re: Custom Window Function

2019-01-25 Thread Georg Heiler
Hi, https://stackoverflow.com/questions/32100973/how-to-define-and-use-a-user-defined-aggregate-function-in-spark-sql has a good overview and the best sample I have found so far (besides the Spark source code). Best, Georg On Wed, Jan 23, 2019 at 17:16, Georg Heiler < georg.kf.
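For reference, a minimal typed Aggregator in the spirit of that answer (a geometric mean; names are illustrative). On Spark 3.0+ it can be wrapped with functions.udaf and applied over a window, though it is not Catalyst-optimised the way built-in window expressions are.

    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.{Encoder, Encoders}

    object GeoMean extends Aggregator[Double, (Double, Long), Double] {
      def zero: (Double, Long) = (1.0, 0L)
      def reduce(b: (Double, Long), a: Double): (Double, Long) =
        (b._1 * a, b._2 + 1)
      def merge(x: (Double, Long), y: (Double, Long)): (Double, Long) =
        (x._1 * y._1, x._2 + y._2)
      def finish(r: (Double, Long)): Double = math.pow(r._1, 1.0 / r._2)
      def bufferEncoder: Encoder[(Double, Long)] =
        Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Spark 3.0+ usage sketch:
    // df.withColumn("gm", functions.udaf(GeoMean).apply($"x")
    //   .over(Window.partitionBy($"k")))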

Custom Window Function

2019-01-23 Thread Georg Heiler
Hi, I want to write custom window functions in Spark which are also optimisable by Catalyst. Can you provide some hints on where to start? Also posting to the dev list as I believe this is a rather exotic topic. Best, Georg

Derby dependency missing for spark-hive from 2.2.1 onwards

2018-03-31 Thread Georg Heiler
Hi, I noticed that Spark standalone (locally for development) will no longer support the integrated Hive metastore, as some driver classes for Derby seem to be missing from 2.2.1 onwards (2.3.0). It works just fine on 2.2.0 or previous versions to execute the following script:
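A build.sbt sketch of the usual workaround: pin Derby explicitly on the application classpath (the Derby version matching Spark 2.x here is an assumption, check your distribution).

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-hive" % "2.3.0",
      "org.apache.derby" %  "derby"      % "10.12.1.1"  // assumed matching version
    )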

Re: Compiling Spark UDF at runtime

2018-01-12 Thread Georg Heiler
You could store the jar in HDFS. Then even in yarn-cluster mode the workaround you describe should work. Michael Shtelma wrote on Fri, Jan 12, 2018 at 12:58: > Hi all, > > I would like to be able to compile a Spark UDF at runtime. Right now I > am using Janino for that. > My problem
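A sketch of that workaround (paths and class name are hypothetical): ship the Janino-compiled jar via HDFS so executors in yarn-cluster mode can fetch it, then register a function backed by it; the SQL route assumes Hive support is enabled and the generated class implements a Hive UDF interface.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // make the generated jar visible to all executors
    spark.sparkContext.addJar("hdfs:///apps/udfs/generated-udfs.jar")

    // register a function backed by a class from that jar
    spark.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.GeneratedUdf'")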

Re: Schema Evolution in Apache Spark

2018-01-11 Thread Georg Heiler
Isn't this related to the data format used, e.g. Parquet, Avro, ..., which already support changing schemas? Dongjoon Hyun wrote on Fri, Jan 12, 2018 at 02:30: > Hi, All. > > A data schema can evolve in several ways and Apache Spark 2.3 already > supports the
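For example, Parquet already merges additive schema changes on read; a minimal sketch (path assumed):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read
      .option("mergeSchema", "true")   // reconcile differing file schemas
      .parquet("/data/events")         // hypothetical path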

custom column types for JDBC datasource writer

2017-07-05 Thread Georg Heiler
Hi, is it possible to somehow make Spark use not VARCHAR(255) but something bigger, e.g. CLOB, for Strings? If not, is it at least possible to catch the exception that is thrown? To me, it seems that Spark is catching and logging it - so I can no longer intervene and handle it:
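One known escape hatch, as a hedged sketch (the URL prefix is an assumption for your database): register a custom JdbcDialect that maps StringType to CLOB instead of the default VARCHAR(255) before writing.

    import java.sql.Types
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    object ClobDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:oracle")               // assumed target database
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType => Some(JdbcType("CLOB", Types.CLOB))
        case _          => None                     // fall back to defaults
      }
    }

    JdbcDialects.registerDialect(ClobDialect)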

spark encoder not working for UDF

2017-06-25 Thread Georg Heiler
Hi, I have a custom Spark Kryo encoder, but that one is not in scope for the UDFs to work. https://stackoverflow.com/questions/44735235/spark-custom-kryo-encoder-not-providing-schema-for-udf Regards, Georg
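For context, a sketch reproducing the setup from the question: a Kryo encoder serialises the whole object into a single binary column, which is why UDFs cannot see a per-field schema; a product encoder keeps the structure.

    import org.apache.spark.sql.{Encoder, Encoders}

    case class Foo(bar: String)

    // Kryo: Dataset[Foo] becomes one `value: binary` column
    implicit val fooEncoder: Encoder[Foo] = Encoders.kryo[Foo]

    // alternative keeping a structured schema that UDFs can consume:
    // implicit val fooEncoder: Encoder[Foo] = Encoders.product[Foo]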

Re: spark messing up handling of native dependency code?

2017-06-05 Thread Georg Heiler
I read http://techblog.applift.com/upgrading-spark and conducted further research. I think there is some problem with the class loader. Unfortunately, so far, I did not get it to work. Georg Heiler <georg.kf.hei...@gmail.com> wrote on Sat, Jun 3, 2017 at 08:27: > When tested wi

Re: spark messing up handling of native dependency code?

2017-06-03 Thread Georg Heiler
d safe, so using > from workers is most likely a gamble. > On 06/03/2017 01:26 AM, Georg Heiler wrote: > > Hi, > > There is a weird problem with spark when handling native dependency code: > I want to use a library (JAI) with spark to parse some spatial raster > files. Unfo

spark messing up handling of native dependency code?

2017-06-02 Thread Georg Heiler
Hi, there is a weird problem with Spark when handling native dependency code: I want to use a library (JAI) with Spark to parse some spatial raster files. Unfortunately, there are some strange issues. JAI only works when running via the build tool, i.e. `sbt run`, when executed in Spark. When

Generic datasets implicit encoder missing

2017-05-29 Thread Georg Heiler
Hi, does anyone know what is wrong with using a generic https://stackoverflow.com/q/44247874/2587904 to construct a dataset? Even though the implicits are imported, they are missing. Regards Georg
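The usual fix from the linked question, as a sketch: constrain the type parameter so that the caller's implicits supply the encoder (the helper name is illustrative).

    import org.apache.spark.sql.{Dataset, Encoder}

    // without the context bound, no Encoder[T] can be summoned for an
    // unconstrained T at the definition site
    def firstPerPartition[T: Encoder](ds: Dataset[T]): Dataset[T] =
      ds.mapPartitions(_.take(1))

Call sites that have `import spark.implicits._` in scope then compile without further ceremony.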

Re: Spark Local Pipelines

2017-03-13 Thread Georg Heiler
Great idea. I see the same problem. I would suggest checking the following projects as a kick-start as well (not only MLeap): https://github.com/ucbrise/clipper and https://github.com/Hydrospheredata/mist Regards Georg Asher Krim wrote on Sun, Mar 12, 2017 at 23:21: > Hi

Re: Spark Job Performance monitoring approaches

2017-02-15 Thread Georg Heiler
I know of the following tools: https://sites.google.com/site/sparkbigdebug/home https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark https://github.com/SparkMonitor/varOne https://github.com/groupon/sparklint Chetan Khatri

Re: handling of empty partitions

2017-01-11 Thread Georg Heiler
I see that there is the possibility to improve the algorithm and make it more fault tolerant, as outlined by both of you. Could you explain a little bit more why

    +----------+------+
    |       foo|   bar|
    +----------+------+
    |2016-01-01| first|

Re: handling of empty partitions

2017-01-09 Thread Georg Heiler
Hi Liang-Chi Hsieh, strange: the "toCarry" returned is the following when I tested your code: Map(1 -> Some(FooBar(Some(2016-01-04),lastAssumingSameDate)), 0 -> Some(FooBar(Some(2016-01-02),second))) For me it always looked like: ## carry Map(2 -> None, 5 -> None, 4 ->

Re: modifications to ALS.scala

2016-12-08 Thread Georg Heiler
You can write some code, e.g. a custom estimator/transformer, in Spark's namespace. http://stackoverflow.com/a/40785438/2587904 might help you get started. Be aware that using private, i.e. Spark-internal, API might be subject to change from release to release. You definitely will require spark
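A heavily hedged sketch of that trick: code compiled into Spark's own package tree can access private[spark] / private[ml] members, at the cost of depending on internals that may change between releases (the object below is only a placeholder).

    package org.apache.spark.ml.recommendation

    // placeholder: paste and adapt the relevant parts of ALS.scala here,
    // e.g. a computeFactors variant with the modified objective
    object ModifiedALSStub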

Re: modifications to ALS.scala

2016-12-07 Thread Georg Heiler
What about putting a custom ALS implementation into Spark's namespace? harini wrote on Thu, Dec 8, 2016 at 00:01: > Hi all, I am trying to implement ALS with a slightly modified objective > function, which will require minor changes to fit -> train -> > computeFactors

Re: SparkUI via proxy

2016-11-24 Thread Georg Heiler
SSH port forwarding will help you out. marco rocchi wrote on Thu, Nov 24, 2016 at 16:33: > Hi, > I'm working with Apache Spark in order to develop my master thesis. I'm new > to spark and working with clusters. I searched through the internet but I didn't >

Re: Develop custom Estimator / Transformer for pipeline

2016-11-20 Thread Georg Heiler
16 at 7:39 AM Georg Heiler <georg.kf.hei...@gmail.com> > wrote: > > Yes, that would be really great. Thanks a lot. Holden Karau <hol...@pigscanfly.ca> wrote on Fri, Nov 18, 2016 at 07:38: > > Hi Greg, > > So while the post isn't 100% finished, if you would want

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Georg Heiler
ce. The shared > Params in SPARK-7146 are not necessary to create a custom algorithm; they > are just niceties. > > Though there aren't great docs yet, you should be able to follow existing > examples. And I'd like to add more docs in the future! > > Good luck, > Joseph >

Develop custom Estimator / Transformer for pipeline

2016-11-16 Thread Georg Heiler
Hi, I want to develop a library with custom Estimators / Transformers for Spark. So far not a lot of documentation could be found, but http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib suggests that: Generally speaking, there is no documentation because as
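A minimal Transformer sketch along the lines of that answer (class and column names are illustrative, not an official template):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.lower
    import org.apache.spark.sql.types.StructType

    class Lowercaser(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("lowercaser"))
      override def transform(ds: Dataset[_]): DataFrame =
        ds.withColumn("text", lower(ds("text")))   // assumes a `text` column
      override def transformSchema(schema: StructType): StructType = schema
      override def copy(extra: ParamMap): Lowercaser = defaultCopy(extra)
    }

An Estimator follows the same pattern, additionally implementing fit to return a fitted Model.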