Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread lukas nalezenec
Hi Koert, There is no such Jira yet. We need SPARK-23889 first. You can find some mentions in the design document inside SPARK-23889. Best regards Lukas 2018-08-06 18:34 GMT+02:00 Koert Kuipers: > i went through the jiras targeting 2.4.0 trying to find a feature where > spark would

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread shane knapp
job configured, build running: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/3/ on the bright(er) side, since i tested the crap out of this build on the new ubuntu nodes, i've set this new job to run there. :) shane On

Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates the HDFS file is corrupted. It might be an HDFS issue, so the Hadoop mailing list is a better bet: u...@hadoop.apache.org. Capture the full stack trace in the executor log. If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693` to determine

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread shane knapp
i'll get something set up quickly by hand today, and make a TODO to get the job config checked in to the jenkins job builder configs later this week. shane On Sun, Aug 5, 2018 at 7:10 AM, Sean Owen wrote: > Shane et al - could we get a test job in Jenkins to test the Scala 2.12 > build? I

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Sean Owen
... and we still have a few snags with Scala 2.12 support at https://issues.apache.org/jira/browse/SPARK-25029 There is some hope of resolving it on the order of a week, so for the moment, seems worth holding 2.4 for. On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler wrote: > Hi All, > > I'd like to

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Bryan Cutler
Hi All, I'd like to request a few days extension to the code freeze to complete the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several key improvements and bug fixes. The RC vote just passed this morning and code changes are complete in

Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread Koert Kuipers
i went through the jiras targeting 2.4.0 trying to find a feature where spark would coalesce/repartition by size (so merge small files automatically), but didn't find it. can someone point me to it? thank you. best, koert On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers wrote: > lukas, > what is
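For readers following this thread: until such a feature exists, one manual workaround is to pick the partition count from an estimated output size before writing. The sketch below is only an illustration of that arithmetic and is not the feature being asked about; the function name `partitions_for_size` and its parameters are hypothetical, and the caller has to supply the size estimate.

```python
def partitions_for_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return a partition count so each output file is roughly target_file_bytes.

    Both arguments are caller-supplied estimates (hypothetical helper,
    not part of Spark).
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if total_bytes <= 0:
        return 1  # still write at least one file
    # Integer ceiling division; avoids float rounding on large sizes.
    return -(-total_bytes // target_file_bytes)


if __name__ == "__main__":
    MiB = 1024 * 1024
    # Example: about 10 GiB of data, aiming for files of about 128 MiB.
    print(partitions_for_size(10 * 1024 * MiB, 128 * MiB))
```

The result could then be passed to `df.repartition(n)` or `df.coalesce(n)` before the write. Note that on-disk size usually differs from the estimate because of compression and file format, so the file sizes are approximate.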

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread antonkulaga
I have the same problem with gene expression data (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz, from gtex_analysis_v7/rna_seq_data/), where I have tens of thousands of genes as

[Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread makatun
It is well known that wide tables are not the most efficient way to organize data. However, sometimes we have to deal with extremely wide tables featuring thousands of columns. For example, loading data from legacy systems. *We have performed an investigation of how the number of columns affects

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-08-06 Thread Saisai Shao
Yes, there'll be an RC4, still waiting for the fix of one issue. Yuval Itzchakov 于2018年8月6日周一 下午6:10写道: > Are there any plans to create an RC4? There's an important Kafka Source > leak > fix I've merged back to the 2.3 branch. > > > > -- > Sent from:

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The root cause for a case where the closure cleaner is involved is described here: https://github.com/apache/spark/pull/22004/files#r207753682 but I am also waiting for some feedback from Lukas Rytz on why this even worked in 2.11. If it is something that needs a fix and can be fixed, we will fix it and add

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Mridul Muralidharan
A Spark user’s expectation would be that any closure which worked in 2.11 will continue to work in 2.12 (exhibiting the same behavior wrt functionality, serializability, etc). If there are behavioral changes, we will need to understand what they are - but the expectation would be that they are minimal (if

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-08-06 Thread Yuval Itzchakov
Are there any plans to create an RC4? There's an important Kafka Source leak fix I've merged back to the 2.3 branch.

Re: Why is SQLImplicits an abstract class rather than a trait?

2018-08-06 Thread assaf.mendelson
The import will work for the trait but not for anyone implementing the trait. As for not having a master, it was just an example; the full example contains some configurations. Thanks, Assaf

Handle BlockMissingException in pyspark

2018-08-06 Thread Divay Jindal
Hi, I am running pyspark in a dockerized Jupyter environment and I am constantly getting this error: ``` Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 in stage 25.0 failed

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The closure cleaner's initial purpose, AFAIK, is to clean the dependencies brought in with outer pointers (a compiler side effect). With LMFs in Scala 2.12 there are no outer pointers, which is why in the new design document we kept the implementation minimal, focusing on the return statements (it was