[ https://issues.apache.org/jira/browse/SPARK-12553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073780#comment-15073780 ]
malouke commented on SPARK-12553:
---------------------------------

Hi Sean, sorry, I am a new user and forgot to mention it, I beg your pardon. I use Parquet files, if that is your question.

> join is absolutely slow
> -----------------------
>
>          Key: SPARK-12553
>          URL: https://issues.apache.org/jira/browse/SPARK-12553
>      Project: Spark
>   Issue Type: Bug
>  Environment: cloudera cdh 5
>               centos 6
>     Reporter: malouke
>
> Hello,
> I have 7 tables to join with a left join. I did this:
>
> start = time.time()
> df_test = hc.sql("select * from rapexp201412 \
>     left join CLIENT1412 on rapNUMCNT = CLINMCLI \
>     left join SRN1412 on SRNSIRET = CLISIRET \
>     left join bodacc2014 on SRNSIREN = bodSORCS \
>     left join sinagr2014 on rapNUMCNT = sinagNUMCNT \
>     left join sinfix2014 on rapNUMCNT = sinfiNUMCNT \
>     left join sinimag2014 on rapNUMCNT = sinimNUMCNT \
>     left join up2014 on rapNUMCNT = up2NUMCNT \
>     left join upagr2014 on rapNUMCNT = upaNUMCNT \
>     left join aeveh on rapNUMCNT = aevNUMCNT \
>     left join premiums on rapNUMCNT = priNUMCNT")
> time.time() - start
>
> This takes 2.289154052734375 s.
>
> After that I do:
>
> df_test.save("/group/afra_churn_auto/raw/IARD_ENTREPRISE/data/dfc2_join/", \
>     source='parquet', mode='overwrite', partitionBy="date_part")
>
> Here is the configuration of my PySpark session:
>
> sc._conf.set(u'spark.dynamicAllocation.enabled', u'false') \
>     .set(u'spark.eventLog.enabled', u'true') \
>     .set(u'spark.shuffle.service.enabled', u'false') \
>     .set(u'spark.yarn.historyServer.address', u'http://prssncdhna02.bigplay.bigdata:18088') \
>     .set(u'spark.driver.port', u'54330') \
>     .set(u'spark.eventLog.dir', u'hdfs://bigplay-nameservice/user/spark/applicationHistory') \
>     .set(u'spark.blockManager.port', u'54332') \
>     .set(u'spark.yarn.jar', u'local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar') \
>     .set(u'spark.dynamicAllocation.executorIdleTimeout', u'60') \
>     .set(u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer') \
>     .set(u'spark.authenticate', u'false') \
>     .set(u'spark.serializer.objectStreamReset', u'100') \
>     .set(u'spark.submit.deployMode', u'client') \
>     .set(u'spark.executor.memory', u'4g') \
>     .set(u'spark.master', u'yarn-client') \
>     .set(u'spark.driver.memory', u'10g') \
>     .set(u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
>     .set(u'spark.dynamicAllocation.schedulerBacklogTimeout', u'1') \
>     .set(u'spark.executor.instances', u'8') \
>     .set(u'spark.shuffle.service.port', u'7337') \
>     .set(u'spark.fileserver.port', u'54331') \
>     .set(u'spark.app.name', u'PySparkShell') \
>     .set(u'spark.yarn.config.gatewayPath', u'/opt/cloudera/parcels') \
>     .set(u'spark.rdd.compress', u'True') \
>     .set(u'spark.yarn.config.replacementPath', u'{{HADOOP_COMMON_HOME}}/../../..') \
>     .set(u'spark.yarn.isPython', u'true') \
>     .set(u'spark.dynamicAllocation.minExecutors', u'0') \
>     .set(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
>     .set(u'spark.ui.proxyBase', u'/proxy/application_1450819756020_0615') \
>     .set(u'spark.yarn.am.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
>     .set(u'hadoop.major.version', u'yarn') \
>     .set(u'spark.version', u'1.5.2')
>
> I do not understand why the join does not work.
> Thank you in advance.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
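A ten-way left join written as one SQL string is hard to audit; a minimal sketch of building it from a list of (table, left_key, right_key) pairs instead, using the table and column names quoted in the issue. This is pure Python (no Spark needed to run it); the final `hc.sql` call is left commented out because it requires the reporter's HiveContext, and the join keys are my reading of the machine-translated query, not a verified schema.

```python
# Rebuild the multi-way left-join SQL from explicit (table, left_key, right_key)
# triples so every ON clause is visible and easy to check. Names are taken from
# the issue text; the key pairings are a reconstruction of the garbled query.
joins = [
    ("CLIENT1412",  "rapNUMCNT", "CLINMCLI"),
    ("SRN1412",     "SRNSIRET",  "CLISIRET"),
    ("bodacc2014",  "SRNSIREN",  "bodSORCS"),
    ("sinagr2014",  "rapNUMCNT", "sinagNUMCNT"),
    ("sinfix2014",  "rapNUMCNT", "sinfiNUMCNT"),
    ("sinimag2014", "rapNUMCNT", "sinimNUMCNT"),
    ("up2014",      "rapNUMCNT", "up2NUMCNT"),
    ("upagr2014",   "rapNUMCNT", "upaNUMCNT"),
    ("aeveh",       "rapNUMCNT", "aevNUMCNT"),
    ("premiums",    "rapNUMCNT", "priNUMCNT"),
]

# Start from the fact table and append one LEFT JOIN ... ON clause per triple.
sql = "select * from rapexp201412"
for table, left, right in joins:
    sql += " left join {} on {} = {}".format(table, left, right)

print(sql)
# df_test = hc.sql(sql)   # hc is the HiveContext from the issue
```

Generating the string this way also makes it trivial to drop one join at a time when bisecting which of the ten joins dominates the runtime.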