[jira] [Commented] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns?
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110603#comment-15110603 ] malouke commented on SPARK-12954: Hi Sean, where can I ask questions? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns?
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110596#comment-15110596 ] malouke commented on SPARK-12954: OK, sorry.
[jira] [Created] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns?
malouke created SPARK-12954: --- Summary: PySpark API 1.3.0: how can we partition by columns? Key: SPARK-12954 URL: https://issues.apache.org/jira/browse/SPARK-12954 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: Spark 1.3.0, Cloudera Manager, Linux, PySpark Reporter: malouke Priority: Blocker

Hi, before posting this question I tried a lot of things but found no solution. I have 9 tables and I join them in two ways:
1. df.join(df2, df.id == df2.id2, 'left_outer')
2. sqlContext.sql("select * from t1 left join t2 on id_t1 = id_t2")
After the join I want to partition the result by date:
- In PySpark 1.5.2 I tried partitionBy. When the table comes from a join of at most two tables everything is OK, but when I join more than three tables I get no result even after several hours.
- In PySpark 1.3.0 I could not find any function in the API that lets me partition by a date column.
Q: Can someone help me resolve this problem? Thank you in advance.
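For reference, a sketch of what the report is asking for (not part of the original issue): partitioned writes reached the Python API after 1.3.0 — on later releases this is df.write.partitionBy('date_part').parquet(path), and the 1.4/1.5-era form df.save(path, source='parquet', mode='overwrite', partitionBy='date_part') is what SPARK-12553 below uses. Either way, partitionBy lays the output out as one date_part=value/ directory per distinct value. The plain-Python sketch below mimics that on-disk layout without Spark; all row and column names are invented:

```python
import csv
import os
import tempfile
from itertools import groupby

# Toy rows standing in for a joined result; all names are invented.
rows = [
    {"id": 1, "date_part": "2014-12", "amount": 10},
    {"id": 2, "date_part": "2015-01", "amount": 20},
    {"id": 3, "date_part": "2014-12", "amount": 30},
]

def write_partitioned(rows, base, part_col):
    """Write rows into Hive-style part_col=value/ directories,
    mimicking the layout Spark's partitionBy produces."""
    key = lambda r: r[part_col]
    for value, group in groupby(sorted(rows, key=key), key=key):
        part_dir = os.path.join(base, "%s=%s" % (part_col, value))
        os.makedirs(part_dir)
        path = os.path.join(part_dir, "part-00000.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "amount"])
            writer.writeheader()
            for row in group:
                # The partition value lives in the directory name,
                # not inside the file, just as Spark does it.
                writer.writerow({"id": row["id"], "amount": row["amount"]})

base = tempfile.mkdtemp()
write_partitioned(rows, base, "date_part")
print(sorted(os.listdir(base)))  # ['date_part=2014-12', 'date_part=2015-01']
```

Readers that understand the col=value convention (Hive, Impala, and Spark itself) can then prune partitions by date instead of scanning the whole output.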
[jira] [Commented] (SPARK-12553) join is absolutely slow
[ https://issues.apache.org/jira/browse/SPARK-12553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073780#comment-15073780 ] malouke commented on SPARK-12553: Hi Sean, sorry, I am a new user and forgot to mention it, I beg your pardon. I use Parquet files, if that is what you were asking?
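One detail about the sc._conf.set(...) chain listed in this issue (an observation, not a fix for the slow join): configuration values must be set on a SparkConf before the SparkContext is created — mutating sc._conf on a live context has no effect — and each .set() takes the key and the value as two comma-separated arguments. A minimal sketch of the intended setup, assuming pyspark is importable and repeating only a few of the reported settings:

```python
from pyspark import SparkConf, SparkContext

# Build the configuration first; note the comma between key and value.
conf = (SparkConf()
        .set(u'spark.master', u'yarn-client')
        .set(u'spark.executor.memory', u'4g')
        .set(u'spark.executor.instances', u'8')
        .set(u'spark.driver.memory', u'10g')
        .set(u'spark.serializer',
             u'org.apache.spark.serializer.KryoSerializer'))

# Create the context only after the configuration is complete;
# changing sc._conf afterwards is too late to affect the job.
sc = SparkContext(conf=conf)
```

This is a configuration fragment: it needs a YARN cluster and a pyspark installation to actually run.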
[jira] [Created] (SPARK-12553) join is absolutely slow
malouke created SPARK-12553: --- Summary: join is absolutely slow Key: SPARK-12553 URL: https://issues.apache.org/jira/browse/SPARK-12553 Project: Spark Issue Type: Bug Environment: Cloudera CDH 5, CentOS 6 Reporter: malouke

Hello, I have 7 tables to join with left joins. I did this:

start = time.time()
df_test = hc.sql("select * from rapexp201412 \
    left join CLIENT1412 on rapNUMCNT = CLINMCLI \
    left join SRN1412 on SRNSIRET = CLISIRET \
    left join bodacc2014 on SRNSIREN = bodSORCS \
    left join sinagr2014 on rapNUMCNT = sinagNUMCNT \
    left join sinfix2014 on rapNUMCNT = sinfiNUMCNT \
    left join sinimag2014 on rapNUMCNT = sinimNUMCNT \
    left join up2014 on rapNUMCNT = up2NUMCNT \
    left join upagr2014 on rapNUMCNT = upaNUMCNT \
    left join aeveh on rapNUMCNT = aevNUMCNT \
    left join premiums on rapNUMCNT = priNUMCNT")
time.time() - start

This takes 2.289154052734375 s. After that I do:

df_test.save("/group/afra_churn_auto/raw/IARD_ENTREPRISE/data/dfc2_join/", source='parquet', mode='overwrite', partitionBy="date_part")

Here is the configuration of my PySpark session:

sc._conf.set(u'spark.dynamicAllocation.enabled', u'false') \
    .set(u'spark.eventLog.enabled', u'true') \
    .set(u'spark.shuffle.service.enabled', u'false') \
    .set(u'spark.yarn.historyServer.address', u'http://prssncdhna02.bigplay.bigdata:18088') \
    .set(u'spark.driver.port', u'54330') \
    .set(u'spark.eventLog.dir', u'hdfs://bigplay-nameservice/user/spark/applicationHistory') \
    .set(u'spark.blockManager.port', u'54332') \
    .set(u'spark.yarn.jar', u'local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar') \
    .set(u'spark.dynamicAllocation.executorIdleTimeout', u'60') \
    .set(u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer') \
    .set(u'spark.authenticate', u'false') \
    .set(u'spark.serializer.objectStreamReset', u'100') \
    .set(u'spark.submit.deployMode', u'client') \
    .set(u'spark.executor.memory', u'4g') \
    .set(u'spark.master', u'yarn-client') \
    .set(u'spark.driver.memory', u'10g') \
    .set(u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'spark.dynamicAllocation.schedulerBacklogTimeout', u'1') \
    .set(u'spark.executor.instances', u'8') \
    .set(u'spark.shuffle.service.port', u'7337') \
    .set(u'spark.fileserver.port', u'54331') \
    .set(u'spark.app.name', u'PySparkShell') \
    .set(u'spark.yarn.config.gatewayPath', u'/opt/cloudera/parcels') \
    .set(u'spark.rdd.compress', u'True') \
    .set(u'spark.yarn.config.replacementPath', u'{{HADOOP_COMMON_HOME}}/../../..') \
    .set(u'spark.yarn.isPython', u'true') \
    .set(u'spark.dynamicAllocation.minExecutors', u'0') \
    .set(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'spark.ui.proxyBase', u'/proxy/application_1450819756020_0615') \
    .set(u'spark.yarn.am.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'hadoop.major.version', u'yarn') \
    .set(u'spark.version', u'1.5.2')

I do not understand why the join does not work. Thank you in advance.
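On "I do not understand why the join does not work": a general Spark point worth noting (my observation, not from the report) is that hc.sql(...) is lazy — the ~2.3 s measured above is only the time to build the query plan, and the multi-way join actually executes when .save(...) is called, which is why the save step, not the sql step, is where the hours go. A plain-Python sketch of the same build-now/run-later pattern, using a generator instead of Spark:

```python
import time

def slow_join(rows):
    """Stand-in for an expensive join: a generator, so nothing
    runs until the result is actually consumed."""
    for r in rows:
        time.sleep(0.01)  # pretend per-row work
        yield r

t0 = time.time()
plan = slow_join(range(50))  # returns immediately, like hc.sql(...)
build_time = time.time() - t0

t0 = time.time()
result = list(plan)          # like .save(...): the work happens here
run_time = time.time() - t0

print(build_time < run_time)  # prints True: building the plan is near-instant
```

So the timing above says nothing about the join's real cost; profiling should start from the save action (or from the Spark UI stages it triggers).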