[jira] [Commented] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns

2016-01-21 Thread malouke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110603#comment-15110603
 ] 

malouke commented on SPARK-12954:
-

Hi Sean,
where can I ask questions?

> PySpark API 1.3.0: how can we partition by columns
> ---
>
> Key: SPARK-12954
> URL: https://issues.apache.org/jira/browse/SPARK-12954






[jira] [Commented] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns

2016-01-21 Thread malouke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110596#comment-15110596
 ] 

malouke commented on SPARK-12954:
-

OK, sorry.

> PySpark API 1.3.0: how can we partition by columns
> ---
>
> Key: SPARK-12954
> URL: https://issues.apache.org/jira/browse/SPARK-12954






[jira] [Created] (SPARK-12954) PySpark API 1.3.0: how can we partition by columns

2016-01-21 Thread malouke (JIRA)
malouke created SPARK-12954:
---

 Summary: PySpark API 1.3.0: how can we partition by columns
 Key: SPARK-12954
 URL: https://issues.apache.org/jira/browse/SPARK-12954
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Cloudera Manager
Linux platform
PySpark
Reporter: malouke
Priority: Blocker


Hi,
Before posting this question I tried a lot of things, but I did not find a solution.

I have 9 tables, and I join them in two ways:
 1. A first test with df.join(df2, df.id == df2.id2, 'left_outer')
 2. sqlContext.sql("select * from t1 left join t2 on id_t1 = id_t2")

After that I want to partition the result of the join by date:
- In PySpark 1.5.2 I tried partitionBy. When the table is the result of joining at most two tables, everything is OK, but when I join more than three tables I get no result even after several hours.
- In PySpark 1.3.0 I could not find any API function that lets me partition by date columns.

Q: Can someone help me resolve this problem?
Thank you in advance.
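For reference: Spark 1.4 added partitioned output to the Python DataFrame writer, while the 1.3.0 Python API has no such option, as the report says. A minimal sketch of both routes, assuming an existing SQLContext named sqlContext and a joined DataFrame with a date_part column (paths and column names are hypothetical):

df = sqlContext.parquetFile("/tmp/joined")  # hypothetical joined result

# Spark 1.4+: the DataFrameWriter API supports partitionBy directly.
df.write.partitionBy("date_part").parquet("/tmp/out")

# Spark 1.3.0: no partitionBy in the Python save API; one workaround is to
# write each date value into its own Hive-style sub-directory by hand.
for (d,) in df.select("date_part").distinct().collect():
    df.filter(df.date_part == d) \
      .save("/tmp/out/date_part=%s" % d, source="parquet", mode="overwrite")

Each sub-directory named date_part=<value> can then be read back by tools that understand Hive-style partition layouts.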







[jira] [Commented] (SPARK-12553) join is absolutely slow

2015-12-29 Thread malouke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073780#comment-15073780
 ] 

malouke commented on SPARK-12553:
-

Hi Sean,
sorry, I am a new user and forgot to include that; I beg your pardon.
I use Parquet files,
if that was your question.


> join is absolutely slow
> ---
>
> Key: SPARK-12553
> URL: https://issues.apache.org/jira/browse/SPARK-12553






[jira] [Created] (SPARK-12553) join is absolutely slow

2015-12-29 Thread malouke (JIRA)
malouke created SPARK-12553:
---

 Summary: join is absolutely slow
 Key: SPARK-12553
 URL: https://issues.apache.org/jira/browse/SPARK-12553
 Project: Spark
  Issue Type: Bug
 Environment: Cloudera CDH 5
CentOS 6
Reporter: malouke


Hello,
I have 7 tables to join with left joins; I did this:

start = time.time()
df_test = hc.sql("select * from rapexp201412 \
    left join CLIENT1412 on rapNUMCNT = CLINMCLI \
    left join SRN1412 on SRNSIRET = CLISIRET \
    left join bodacc2014 on SRNSIREN = bodSORCS \
    left join sinagr2014 on rapNUMCNT = sinagNUMCNT \
    left join sinfix2014 on rapNUMCNT = sinfiNUMCNT \
    left join sinimag2014 on rapNUMCNT = sinimNUMCNT \
    left join up2014 on rapNUMCNT = up2NUMCNT \
    left join upagr2014 on rapNUMCNT = upaNUMCNT \
    left join aeveh on rapNUMCNT = aevNUMCNT \
    left join premiums on rapNUMCNT = priNUMCNT")
time.time() - start

takes: 2.289154052734375 s
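One note on the timing above: hc.sql() only builds a logical plan; DataFrames are evaluated lazily, so the ~2.29 s measures query planning, not the join itself, which runs only when an action such as save() is triggered. A small sketch of inspecting the plan first, assuming the same HiveContext hc (the shortened query is hypothetical):

# Build the DataFrame; no cluster work happens yet.
df_test = hc.sql("select * from rapexp201412 left join CLIENT1412 on rapNUMCNT = CLINMCLI")

# explain(True) prints the logical and physical plans without executing
# the join; the shuffles start only at an action such as count() or save().
df_test.explain(True)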


After that I do:

df_test.save("/group/afra_churn_auto/raw/IARD_ENTREPRISE/data/dfc2_join/",
             source='parquet', mode='overwrite', partitionBy="date_part")




Here is the configuration of my PySpark:
sc._conf.set(u'spark.dynamicAllocation.enabled', u'false') \
    .set(u'spark.eventLog.enabled', u'true') \
    .set(u'spark.shuffle.service.enabled', u'false') \
    .set(u'spark.yarn.historyServer.address', u'http://prssncdhna02.bigplay.bigdata:18088') \
    .set(u'spark.driver.port', u'54330') \
    .set(u'spark.eventLog.dir', u'hdfs://bigplay-nameservice/user/spark/applicationHistory') \
    .set(u'spark.blockManager.port', u'54332') \
    .set(u'spark.yarn.jar', u'local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar') \
    .set(u'spark.dynamicAllocation.executorIdleTimeout', u'60') \
    .set(u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer') \
    .set(u'spark.authenticate', u'false') \
    .set(u'spark.serializer.objectStreamReset', u'100') \
    .set(u'spark.submit.deployMode', u'client') \
    .set(u'spark.executor.memory', u'4g') \
    .set(u'spark.master', u'yarn-client') \
    .set(u'spark.driver.memory', u'10g') \
    .set(u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'spark.dynamicAllocation.schedulerBacklogTimeout', u'1') \
    .set(u'spark.executor.instances', u'8') \
    .set(u'spark.shuffle.service.port', u'7337') \
    .set(u'spark.fileserver.port', u'54331') \
    .set(u'spark.app.name', u'PySparkShell') \
    .set(u'spark.yarn.config.gatewayPath', u'/opt/cloudera/parcels') \
    .set(u'spark.rdd.compress', u'True') \
    .set(u'spark.yarn.config.replacementPath', u'{{HADOOP_COMMON_HOME}}/../../..') \
    .set(u'spark.yarn.isPython', u'true') \
    .set(u'spark.dynamicAllocation.minExecutors', u'0') \
    .set(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'spark.ui.proxyBase', u'/proxy/application_1450819756020_0615') \
    .set(u'spark.yarn.am.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
    .set(u'hadoop.major.version', u'yarn') \
    .set(u'spark.version', u'1.5.2')

I do not understand why the join does not work.
Thank you in advance.
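A caveat about the configuration above: calling sc._conf.set(...) on an already-running SparkContext has no effect, because most properties are read only when the context is created. A minimal sketch of setting them up front, reusing a few values from the report (assuming a fresh Python process):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Build the configuration before the context exists; mutating sc._conf
# afterwards does not reconfigure a running application.
conf = (SparkConf()
        .setMaster("yarn-client")
        .set("spark.executor.instances", "8")
        .set("spark.executor.memory", "4g")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
# spark.driver.memory cannot be raised here in client mode, because the
# driver JVM is already running; pass --driver-memory 10g to spark-submit.

sc = SparkContext(conf=conf)
hc = HiveContext(sc)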




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org