[jira] [Reopened] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein reopened SPARK-20352:

> PySpark SparkSession initialization takes longer every iteration in a single application
> ---
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 2.1.0
> Environment: Ubuntu 12, Spark 2.1, JRE 8.0, Python 2.7
> Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core CPU, and submit the job with spark-submit my_code.py without any additional configuration parameters. In a while loop I start a SparkSession, analyze data, and then stop the context; this process repeats every 10 seconds.
>
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
>
> When the program starts, it works perfectly, but after it has been running for many hours, Spark initialization alone takes 10 or 20 seconds. So what is the problem?

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970906#comment-15970906 ] hosein commented on SPARK-20352:

I monitored the execution time of every line in my code, and this line:

{code}
spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()
{code}

takes too long (20 seconds or more) once my code has been running for hours.
[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-20352:
---
Environment:
Ubuntu 12
Spark 2.1
JRE 8.0
Python 2.7

was:
linux ubunto 12
spark 2.1
JRE 8.0
[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-20352:
---
Description:
I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core CPU, and submit the job with spark-submit my_code.py without any additional configuration parameters. In a while loop I start a SparkSession, analyze data, and then stop the context; this process repeats every 10 seconds.

{code}
while True:
    spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()
    sc = spark.sparkContext
    # some processing and analysis
    spark.stop()
{code}

When the program starts, it works perfectly, but after it has been running for many hours, Spark initialization alone takes 10 or 20 seconds. So what is the problem?

was: (the same description, with the code delimited by "#" and "###" lines instead of {code} tags)
[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-20352:
---
Environment:
linux ubunto 12
spark 2.1
JRE 8.0

was:
linux ubuntu 12
pyspark
[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
[ https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-20352:
---
Environment:
linux ubuntu 12
pyspark

was:
linux ubunto 12
pyspark
[jira] [Created] (SPARK-20352) PySpark SparkSession initialization takes longer every iteration in a single application
hosein created SPARK-20352:
---
Summary: PySpark SparkSession initialization takes longer every iteration in a single application
Key: SPARK-20352
URL: https://issues.apache.org/jira/browse/SPARK-20352
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 2.1.0
Environment: linux ubunto 12, pyspark
Reporter: hosein
Fix For: 2.1.0

I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core CPU, and submit the job with spark-submit my_code.py without any additional configuration parameters. In a while loop I start a SparkSession, analyze data, and then stop the context; this process repeats every 10 seconds.

{code}
while True:
    spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()
    sc = spark.sparkContext
    # some processing and analysis
    spark.stop()
{code}

When the program starts, it works perfectly, but after it has been running for many hours, Spark initialization alone takes 10 or 20 seconds. So what is the problem?
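The messages above all revolve around one pattern: the session is torn down and rebuilt every cycle. A common way to avoid a growing startup cost is to create the session once and reuse it across iterations, stopping it only after the loop ends. The sketch below is illustrative only, not code from the issue: run_sync_loop and analyze are hypothetical names, and with PySpark the get_session argument would be something like lambda: SparkSession.builder.appName("sync_task").getOrCreate(), which returns the already-running session on every call as long as stop() is not invoked inside the loop.

```python
import time


def run_sync_loop(get_session, analyze, interval=10, max_iterations=None):
    """Run periodic analysis against a single long-lived session.

    get_session: zero-arg callable returning the (cached) session object;
                 it is expected to return the SAME session every call.
    analyze:     callable taking the session; the per-cycle work.
    interval:    seconds to sleep between cycles.
    max_iterations: stop after this many cycles (None = run forever).
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        session = get_session()   # reuses the existing session; no rebuild
        analyze(session)
        done += 1
        if max_iterations is None or done < max_iterations:
            time.sleep(interval)
    return done
```

The point of the design is that session teardown and JVM-side context startup happen once per application rather than once per cycle; with PySpark, spark.stop() would be called a single time after run_sync_loop returns.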
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873140#comment-15873140 ] hosein commented on SPARK-19655:

I think I should not use Spark for my case...

> select count(*) , requests 1 for each row
> ---
> Key: SPARK-19655
> URL: https://issues.apache.org/jira/browse/SPARK-19655
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: hosein
> Priority: Minor
>
> When I run a select count( * ) query over JDBC and monitor the queries on the database side, I see Spark request "select 1" from the destination table. That means one "1" per row, which is not optimized.
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873131#comment-15873131 ] hosein commented on SPARK-19655:

If I want to count 100 million rows, are 100 million "1"s returned over the network just for a count?
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873121#comment-15873121 ] hosein commented on SPARK-19655:

I was surprised too :) If you have a Vertica database, you can test this snippet and monitor the queries in Vertica; in my experience, "select 1" appeared.
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873118#comment-15873118 ] hosein commented on SPARK-19655:

How can I get the count result from my Vertica table? Is there an optimized way to do that?
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein edited comment on SPARK-19655 at 2/18/17 11:11 AM:
---
I connect to Vertica by JDBC and downloaded its driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I supposed that if I give the JDBC driver jar file to Spark and define the JDBC URL in my code, Spark would work with this driver ...

was (Author: hosein_ey):
I connect to Vertica by JDBC and downloaded its driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I supposed that if I take the Spark JDBC jar file and define the JDBC URL in it, Spark works with this driver ...
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein edited comment on SPARK-19655 at 2/18/17 11:09 AM:
---
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I supposed if I take Spark JDBC jar file and define JDBC url in it, Spark works with this driver ...

was (Author: hosein_ey):
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take Spark JDBC jar file and define JDBC url in it, Spark works with this driver ...
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein edited comment on SPARK-19655 at 2/18/17 11:07 AM:
---
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take Spark JDBC jar file and define JDBC url in it, Spark works with this driver ...

was (Author: hosein_ey):
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take spark JDBC jar file and define JDBC url in it, spark works with this driver ...
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein edited comment on SPARK-19655 at 2/18/17 11:06 AM:
---
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take spark JDBC jar file and define JDBC url in it, spark works with this driver ...

was (Author: hosein_ey):
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take spark JDBC jar file and define JDBC url in spark. spark works with this driver ...
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein edited comment on SPARK-19655 at 2/18/17 11:06 AM:
---
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
I suppose if I take spark JDBC jar file and define JDBC url in spark. spark works with this driver ...

was (Author: hosein_ey):
I connect to Vertica by JDBC and downloaded it's driver from this link: https://my.vertica.com/download/vertica/client-drivers/
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873110#comment-15873110 ] hosein commented on SPARK-19655:

I connect to Vertica by JDBC and downloaded its driver from this link: https://my.vertica.com/download/vertica/client-drivers/
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873093#comment-15873093 ] hosein edited comment on SPARK-19655 at 2/18/17 10:37 AM:
---
I have a Vertica database with 100 million rows and I run this code in Spark:

{code}
df = spark.read.format("jdbc").option("url", vertica_jdbc_url).option("dbtable", 'test_table').option("user", "spark_user").option("password", "password").load()
result = df.filter(df['id'] > 100).count()
print result
{code}

I monitor the queries in Vertica, and the Spark code generates this query in Vertica:

{code}
SELECT 1 FROM test_table WHERE ("id" > 100)
{code}

This query returns about 100 million "1"s, and I think this is not suitable.

was (Author: hosein_ey): the same comment, with the generated query shown as SELECT 1 FROM test_table WHERE ("int_id" > 100)
[jira] [Comment Edited] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873093#comment-15873093 ] hosein edited comment on SPARK-19655 at 2/18/17 10:36 AM:
---
I have a Vertica database with 100 million rows and I run this code in Spark:

{code}
df = spark.read.format("jdbc").option("url", vertica_jdbc_url).option("dbtable", 'test_table').option("user", "spark_user").option("password", "password").load()
result = df.filter(df['id'] > 100).count()
print result
{code}

I monitor the queries in Vertica, and the Spark code generates this query in Vertica:

{code}
SELECT 1 FROM test_table WHERE ("int_id" > 100)
{code}

This query returns about 100 million "1"s, and I think this is not suitable.

was (Author: hosein_ey): the same text; only whitespace changed in this edit.
[jira] [Commented] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873093#comment-15873093 ] hosein commented on SPARK-19655:

I have a Vertica database with 100 million rows and I run this code in Spark:

{code}
df = spark.read.format("jdbc").option("url", vertica_jdbc_url).option("dbtable", 'test_table').option("user", "spark_user").option("password", "password").load()
result = df.filter(df['id'] > 100).count()
print result
{code}

I monitor the queries in Vertica, and the Spark code generates this query in Vertica:

{code}
SELECT 1 FROM test_table WHERE ("int_id" > 100)
{code}

This query returns about 100 million "1"s, and I think this is not suitable.
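One way to keep the aggregation on the database side is to hand Spark a derived table as the "dbtable" option, so the count runs inside Vertica and only a single row crosses the network. This is a hedged sketch, not code from the issue: the helper count_pushdown_query and the alias names cnt and subq are illustrative, and the commented Spark usage assumes the same spark.read JDBC options shown in the comment above.

```python
# Sketch of a count-pushdown workaround (hypothetical helper, not a Spark API):
# wrap the aggregate in a derived table so the database computes the count.

def count_pushdown_query(table, predicate=None):
    """Build a derived-table expression for the JDBC "dbtable" option.

    The database evaluates COUNT(*) (optionally filtered), so a single
    row comes back instead of one "1" per matching row.
    """
    where = " WHERE {0}".format(predicate) if predicate else ""
    return "(SELECT COUNT(*) AS cnt FROM {0}{1}) AS subq".format(table, where)

# With Spark it would be used roughly like this (untested sketch):
#   df = (spark.read.format("jdbc")
#         .option("url", vertica_jdbc_url)
#         .option("dbtable", count_pushdown_query("test_table", "id > 100"))
#         .option("user", "spark_user").option("password", "password")
#         .load())
#   result = df.collect()[0]["cnt"]
```

The trade-off is that the count is no longer a DataFrame-level count(): the filter must be expressed as SQL in the subquery rather than as df.filter(...), since anything outside the derived table is still evaluated by Spark.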
[jira] [Updated] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-19655:
---
Summary: select count(*) , requests 1 for each row (was: select count(*) , requests 1 foreach row)
[jira] [Updated] (SPARK-19655) select count(*) , requests 1 for each row
[ https://issues.apache.org/jira/browse/SPARK-19655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hosein updated SPARK-19655:
---
Description:
when I want query select count( * ) by JDBC and monitor queries in database side, I see spark requests: select 1 for destination table
it means 1 for each row and it is not optimized

was:
when I want query select count(*) by JDBC and monitor queries in database side, I see spark requests: select 1 for destination table
it means 1 for each row and it is not optimized
[jira] [Created] (SPARK-19655) select count(*) , requests 1 foreach row
hosein created SPARK-19655:
---
Summary: select count(*) , requests 1 foreach row
Key: SPARK-19655
URL: https://issues.apache.org/jira/browse/SPARK-19655
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: hosein
Priority: Minor

When I run a select count(*) query over JDBC and monitor the queries on the database side, I see Spark request "select 1" from the destination table. That means one "1" per row, which is not optimized.