[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400993#comment-15400993 ] Reynold Xin commented on SPARK-6305: [~srowen] looked into this in the past and he didn't get everything working. Sean - can you share more? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
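For reference, a minimal sbt sketch of the dependency surgery the description refers to: dropping the slf4j-to-log4j-1.x binding and adding the log4j 2.x jars instead. The version numbers are illustrative assumptions, and this does not address the shaded-assembly problem the ticket calls out.
{code}
// Hypothetical sbt fragment (illustrative versions): exclude the log4j 1.x
// binding that Spark pulls in and add log4j 2.x plus its slf4j binding.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "2.0.0")
    .exclude("org.slf4j", "slf4j-log4j12")  // slf4j -> log4j 1.x binding
    .exclude("log4j", "log4j"),             // log4j 1.x itself
  "org.apache.logging.log4j" % "log4j-api" % "2.6.2",
  "org.apache.logging.log4j" % "log4j-core" % "2.6.2",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"  // slf4j -> log4j 2.x
)
{code}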
[jira] [Resolved] (SPARK-16812) Open up SparkILoop.getAddedJars
[ https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16812. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Open up SparkILoop.getAddedJars > --- > > Key: SPARK-16812 > URL: https://issues.apache.org/jira/browse/SPARK-16812 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.1, 2.1.0 > > > SparkILoop.getAddedJars is a useful method to use so we can programmatically > get the list of jars added. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
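A minimal sketch of the programmatic access this opens up, assuming the method lives on the SparkILoop companion object and returns the jars passed via --jars / spark.jars (the signature here is from memory, not verified against the patch):
{code}
import org.apache.spark.repl.SparkILoop

// List the jars that were added to the REPL session.
val addedJars: Array[String] = SparkILoop.getAddedJars()
addedJars.foreach(println)
{code}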
[jira] [Commented] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400981#comment-15400981 ] Apache Spark commented on SPARK-16818: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/14427 > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Critical > Fix For: 2.1.0 > > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400966#comment-15400966 ] Reynold Xin commented on SPARK-16818: - I've merged this in master, but this still needs to be merged into branch-2.0. > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Critical > Fix For: 2.1.0 > > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16818: Fix Version/s: (was: 2.0.1) > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Critical > Fix For: 2.1.0 > > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16818. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 2.0.1 > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400942#comment-15400942 ] Xiao Li commented on SPARK-16275: - It sounds like both of you are fine with removing Hive's hash UDF. I will submit a PR to resolve it. > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400924#comment-15400924 ] Wenchen Fan commented on SPARK-16275: - It would be odd to have two hash implementations: one for Hive compatibility and one for internal usage (shuffle, bucketing, etc.). I'd like to update those values with our own hash function. > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16819) Exception in thread “main” org.apache.spark.SparkException: Application application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asmaa Ali updated SPARK-16819: --- Description: What is the reason for this exception? cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit --class SparkBWA --master yarn-cluster --deploy-mode cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb Output_ERR000589 Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.eventLog.enabled=true Adding default property: spark.driver.maxResultSize=1920m Adding default property: spark.shuffle.service.enabled=true Adding default property: spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 Adding default property: spark.sql.parquet.cacheMetadata=false Adding default property: spark.driver.memory=3840m Adding default property: spark.dynamicAllocation.maxExecutors=1 Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 Adding default property: spark.yarn.am.memoryOverhead=558 Adding default property: spark.yarn.am.memory=5586m Adding default property: spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.master=yarn-client Adding default property: spark.executor.memory=5586m Adding default property: spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.dynamicAllocation.enabled=true Adding default property: spark.executor.cores=2 Adding default property: spark.yarn.executor.memoryOverhead=558 Adding default property: spark.dynamicAllocation.minExecutors=1 Adding default property: spark.dynamicAllocation.initialExecutors=1 Adding default property: spark.akka.frameSize=512 Parsed arguments: master yarn-cluster deployMode cluster executorMemory 1500m executorCores 1 totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 1500m driverCores null driverExtraClassPath null driverExtraLibraryPath null driverExtraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar supervise false queue null numExecutors null files null pyFiles null archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip mainClass SparkBWA primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar name SparkBWA childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb Output_ERR000589] jars null packages null packagesExclusions null repositories null verbose true Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: spark.yarn.am.memoryOverhead -> 558 spark.driver.memory -> 1500m spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar spark.executor.memory -> 5586m spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 spark.eventLog.enabled -> true spark.scheduler.minRegisteredResourcesRatio -> 0.0 spark.dynamicAllocation.maxExecutors 
-> 1 spark.akka.frameSize -> 512 spark.executor.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.sql.parquet.cacheMetadata -> false spark.shuffle.service.enabled -> true spark.history.fs.logDirectory -> hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.dynamicAllocation.initialExecutors -> 1 spark.dynamicAllocation.minExecutors -> 1 spark.yarn.executor.memoryOverhead -> 558 spark.driver.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.eventLog.dir -> hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.yarn.am.memory -> 5586m spark.driver.maxResultSize -> 1920m spark.master -> yarn-client spark.dynamicAllocation.enabled -> true spark.executor.cores -> 2 Main class: org.apache.spark.deploy.yarn.Client Arguments: --name SparkBWA --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives file:/home/can
[jira] [Created] (SPARK-16819) Exception in thread “main” org.apache.spark.SparkException: Application application finished with failed status
Asmaa Ali created SPARK-16819: -- Summary: Exception in thread “main” org.apache.spark.SparkException: Application application finished with failed status Key: SPARK-16819 URL: https://issues.apache.org/jira/browse/SPARK-16819 Project: Spark Issue Type: Question Components: Streaming, YARN Reporter: Asmaa Ali What is the reason for this exception? cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit --class SparkBWA --master yarn-cluster -- deploy-mode cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb Output_ERR000589-> added --deploy-mode cluster Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.eventLog.enabled=true Adding default property: spark.driver.maxResultSize=1920m Adding default property: spark.shuffle.service.enabled=true Adding default property: spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 Adding default property: spark.sql.parquet.cacheMetadata=false Adding default property: spark.driver.memory=3840m Adding default property: spark.dynamicAllocation.maxExecutors=1 Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 Adding default property: spark.yarn.am.memoryOverhead=558 Adding default property: spark.yarn.am.memory=5586m Adding default property: spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.master=yarn-client Adding default property: spark.executor.memory=5586m Adding default property: spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.dynamicAllocation.enabled=true Adding default property: spark.executor.cores=2 Adding default property: spark.yarn.executor.memoryOverhead=558 Adding default property: spark.dynamicAllocation.minExecutors=1 Adding default property: spark.dynamicAllocation.initialExecutors=1 Adding default property: spark.akka.frameSize=512 Parsed arguments: master yarn-cluster deployMode cluster executorMemory 1500m executorCores 1 totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 1500m driverCores null driverExtraClassPath null driverExtraLibraryPath null driverExtraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar supervise false queue null numExecutors null files null pyFiles null archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip mainClass SparkBWA primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar name SparkBWA childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb Output_ERR000589- --deploy-mode cluster] jars null packages null packagesExclusions null repositories null verbose true Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: spark.yarn.am.memoryOverhead -> 558 spark.driver.memory -> 1500m spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar 
spark.executor.memory -> 5586m spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 spark.eventLog.enabled -> true spark.scheduler.minRegisteredResourcesRatio -> 0.0 spark.dynamicAllocation.maxExecutors -> 1 spark.akka.frameSize -> 512 spark.executor.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.sql.parquet.cacheMetadata -> false spark.shuffle.service.enabled -> true spark.history.fs.logDirectory -> hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.dynamicAllocation.initialExecutors -> 1 spark.dynamicAllocation.minExecutors -> 1 spark.yarn.executor.memoryOverhead -> 558 spark.driver.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.eventLog.dir -> hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.yarn.am.memory -> 5586m spark.driver.maxResul
[jira] [Commented] (SPARK-16475) Broadcast Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400881#comment-15400881 ] Apache Spark commented on SPARK-16475: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14426 > Broadcast Hint for SQL Queries > -- > > Key: SPARK-16475 > URL: https://issues.apache.org/jira/browse/SPARK-16475 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin > Attachments: BroadcastHintinSparkSQL.pdf > > > A broadcast hint is a way for users to manually annotate a query and suggest a > join method to the query optimizer. It is very useful when the query > optimizer cannot make an optimal decision with respect to join methods due to > conservativeness or a lack of proper statistics. > The DataFrame API has had a broadcast hint since Spark 1.5. However, we do not > have equivalent functionality in SQL queries. We propose adding a Hive-style > broadcast hint to Spark SQL. > For more information, please see the attached document. One note about the > doc: in addition to supporting "MAPJOIN", we should also support > "BROADCASTJOIN" and "BROADCAST" in the comment, e.g. the following should be > accepted: > {code} > SELECT /*+ MAPJOIN(b) */ ... > SELECT /*+ BROADCASTJOIN(b) */ ... > SELECT /*+ BROADCAST(b) */ ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
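For comparison, the DataFrame-side hint that has existed since Spark 1.5, next to the SQL form this ticket proposes ({{largeDF}} and {{smallDF}} are placeholder names):
{code}
import org.apache.spark.sql.functions.broadcast

// DataFrame API: mark the smaller side so the planner picks a broadcast join.
val joined = largeDF.join(broadcast(smallDF), "key")

// Proposed SQL equivalent (Hive-style comment hint):
//   SELECT /*+ MAPJOIN(b) */ * FROM a JOIN b ON a.key = b.key
{code}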
[jira] [Assigned] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16818: Assignee: Apache Spark > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Critical > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400853#comment-15400853 ] Apache Spark commented on SPARK-16818: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/14425 > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Priority: Critical > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
[ https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16818: Assignee: (was: Apache Spark) > Exchange reuse incorrectly reuses scans over different sets of partitions > - > > Key: SPARK-16818 > URL: https://issues.apache.org/jira/browse/SPARK-16818 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Eric Liang >Priority: Critical > > This happens because the file scan operator does not take into account > partition pruning in its implementation of `sameResult()`. As a result, > executions may be incorrect on self-joins over the same base file relation. > Here's a minimal test case to reproduce: > {code} > spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in > 2.0 > withTempPath { path => > val tempDir = path.getCanonicalPath > spark.range(10) > .selectExpr("id % 2 as a", "id % 3 as b", "id as c") > .write > .partitionBy("a") > .parquet(tempDir) > val df = spark.read.parquet(tempDir) > val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") > val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") > checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, > 10, 5) :: Nil) > {code} > When exchange reuse is on, the result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6| 6| > | 1| 4| 4| > | 2|10|10| > +---+--+--+ > {code} > The correct result is > {code} > +---+--+--+ > | b|sum(c)|sum(c)| > +---+--+--+ > | 0| 6|12| > | 1| 4| 8| > | 2|10| 5| > +---+--+--+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400854#comment-15400854 ] Xiao Li commented on SPARK-16275: - https://github.com/apache/hive/blob/15bdce43db4624a63be1f648e46d1f2baa1c67de/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638-L748 This is Hive's hash function. The implementation looks OK, but I might need to check it with [~cloud_fan]. Not all data types (e.g. Union) are supported, and the behavior is highly type-dependent. I am not exactly sure whether we have the same value ranges for each data type, so making sure the two always generate the same result could require a lot of test cases. > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
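For illustration, a sketch of what a native Hive-compatible hash might look like for a handful of primitive types, mirroring what the linked ObjectInspectorUtils code appears to do. The string case, assuming java.lang.String.hashCode semantics, is an unverified assumption; verify every case against the Hive source before relying on it.
{code}
// Hypothetical sketch of a native hash matching Hive for a few primitives.
// Verify each case against ObjectInspectorUtils.hashCode before use.
def hiveHash(value: Any): Int = value match {
  case null       => 0
  case b: Boolean => if (b) 1 else 0
  case i: Int     => i
  case l: Long    => (l ^ (l >>> 32)).toInt
  case d: Double  =>
    val bits = java.lang.Double.doubleToLongBits(d)
    (bits ^ (bits >>> 32)).toInt
  case s: String  => s.hashCode  // assumed: r = r * 31 + c over the characters
  case other =>
    throw new IllegalArgumentException(s"type not covered by this sketch: $other")
}
{code}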
[jira] [Created] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions
Eric Liang created SPARK-16818: -- Summary: Exchange reuse incorrectly reuses scans over different sets of partitions Key: SPARK-16818 URL: https://issues.apache.org/jira/browse/SPARK-16818 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Eric Liang Priority: Critical This happens because the file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation. Here's a minimal test case to reproduce: {code} spark.conf.set("spark.sql.exchange.reuse", true) // defaults to true in 2.0 withTempPath { path => val tempDir = path.getCanonicalPath spark.range(10) .selectExpr("id % 2 as a", "id % 3 as b", "id as c") .write .partitionBy("a") .parquet(tempDir) val df = spark.read.parquet(tempDir) val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum") val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum") checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, 10, 5) :: Nil) } {code} When exchange reuse is on, the result is {code} +---+------+------+ | b|sum(c)|sum(c)| +---+------+------+ | 0| 6| 6| | 1| 4| 4| | 2|10|10| +---+------+------+ {code} The correct result is {code} +---+------+------+ | b|sum(c)|sum(c)| +---+------+------+ | 0| 6|12| | 1| 4| 8| | 2|10| 5| +---+------+------+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
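Conceptually, the fix is to make the scan's equivalence check partition-aware. A hedged illustration with stand-in types, not the code from the actual PR:
{code}
// Illustrative only: fold partition filters into what sameResult() compares,
// so scans over different partition sets are no longer treated as equal.
case class FileSourceScan(
    relationId: String,               // stand-in for the real relation object
    partitionFilters: Seq[String]) {  // stand-in for Seq[Expression]

  // Scans are interchangeable only if they read the same relation AND
  // the same set of partitions.
  def sameResult(other: FileSourceScan): Boolean =
    relationId == other.relationId &&
      partitionFilters.toSet == other.partitionFilters.toSet
}
{code}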
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400849#comment-15400849 ] Xiao Li commented on SPARK-16275: - Let me check it. Thanks! > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400846#comment-15400846 ] Reynold Xin commented on SPARK-16275: - How difficult would it be to provide a native hash implementation that returns the same result? If it is difficult, I'm fine with us updating all of those to the values returned by our own native hash function. > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400844#comment-15400844 ] Xiao Li commented on SPARK-16275: - Yeah, many queries are using it. Below is the list: auto_join_nulls auto_join0 auto_join1 auto_join2 auto_join3 auto_join4 auto_join5 auto_join6 auto_join7 auto_join8 auto_join9 auto_join10 auto_join11 auto_join12 auto_join13 auto_join14 auto_join14_hadoop20 auto_join15 auto_join17 auto_join18 auto_join19 auto_join20 auto_join22 auto_join25 auto_join30 auto_join31 correlationoptimizer1 correlationoptimizer2 correlationoptimizer3 correlationoptimizer4 multiMapJoin1 orc_dictionary_threshold udf_hash > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400783#comment-15400783 ] Reynold Xin commented on SPARK-16275: - What do we use Hive's hash function for? Are there queries in the Hive compatibility suite that are using it? > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16817) Enable storing of shuffle data in Alluxio
Tim Bisson created SPARK-16817: -- Summary: Enable storing of shuffle data in Alluxio Key: SPARK-16817 URL: https://issues.apache.org/jira/browse/SPARK-16817 Project: Spark Issue Type: New Feature Reporter: Tim Bisson If one is using Alluxio for storage, it would also be useful if Spark could store shuffle spill data in Alluxio. For example: spark.local.dir="alluxio://host:port/path" Several users on the Alluxio mailing list have asked for this feature: https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/90pRZWRVi0s/mgLWLS5aAgAJ https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/s9H93PnDebw/v_1_FMjR7vEJ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400751#comment-15400751 ] snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:43 PM: - Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started. Is this going to be a Scala, Java, or Python codebase? Any updates from the Uber community? was (Author: snehil.w): Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started. Any updates from the Uber community? > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16518) Schema Compatibility of Parquet Data Source
[ https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400755#comment-15400755 ] Chanh Le edited comment on SPARK-16518 at 7/30/16 5:38 PM: --- Is there a patch for this? Why didn't we catch this case before the release? If I change int to bigint it's fine, but if I use int it throws the error. {code} CREATE EXTERNAL TABLE os (os_id bigint, os_name String) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/OS'; {code} 0: jdbc:hive2://master1:1> select * from os limit 1; +--------+----------+ | os_id | os_name | +--------+----------+ | 15 | Solaris | +--------+----------+ 1 row selected (0.514 seconds) {code} CREATE EXTERNAL TABLE os (os_id int, os_name String) STORED AS PARQUET LOCATION 'alluxio://master2:19998/etl_info/OS'; {code} -> throws the same error. was (Author: giaosuddau): Is there a patch for this? I'm hitting this error too. > Schema Compatibility of Parquet Data Source > --- > > Key: SPARK-16518 > URL: https://issues.apache.org/jira/browse/SPARK-16518 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are not checking the schema compatibility. Different file > formats behave differently. This JIRA just summarizes what I observed for > parquet data source tables. > *Scenario 1 Data type mismatch*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int, col2 int)}} > *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we > got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most > recent failure: > Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62) > {noformat} > *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we > got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): > org.apache.spark.SparkException: > Failed merging schema of file > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-0-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet: > root > |-- a: integer (nullable = false) > |-- b: string (nullable = true) > {noformat} > *Scenario 2 More columns in append dataset*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int, col2 string, col3 int)}} > *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema > of the resultset is {{(col1 int, col2 string)}}. > *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of > the resultset is {{(col1 int, col2 string, col3 int)}}. > *Scenario 3 Less columns in append dataset*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int)}} >*Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the > schema of the resultset is {{(col1 int, col2 string)}}. >*Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema > of the resultset is {{(col1 int)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16275) Implement all the Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400756#comment-15400756 ] Xiao Li commented on SPARK-16275: - [~rxin] What is the plan for {{hash}}? If we use our version, it breaks a lot of test cases in {{HiveCompatibilitySuite}}. To resolve the failing test cases, we can migrate them into a separate test suite based on our hash function. This is just manual work. Do you think this is OK? Thanks! > Implement all the Hive fallback functions > - > > Key: SPARK-16275 > URL: https://issues.apache.org/jira/browse/SPARK-16275 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > As of Spark 2.0, Spark falls back to Hive for only the following built-in > functions: > {code} > "elt", "hash", "java_method", "histogram_numeric", > "map_keys", "map_values", > "parse_url", "percentile", "percentile_approx", "reflect", "sentences", > "stack", "str_to_map", > "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", > "xpath_long", > "xpath_number", "xpath_short", "xpath_string", > // table generating function > "inline", "posexplode" > {code} > The goal of the ticket is to implement all of these in Spark so we don't need > to fall back into Hive's UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400751#comment-15400751 ] snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:28 PM: - Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started, and to find out where we are right now with this feature design. Any updates from the Uber community? was (Author: snehil.w): Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started, and to find out where we are right now with this feature design. > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16518) Schema Compatibility of Parquet Data Source
[ https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400755#comment-15400755 ] Chanh Le commented on SPARK-16518: -- Is there a patch for this? I'm hitting this error too. > Schema Compatibility of Parquet Data Source > --- > > Key: SPARK-16518 > URL: https://issues.apache.org/jira/browse/SPARK-16518 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are not checking the schema compatibility. Different file > formats behave differently. This JIRA just summarizes what I observed for > parquet data source tables. > *Scenario 1 Data type mismatch*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int, col2 int)}} > *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we > got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most > recent failure: > Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62) > {noformat} > *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we > got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): > org.apache.spark.SparkException: > Failed merging schema of file > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-0-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet: > root > |-- a: integer (nullable = false) > |-- b: string (nullable = true) > {noformat} > *Scenario 2 More columns in append dataset*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int, col2 string, col3 int)}} > *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema > of the resultset is {{(col1 int, col2 string)}}. > *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of > the resultset is {{(col1 int, col2 string, col3 int)}}. > *Scenario 3 Less columns in append dataset*: > The existing schema is {{(col1 int, col2 string)}} > The schema of appending dataset is {{(col1 int)}} >*Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the > schema of the resultset is {{(col1 int, col2 string)}}. >*Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema > of the resultset is {{(col1 int)}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
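The mergeSchema behavior in the quoted description can be reproduced with a few lines in spark-shell (the path below is a placeholder):
{code}
val path = "/tmp/spark-16518-demo"  // placeholder location
spark.range(5).selectExpr("id as col1", "cast(id as string) as col2")
  .write.mode("overwrite").parquet(path)
spark.range(5).selectExpr("id as col1", "cast(id as string) as col2", "id as col3")
  .write.mode("append").parquet(path)

// Without merging, the extra column may be dropped from the resolved schema.
spark.read.parquet(path).printSchema()
// With merging, col3 appears (null for rows from the older files).
spark.read.option("mergeSchema", "true").parquet(path).printSchema()
{code}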
[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400751#comment-15400751 ] snehil suresh wakchaure edited comment on SPARK-5992 at 7/30/16 5:31 PM: - Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started. Any updates from the Uber community? was (Author: snehil.w): Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started, and to find out where we are right now with this feature design. Any updates from the Uber community? > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400751#comment-15400751 ] snehil suresh wakchaure commented on SPARK-5992: Hello, just curious to know if I can contribute to this project too, although I am new at it. I can use some pointers to get started, and to find out where we are right now with this feature design. > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16800) Fix Java Examples that throw exception
[ https://issues.apache.org/jira/browse/SPARK-16800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16800. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14405 [https://github.com/apache/spark/pull/14405] > Fix Java Examples that throw exception > -- > > Key: SPARK-16800 > URL: https://issues.apache.org/jira/browse/SPARK-16800 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Bryan Cutler >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Some Java examples fail to run due to an exception thrown when using > mllib.linalg.Vectors instead of ml.linalg.Vectors. Also, some have incorrect > data types defined in the schema that cause an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16800) Fix Java Examples that throw exception
[ https://issues.apache.org/jira/browse/SPARK-16800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16800: -- Assignee: Bryan Cutler > Fix Java Examples that throw exception > -- > > Key: SPARK-16800 > URL: https://issues.apache.org/jira/browse/SPARK-16800 > Project: Spark > Issue Type: Sub-task > Components: Examples, ML >Affects Versions: 2.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Some Java examples fail to run due to an exception thrown when using > mllib.linalg.Vectors instead of ml.linalg.Vectors. Also, some have incorrect > data types defined in the schema that cause an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
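[Editor's note] For reference, the fix boils down to importing the DataFrame-based linear algebra package; a hedged Scala sketch (the Java examples use the analogous package names):
{code}
// DataFrame-based spark.ml estimators in 2.0 expect the new vector classes:
import org.apache.spark.ml.linalg.Vectors
// ...not the RDD-based ones, which fail at runtime with a type mismatch:
// import org.apache.spark.mllib.linalg.Vectors

val features = Vectors.dense(1.0, 2.0, 3.0)  // works with spark.ml pipelines
{code}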
[jira] [Updated] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16696: -- Assignee: Weichen Xu Affects Version/s: (was: 2.1.0) Priority: Minor (was: Major) > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can be released promptly, even on the driver side. > Currently, several algorithms in ML, such as KMeans and Word2Vec, have this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16696) unused broadcast variables should call destroy instead of unpersist
[ https://issues.apache.org/jira/browse/SPARK-16696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16696. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14333 [https://github.com/apache/spark/pull/14333] > unused broadcast variables should call destroy instead of unpersist > --- > > Key: SPARK-16696 > URL: https://issues.apache.org/jira/browse/SPARK-16696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.1 >Reporter: Weichen Xu > Fix For: 2.1.0 > > Original Estimate: 1m > Remaining Estimate: 1m > > Unused broadcast variables should call destroy() instead of unpersist() so > that the memory can be released promptly, even on the driver side. > Currently, several algorithms in ML, such as KMeans and Word2Vec, have this > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
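[Editor's note] A minimal sketch of the difference, assuming an existing {{SparkContext}} named {{sc}} and a small hypothetical lookup map:
{code}
val lookup = Map(1 -> "a", 2 -> "b")  // hypothetical data
val bc = sc.broadcast(lookup)
val hits = sc.parallelize(1 to 3).map(i => bc.value.get(i)).collect()

// unpersist() only drops cached copies on the executors; the driver keeps
// the value so the variable can be re-broadcast later. destroy() also
// releases the driver-side copy, after which the broadcast is unusable.
bc.destroy()
{code}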
[jira] [Commented] (SPARK-16816) Add api to get JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400681#comment-15400681 ] Apache Spark commented on SPARK-16816: -- User 'phalodi' has created a pull request for this issue: https://github.com/apache/spark/pull/14421 > Add api to get JavaSparkContext from SparkSession > - > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Priority: Minor > Labels: patch > Fix For: 2.0.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > With this improvement, the user can get the JavaSparkContext directly from > the SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16816) Add api to get JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16816: Assignee: Apache Spark > Add api to get JavaSparkContext from SparkSession > - > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Assignee: Apache Spark >Priority: Minor > Labels: patch > Fix For: 2.0.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > With this improvement, the user can get the JavaSparkContext directly from > the SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16816) Add api to get JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16816: Assignee: (was: Apache Spark) > Add api to get JavaSparkContext from SparkSession > - > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Priority: Minor > Labels: patch > Fix For: 2.0.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > With this improvement, the user can get the JavaSparkContext directly from > the SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16816) Add api to get JavaSparkContext from SparkSession
sandeep purohit created SPARK-16816: --- Summary: Add api to get JavaSparkContext from SparkSession Key: SPARK-16816 URL: https://issues.apache.org/jira/browse/SPARK-16816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: sandeep purohit Priority: Minor Fix For: 2.0.0 With this improvement, the user can get the JavaSparkContext directly from the SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
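[Editor's note] Until such an API lands, the caller-side workaround is a one-liner; a sketch in Scala, assuming a {{SparkSession}} named {{spark}}:
{code}
import org.apache.spark.api.java.JavaSparkContext

// Wrap the underlying SparkContext yourself; this is presumably what the
// proposed convenience API would do internally.
val jsc = new JavaSparkContext(spark.sparkContext)
{code}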
[jira] [Commented] (SPARK-14204) [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-14204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400668#comment-15400668 ] Apache Spark commented on SPARK-14204: -- User 'mchalek' has created a pull request for this issue: https://github.com/apache/spark/pull/14420 > [SQL] Failure to register URL-derived JDBC driver on executors in cluster mode > -- > > Key: SPARK-14204 > URL: https://issues.apache.org/jira/browse/SPARK-14204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Kevin McHale >Assignee: Kevin McHale > Labels: JDBC, SQL > Fix For: 1.6.2 > > > DataFrameReader JDBC methods throw an IllegalStateException when: > 1. the JDBC driver is contained in a user-provided jar, and > 2. the user does not specify which driver to use, but rather allows spark > to determine the driver from the JDBC URL. > This broke some of our database ETL jobs at @premisedata when we upgraded > from 1.6.0 to 1.6.1. > I have tracked the problem down to a regression introduced in the fix for > SPARK-12579: > https://github.com/apache/spark/commit/7f37c1e45d52b7823d566349e2be21366d73651f#diff-391379a5ec51082e2ae1209db15c02b3R53 > The issue is that DriverRegistry.register is not called on the executors for > a JDBC driver that is derived from the JDBC path. > The problem can be demonstrated within spark-shell, provided you're in > cluster mode and you've deployed a JDBC driver (e.g. postgresql.Driver) via > the --jars argument: > {code} > import > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.createConnectionFactory > val factory = > createConnectionFactory("jdbc:postgresql://whatever.you.want/database?user=user&password=password", > new java.util.Properties) > sc.parallelize(1 to 100).foreach { _ => factory() } // throws exception > {code} > A sufficient fix is to apply DriverRegistry.register to the `driverClass` > variable, rather than to `userSpecifiedDriverClass`, at the code link > provided above. I will submit a PR for this shortly. > In the meantime, a temporary workaround is to manually specify the JDBC > driver class in the Properties object passed to DataFrameReader.jdbc, or in > the options used in other entry points, which will force the executors to > register the class properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
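[Editor's note] To make the temporary workaround concrete, a hedged sketch (credentials and table name are placeholders carried over from the example URL above):
{code}
// Pin the driver class explicitly so the executors register it, rather
// than relying on it being inferred from the JDBC URL (the code path
// that regressed in 1.6.1).
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
props.setProperty("driver", "org.postgresql.Driver")

val df = sqlContext.read.jdbc(
  "jdbc:postgresql://whatever.you.want/database", "some_table", props)
{code}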
[jira] [Comment Edited] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400658#comment-15400658 ] Jay Teguh Wijaya Purwanto edited comment on SPARK-16700 at 7/30/16 12:34 PM: - Using a `Row` object, but with multiple struct types, also returns a similar error: {code} _struct = [ SparkTypes.StructField('string_field', SparkTypes.StringType(), True), SparkTypes.StructField('long_field', SparkTypes.LongType(), True), SparkTypes.StructField('double_field', SparkTypes.DoubleType(), True) ] _rdd = sc.parallelize([Row(string_field='1', long_field=1, double_field=1.1)]) ## Both methods do not work: # _schema = SparkTypes.StructType() # for _s in _struct: # _schema.add(_s) _schema = SparkTypes.StructType(_struct) _df = sqlContext.createDataFrame(_rdd, schema=_schema) _df.take(1) {code} Returned error: {code} DoubleType can not accept object '1' in type {code} was (Author: jaycode): Using a `Row` object, but with multiple struct types, also returns a similar error: ``` _struct = [ SparkTypes.StructField('string_field', SparkTypes.StringType(), True), SparkTypes.StructField('long_field', SparkTypes.LongType(), True), SparkTypes.StructField('double_field', SparkTypes.DoubleType(), True) ] _rdd = sc.parallelize([Row(string_field='1', long_field=1, double_field=1.1)]) ## Both methods do not work: # _schema = SparkTypes.StructType() # for _s in _struct: # _schema.add(_s) _schema = SparkTypes.StructType(_struct) _df = sqlContext.createDataFrame(_rdd, schema=_schema) _df.take(1) ``` Returned error: ``` DoubleType can not accept object '1' in type ``` > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5. > StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended, but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change, and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400658#comment-15400658 ] Jay Teguh Wijaya Purwanto commented on SPARK-16700: --- Using a `Row` object, but with multiple struct types, also returns a similar error: {code} _struct = [ SparkTypes.StructField('string_field', SparkTypes.StringType(), True), SparkTypes.StructField('long_field', SparkTypes.LongType(), True), SparkTypes.StructField('double_field', SparkTypes.DoubleType(), True) ] _rdd = sc.parallelize([Row(string_field='1', long_field=1, double_field=1.1)]) ## Both methods do not work: # _schema = SparkTypes.StructType() # for _s in _struct: # _schema.add(_s) _schema = SparkTypes.StructType(_struct) _df = sqlContext.createDataFrame(_rdd, schema=_schema) _df.take(1) {code} Returned error: {code} DoubleType can not accept object '1' in type {code} > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5. > StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended, but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change, and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16694) Use for/foreach rather than map for Unit expressions whose side effects are required
[ https://issues.apache.org/jira/browse/SPARK-16694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16694. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14332 [https://github.com/apache/spark/pull/14332] > Use for/foreach rather than map for Unit expressions whose side effects are > required > > > Key: SPARK-16694 > URL: https://issues.apache.org/jira/browse/SPARK-16694 > Project: Spark > Issue Type: Improvement > Components: Examples, MLlib, Spark Core, SQL, Streaming >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.1.0 > > > {{map}} is misused in many places where {{foreach}} is intended. This caused > a bug in https://issues.apache.org/jira/browse/SPARK-16664 and might be a > latent bug elsewhere; it's also easy to find with IJ inspections. Worth > patching up. > To illustrate the general problem, {{map}} happens to work in Scala where the > collection isn't lazy, but will fail to execute the code when it is. {{map}} > also causes a collection of {{Unit}} to be created pointlessly. > {code} > scala> val foo = Seq(1,2,3) > foo: Seq[Int] = List(1, 2, 3) > scala> foo.map(println) > 1 > 2 > 3 > res0: Seq[Unit] = List((), (), ()) > scala> foo.view.map(println) > res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) > scala> foo.view.foreach(println) > 1 > 2 > 3 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16797) Repartition call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-16797. - Resolution: Duplicate Fix Version/s: (was: 2.0.0) > Repartition call w/ 0 partitions drops data > -- > > Key: SPARK-16797 > URL: https://issues.apache.org/jira/browse/SPARK-16797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Bryan Jeffrey >Priority: Minor > Labels: easyfix > > When you call RDD.repartition(0) or DStream.repartition(0), the input data is > silently dropped. This should not fail silently; instead, an exception should > be thrown to alert the user to the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16797) Repartition call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-16797: --- > Repartition call w/ 0 partitions drops data > -- > > Key: SPARK-16797 > URL: https://issues.apache.org/jira/browse/SPARK-16797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Bryan Jeffrey >Priority: Minor > Labels: easyfix > > When you call RDD.repartition(0) or DStream.repartition(0), the input data is > silently dropped. This should not fail silently; instead, an exception should > be thrown to alert the user to the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16815) Dataset[List[T]] leads to ArrayStoreException
TobiasP created SPARK-16815: --- Summary: Dataset[List[T]] leads to ArrayStoreException Key: SPARK-16815 URL: https://issues.apache.org/jira/browse/SPARK-16815 Project: Spark Issue Type: Bug Components: SQL Reporter: TobiasP Priority: Minor {noformat} scala> spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect java.lang.ArrayStoreException: scala.collection.mutable.WrappedArray$ofRef at scala.collection.mutable.ArrayBuilder$ofRef.$plus$eq(ArrayBuilder.scala:87) at scala.collection.mutable.ArrayBuilder$ofRef.$plus$eq(ArrayBuilder.scala:56) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2218) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2568) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2217) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:) at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2581) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:) at org.apache.spark.sql.Dataset.collect(Dataset.scala:2198) ... 48 elided {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16797) Repartition call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400542#comment-15400542 ] Dongjoon Hyun commented on SPARK-16797: --- Oh, I see. Never mind, [~bjeffrey]. > Repartition call w/ 0 partitions drops data > -- > > Key: SPARK-16797 > URL: https://issues.apache.org/jira/browse/SPARK-16797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2 >Reporter: Bryan Jeffrey >Priority: Minor > Labels: easyfix > Fix For: 2.0.0 > > > When you call RDD.repartition(0) or DStream.repartition(0), the input data is > silently dropped. This should not fail silently; instead, an exception should > be thrown to alert the user to the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
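[Editor's note] A small sketch of the failure mode and a caller-side guard (the eventual fix would validate inside Spark itself; {{safeRepartition}} is a hypothetical helper, and {{sc}} an existing {{SparkContext}}):
{code}
val rdd = sc.parallelize(1 to 100)
rdd.repartition(0).count()  // returns 0: every input row is silently gone

// Defensive stand-in until Spark rejects zero partitions itself:
def safeRepartition[T](rdd: org.apache.spark.rdd.RDD[T], n: Int) = {
  require(n > 0, s"Number of partitions must be positive, got $n")
  rdd.repartition(n)
}
{code}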
[jira] [Commented] (SPARK-16807) Optimize some ABS() statements
[ https://issues.apache.org/jira/browse/SPARK-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400532#comment-15400532 ] Kazuaki Ishizaki commented on SPARK-16807: -- This is interesting if we can ensure {{x - y}} is not {{infinite}} or {{NaN}}. Since I am not familiar with SQL, I do not know how we can ensure this condition in Spark SQL. In general, this generated code seems to handle the case where {{filter_value6}} is {{infinite}} or {{NaN}}. It would be good to read [abs(float)|https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#abs(float)] and [nanSafeCompareFloats|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1615]. > Optimize some ABS() statements > -- > > Key: SPARK-16807 > URL: https://issues.apache.org/jira/browse/SPARK-16807 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sylvain Zimmer >Priority: Minor > > I'm not a Catalyst expert, but I think some use cases for the ABS() function > could generate simpler code. > This is the code generated when doing something like {{ABS(x - y) > 0}} or > {{ABS(x - y) = 0}} in Spark SQL: > {code} > /* 267 */ float filter_value6 = -1.0f; > /* 268 */ filter_value6 = agg_value27 - agg_value32; > /* 269 */ float filter_value5 = -1.0f; > /* 270 */ filter_value5 = (float)(java.lang.Math.abs(filter_value6)); > /* 271 */ > /* 272 */ boolean filter_value4 = false; > /* 273 */ filter_value4 = > org.apache.spark.util.Utils.nanSafeCompareFloats(filter_value5, 0.0f) > 0; > /* 274 */ if (!filter_value4) continue; > {code} > Maybe it could all be simplified to something like this? > {code} > filter_value4 = (agg_value27 != agg_value32) > {code} > (Of course you could write {{x != y}} directly in the SQL query, but the > {{0}} in my example could be a configurable threshold, not something you can > hardcode) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
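[Editor's note] To illustrate the concern, a small Scala sketch of why {{ABS(x - y) > 0}} and {{x != y}} can disagree once NaN enters, using {{java.lang.Float.compare}} as a stand-in for Spark's NaN-aware comparison (which likewise orders NaN above every other value):
{code}
val x = Float.PositiveInfinity
val y = Float.PositiveInfinity

x != y                                 // false: the values are equal
val d = math.abs(x - y)                // Inf - Inf = NaN
d > 0                                  // false under plain IEEE comparison
java.lang.Float.compare(d, 0.0f) > 0   // true: NaN sorts above 0.0f
// So rewriting ABS(x - y) > 0 to x != y is only safe when x - y is
// guaranteed finite (no infinite operands, hence no NaN).
{code}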
[jira] [Commented] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400512#comment-15400512 ] Apache Spark commented on SPARK-16814: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/14419 > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16814: Assignee: Apache Spark > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk >Assignee: Apache Spark > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
[ https://issues.apache.org/jira/browse/SPARK-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16814: Assignee: (was: Apache Spark) > Fix deprecated use of ParquetWriter in Parquet test suites > -- > > Key: SPARK-16814 > URL: https://issues.apache.org/jira/browse/SPARK-16814 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > > Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16814) Fix deprecated use of ParquetWriter in Parquet test suites
holdenk created SPARK-16814: --- Summary: Fix deprecated use of ParquetWriter in Parquet test suites Key: SPARK-16814 URL: https://issues.apache.org/jira/browse/SPARK-16814 Project: Spark Issue Type: Sub-task Components: SQL Reporter: holdenk Replace deprecated ParquetWriter with the new builders -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
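[Editor's note] A hedged sketch of the migration, using parquet-avro's builder as one example of the post-deprecation style (the test suites may instead subclass {{ParquetWriter.Builder}} directly; the path and schema below are hypothetical):
{code}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"r","fields":[{"name":"id","type":"long"}]}""")

// Deprecated style: new AvroParquetWriter[GenericRecord](path, schema, ...)
// Builder style:
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/tmp/demo.parquet"))  // hypothetical path
  .withSchema(schema)
  .build()
{code}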