[jira] [Closed] (SPARK-15216) Add a new Dataset API explainCodegen
[ https://issues.apache.org/jira/browse/SPARK-15216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-15216. --- Resolution: Won't Fix Closing this because I don't think it makes sense to have a top level method for something so developer facing. Otherwise the public APIs will be littered with internal developer facing methods. > Add a new Dataset API explainCodegen > > > Key: SPARK-15216 > URL: https://issues.apache.org/jira/browse/SPARK-15216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS().groupByKey(_._1).agg( > expr("avg(_2)").as[Double], > ComplexResultAgg.toColumn) > ds.explainCodegen() > {noformat} > Reading codegen output is important for developers to debug. So far, > outputting codegen results is available in the SQL interface by `EXPLAIN > CODEGEN`. However, in the Dataset/DataFrame APIs, we face the same issue. We > can add a new API for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
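Although the top-level Dataset method was declined, the codegen output discussed above can still be inspected. A minimal Scala sketch, assuming a Spark 2.0 `SparkSession` named `spark` (the `debug` package is an internal, developer-facing API and may change between releases):

```scala
import spark.implicits._
import org.apache.spark.sql.execution.debug._

val ds = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDS()

// Workaround 1: the SQL interface already supports EXPLAIN CODEGEN.
ds.createOrReplaceTempView("t")
spark.sql("EXPLAIN CODEGEN SELECT _1, avg(_2) FROM t GROUP BY _1")
  .show(truncate = false)

// Workaround 2: the developer-facing debug package prints the generated
// code for a query directly (internal API, subject to change).
ds.groupBy($"_1").avg("_2").debugCodegen()
```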
[jira] [Resolved] (SPARK-15210) Add missing @DeveloperApi annotation in sql.types
[ https://issues.apache.org/jira/browse/SPARK-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15210. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add missing @DeveloperApi annotation in sql.types > - > > Key: SPARK-15210 > URL: https://issues.apache.org/jira/browse/SPARK-15210 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > @DeveloperApi annotations for {{AbstractDataType}}, {{MapType}} and > {{UserDefinedType}} are missing.
[jira] [Updated] (SPARK-14341) Throw exception on unsupported Create/Drop Macro DDL commands
[ https://issues.apache.org/jira/browse/SPARK-14341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14341: Component/s: SQL > Throw exception on unsupported Create/Drop Macro DDL commands > - > > Key: SPARK-14341 > URL: https://issues.apache.org/jira/browse/SPARK-14341 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Bo Meng >Assignee: Bo Meng >Priority: Minor > > According to > [SPARK-14123|https://issues.apache.org/jira/browse/SPARK-14123], we need to > throw an exception for Create/Drop Macro DDL.
[jira] [Assigned] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10653: Assignee: Apache Spark > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed.
[jira] [Assigned] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10653: Assignee: (was: Apache Spark) > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed.
[jira] [Resolved] (SPARK-14499) Add tests to make sure drop partitions of an external table will not delete data
[ https://issues.apache.org/jira/browse/SPARK-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-14499. - Resolution: Resolved Fix Version/s: 2.0.0 > Add tests to make sure drop partitions of an external table will not delete > data > > > Key: SPARK-14499 > URL: https://issues.apache.org/jira/browse/SPARK-14499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > Fix For: 2.0.0 > > > This is a follow-up of SPARK-14132 > (https://github.com/apache/spark/pull/12220#issuecomment-207625166) to > address https://github.com/apache/spark/pull/12220#issuecomment-207612627.
[jira] [Reopened] (SPARK-14499) Add tests to make sure drop partitions of an external table will not delete data
[ https://issues.apache.org/jira/browse/SPARK-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-14499: - Reopening to change the status. > Add tests to make sure drop partitions of an external table will not delete > data > > > Key: SPARK-14499 > URL: https://issues.apache.org/jira/browse/SPARK-14499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li > Fix For: 2.0.0 > > > This is a follow-up of SPARK-14132 > (https://github.com/apache/spark/pull/12220#issuecomment-207625166) to > address https://github.com/apache/spark/pull/12220#issuecomment-207612627.
[jira] [Commented] (SPARK-10653) Remove unnecessary things from SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276764#comment-15276764 ] Apache Spark commented on SPARK-10653: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/12970 > Remove unnecessary things from SparkEnv > --- > > Key: SPARK-10653 > URL: https://issues.apache.org/jira/browse/SPARK-10653 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > As of the writing of this message, there are at least two things that can be > removed from it: > {code} > @DeveloperApi > class SparkEnv ( > val executorId: String, > private[spark] val rpcEnv: RpcEnv, > val serializer: Serializer, > val closureSerializer: Serializer, > val cacheManager: CacheManager, > val mapOutputTracker: MapOutputTracker, > val shuffleManager: ShuffleManager, > val broadcastManager: BroadcastManager, > val blockTransferService: BlockTransferService, // this one can go > val blockManager: BlockManager, > val securityManager: SecurityManager, > val httpFileServer: HttpFileServer, > val sparkFilesDir: String, // this one maybe? It's only used in 1 place. > val metricsSystem: MetricsSystem, > val shuffleMemoryManager: ShuffleMemoryManager, > val executorMemoryManager: ExecutorMemoryManager, // this can go > val outputCommitCoordinator: OutputCommitCoordinator, > val conf: SparkConf) extends Logging { > ... > } > {code} > We should avoid adding to this infinite list of things in SparkEnv's > constructors if they're not needed.
[jira] [Resolved] (SPARK-15166) Move hive-specific conf setting from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15166. --- Resolution: Fixed Fix Version/s: 2.0.0 > Move hive-specific conf setting from SparkSession > - > > Key: SPARK-15166 > URL: https://issues.apache.org/jira/browse/SPARK-15166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Minor > Fix For: 2.0.0 > >
[jira] [Resolved] (SPARK-15199) Disallow Dropping Built-in Functions
[ https://issues.apache.org/jira/browse/SPARK-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-15199. - Resolution: Resolved Fix Version/s: 2.0.0 > Disallow Dropping Built-in Functions > > > Key: SPARK-15199 > URL: https://issues.apache.org/jira/browse/SPARK-15199 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > As in Hive and the major RDBMSs, built-in functions are not allowed > to be dropped. In the current implementation, users can drop the built-in > functions. However, after dropping the built-in functions, users are unable > to add them back.
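The intended behavior can be sketched with a simple guard. This is illustrative only: the `builtinFunctions` set and `dropFunction` method are hypothetical stand-ins, not Spark's actual catalog internals.

```scala
// Illustrative sketch only: `builtinFunctions` and `dropFunction` are
// hypothetical stand-ins, not Spark's actual catalog code.
val builtinFunctions: Set[String] = Set("abs", "avg", "concat", "max", "min")

def dropFunction(name: String): Unit = {
  // Refuse to drop anything registered as a built-in; once dropped,
  // users would have no way to re-register it.
  require(!builtinFunctions.contains(name.toLowerCase),
    s"Cannot drop built-in function '$name'")
  // ... proceed to remove the user-defined function from the catalog ...
}
```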
[jira] [Updated] (SPARK-15210) Add missing @DeveloperApi annotation in sql.types
[ https://issues.apache.org/jira/browse/SPARK-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15210: -- Assignee: zhengruifeng > Add missing @DeveloperApi annotation in sql.types > - > > Key: SPARK-15210 > URL: https://issues.apache.org/jira/browse/SPARK-15210 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.0.0 > > > @DeveloperApi annotations for {{AbstractDataType}}, {{MapType}} and > {{UserDefinedType}} are missing.
[jira] [Resolved] (SPARK-15220) Add hyperlink to "running application" and "completed application"
[ https://issues.apache.org/jira/browse/SPARK-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15220. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add hyperlink to "running application" and "completed application" > -- > > Key: SPARK-15220 > URL: https://issues.apache.org/jira/browse/SPARK-15220 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Mao, Wei >Priority: Minor > Fix For: 2.0.0 > > > Add hyperlinks to "running application" and "completed application", so users > can jump to the application table directly. In my environment, I set up 1000+ > workers and it's painful to scroll down to skip the worker list.
[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)
[ https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276743#comment-15276743 ] Davies Liu commented on SPARK-14946: [~raymond.honderd...@sizmek.com] It seems that the second job (scan of the bigger table) did not get started; could you try to disable the broadcast join by setting spark.sql.autoBroadcastJoinThreshold to 0? > Spark 2.0 vs 1.6.1 Query Time(out) > -- > > Key: SPARK-14946 > URL: https://issues.apache.org/jira/browse/SPARK-14946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Raymond Honderdors >Priority: Critical > Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, > spark-defaults.conf, spark-env.sh, version 1.6.1 screen 1 - thrift collect = > true.png, version 1.6.1 screen 1 thrift collect = false.png, version 1.6.1 > screen 2 thrift collect =false.png, version 2.0 -screen 1 thrift collect = > false.png, version 2.0 screen 2 thrift collect = true.png, versiuon 2.0 > screen 1 thrift collect = true.png > > > I run a query using the JDBC driver; on version 1.6.1 it returns after 5–6 min, > while the same query against version 2.0 fails after 2h (due to timeout). > For details on how to reproduce, also see the comments below. Here is what I tried. > I run the following query: select * from pe_servingdata sd inner join > pe_campaigns_gzip c on sd.campaignid = c.campaign_id ; > (with and without a counter and group by on campaigne_id) > I run spark 1.6.1 and Thriftserver, > then run the sql from beeline or squirrel; after a few min I get an answer > (0 rows), which is correct due to the fact my data did not have matching campaign > ids in both tables. > When I run spark 2.0 and Thriftserver, I once again run the sql statement and > after 2:30 min it gives up, but already after 30/60 sec I stop seeing > activity on the spark ui. > (sorry for the delay in completing the description of the bug, I was on and > off work due to national holidays)
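The suggested workaround can be sketched as follows. The property name is real; the `sqlContext` handle is assumed to exist in the session (as it does in spark-shell for 1.6).

```scala
// Sketch of the suggested workaround, assuming an existing Spark 1.6/2.0
// SQLContext named `sqlContext`. Setting the threshold to 0 disables
// broadcast joins, forcing a shuffle-based join instead.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0")

// From a JDBC/beeline session the equivalent SQL statement is:
//   SET spark.sql.autoBroadcastJoinThreshold=0;
```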
[jira] [Resolved] (SPARK-15067) YARN executors are launched with fixed perm gen size
[ https://issues.apache.org/jira/browse/SPARK-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15067. --- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > YARN executors are launched with fixed perm gen size > > > Key: SPARK-15067 > URL: https://issues.apache.org/jira/browse/SPARK-15067 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0, 1.6.1 >Reporter: Renato Falchi Brandão >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > It is impossible to change the executors max perm gen size using the property > "spark.executor.extraJavaOptions" when you are running on YARN. > When the JVM option "-XX:MaxPermSize" is set through the property > "spark.executor.extraJavaOptions", Spark put it properly in the shell command > that will start the JVM container but, in the ending of command, it sets > again this option using a fixed value of 256m, as you can see in the log I've > extracted: > 2016-04-30 17:20:12 INFO ExecutorRunnable:58 - > === > YARN executor launch context: > env: > CLASSPATH -> > {{PWD}}{{PWD}}/__spark__.jar$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client/*/usr/hdp/current/hadoop-client/lib/*/usr/hdp/current/hadoop-hdfs-client/*/usr/hdp/current/hadoop-hdfs-client/lib/*/usr/hdp/current/hadoop-yarn-client/*/usr/hdp/current/hadoop-yarn-client/lib/*/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/*:/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.jar:/etc/hadoop/conf/secure > SPARK_LOG_URL_STDERR -> > 
http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stderr?start=-4096 > SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1456962126505_329993 > SPARK_YARN_CACHE_FILES_FILE_SIZES -> 191719054,166 > SPARK_USER -> h_loadbd > SPARK_YARN_CACHE_FILES_VISIBILITIES -> PUBLIC,PUBLIC > SPARK_YARN_MODE -> true > SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1459806496093,1459808508343 > SPARK_LOG_URL_STDOUT -> > http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stdout?start=-4096 > SPARK_YARN_CACHE_FILES -> > hdfs://x/user/datalab/hdp/spark/lib/spark-assembly-1.6.0.2.3.4.1-10-hadoop2.7.1.2.3.4.1-10.jar#__spark__.jar,hdfs://tlvcluster/user/datalab/hdp/spark/conf/hive-site.xml#hive-site.xml > command: > {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m > -Xmx6144m '-XX:+PrintGCDetails' '-XX:MaxPermSize=1024M' > '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir={{PWD}}/tmp > '-Dspark.akka.timeout=30' '-Dspark.driver.port=62875' > '-Dspark.rpc.askTimeout=30' '-Dspark.rpc.lookupTimeout=30' > -Dspark.yarn.app.container.log.dir= -XX:MaxPermSize=256m > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@10.125.81.42:62875 --executor-id 1 --hostname > x0668sl.x.br --cores 1 --app-id application_1456962126505_329993 > --user-class-path file:$PWD/__app__.jar 1> /stdout 2> > /stderr > Analyzing the code is possible to see that all the options set in the > property "spark.executor.extraJavaOptions" are enclosed, one by one, in > single quotes (ExecutorRunnable.scala:151) before the launcher take the > decision if a default value has to be provided or not for the option > "-XX:MaxPermSize" (ExecutorRunnable.scala:202). > This decision is taken examining all the options set and looking for a string > starting with the value "-XX:MaxPermSize" (CommandBuilderUtils.java:328). If > that value is not found, the default value is set. 
> Because every option is enclosed in single quotes, no option will ever start with that > string, so a default value will always be provided. > A possible solution is to change the source code of CommandBuilderUtils.java at > line 328: > From-> if (arg.startsWith("-XX:MaxPermSize=")) > To-> if (arg.indexOf("-XX:MaxPermSize=") > -1)
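The mismatch is easy to demonstrate in isolation; a self-contained sketch of the check described above (the quoted argument mirrors what the launcher produces):

```scala
// Self-contained illustration of the bug described above: after the launcher
// wraps each extra Java option in single quotes, a startsWith check for
// "-XX:MaxPermSize=" no longer matches, so the 256m default gets appended.
val quotedArg = "'-XX:MaxPermSize=1024M'"  // as produced by the launcher

val foundByStartsWith = quotedArg.startsWith("-XX:MaxPermSize=")
val foundByIndexOf    = quotedArg.indexOf("-XX:MaxPermSize=") > -1
```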
[jira] [Resolved] (SPARK-15225) Replace SQLContext with SparkSession in Encoder documentation
[ https://issues.apache.org/jira/browse/SPARK-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15225. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Replace SQLContext with SparkSession in Encoder documentation > - > > Key: SPARK-15225 > URL: https://issues.apache.org/jira/browse/SPARK-15225 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0 > > > Encoder's doc mentions sqlContext.implicits._. We should use > sparkSession.implicits._ instead now.
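The documentation change in a nutshell, assuming a Spark 2.0 `SparkSession` named `spark`:

```scala
// With Spark 2.0 the implicit encoders are imported from a SparkSession
// (here assumed to be named `spark`) rather than from a SQLContext.
import spark.implicits._   // was: import sqlContext.implicits._

val ds = Seq(1, 2, 3).toDS()
```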
[jira] [Updated] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15223: -- Assignee: Philipp Hoffmann > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Assignee: Philipp Hoffmann >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name.
[jira] [Updated] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15223: -- Target Version/s: 1.6.2, 2.0.0 (was: 2.0.0) > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Assignee: Philipp Hoffmann >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name.
[jira] [Updated] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15223: -- Fix Version/s: 1.6.2 > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Assignee: Philipp Hoffmann >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name.
[jira] [Resolved] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15223. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Priority: Trivial > Fix For: 2.0.0 > > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name.
[jira] [Updated] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15223: -- Priority: Minor (was: Trivial) > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Priority: Minor > Fix For: 2.0.0 > > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name.
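For reference, the current property name as it would appear in `spark-defaults.conf` (the size-based rolling strategy shown here is one of the documented options; the 128 MiB value is only an example):

```
# Current property names (the old spark.executor.logs.rolling.size.maxBytes
# no longer exists); maxSize is in bytes.
spark.executor.logs.rolling.strategy  size
spark.executor.logs.rolling.maxSize   134217728
```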
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276723#comment-15276723 ] Miles Crawford commented on SPARK-11293: Also biting us in 1.6.1 - we have to repartition our dataset into many thousands of partitions to avoid the following stack: {code} 2016-05-08 16:05:11,941 ERROR org.apache.spark.executor.Executor: Managed memory leak detected; size = 5748783225 bytes, TID = 1283 2016-05-08 16:05:11,948 ERROR org.apache.spark.executor.Executor: Exception in task 116.4 in stage 1.0 (TID 1283) java.lang.OutOfMemoryError: Unable to acquire 2383 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at 
org.apache.spark.scheduler.Task.run(Task.scala:89) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ~[d56f3336b4a0fcc71fe8beb90052dbafd0e88a749bdb4bbb15d37894cf443364-spark-core_2.11-1.6.1.jar:1.6.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]{code} > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources.
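The `CompletionIterator` fix mentioned above follows a general pattern: wrap an iterator so that a cleanup callback (e.g. releasing sorter memory via `stop()`) runs exactly once when the iterator is exhausted. A standalone sketch of that pattern (Spark's internal `CompletionIterator` differs in detail):

```scala
// Standalone sketch of the CompletionIterator pattern: run `completion`
// exactly once as soon as the wrapped iterator reports it has no more
// elements, so resources are freed without the caller remembering to.
class CompletionIterator[A](sub: Iterator[A], completion: () => Unit)
    extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; completion() }
    more
  }
  def next(): A = sub.next()
}
```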
[jira] [Resolved] (SPARK-15093) create/delete/rename directory for InMemoryCatalog operations if needed
[ https://issues.apache.org/jira/browse/SPARK-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-15093. --- Resolution: Fixed Fix Version/s: 2.0.0 > create/delete/rename directory for InMemoryCatalog operations if needed > --- > > Key: SPARK-15093 > URL: https://issues.apache.org/jira/browse/SPARK-15093 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Commented] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276693#comment-15276693 ] Vijay Parmar commented on SPARK-15221: -- Thank you, Sean! I am new to Spark and still learning things. I am still not sure about this issue, as all of this log output gets generated when I run the command to start Spark. Will take care of the things you have pointed out. > error: not found: value sqlContext when starting Spark 1.6.1 > > > Key: SPARK-15221 > URL: https://issues.apache.org/jira/browse/SPARK-15221 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor >Reporter: Vijay Parmar >Priority: Blocker > Labels: build, newbie > > When I start Spark (version 1.6.1), at the very end I am getting the > following error message: > :16: error: not found: value sqlContext > import sqlContext.implicits._ > ^ > :16: error: not found: value sqlContext > import sqlContext.sql > I have gone through some content on the web about editing the /.bashrc file > and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. > Also tried editing the /etc/hosts file with :- > $ sudo vi /etc/hosts > ... > 127.0.0.1 > ... > but still the issue persists. Is it the issue with the build or something > else?
[jira] [Commented] (SPARK-15220) Add hyperlink to "running application" and "completed application"
[ https://issues.apache.org/jira/browse/SPARK-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276687#comment-15276687 ] Alex Bozarth commented on SPARK-15220: -- I'll take a look at this in my free moments today; seems quick. > Add hyperlink to "running application" and "completed application" > -- > > Key: SPARK-15220 > URL: https://issues.apache.org/jira/browse/SPARK-15220 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Mao, Wei >Priority: Minor > > Add hyperlinks to "running application" and "completed application", so users > can jump to the application tables directly. In my environment, I set up 1000+ > workers, and it's painful to scroll down past the worker list.
[jira] [Created] (SPARK-15227) InputStream stop-start semantics + empty implementations
Stas Levin created SPARK-15227: -- Summary: InputStream stop-start semantics + empty implementations Key: SPARK-15227 URL: https://issues.apache.org/jira/browse/SPARK-15227 Project: Spark Issue Type: Improvement Components: Input/Output, Streaming Affects Versions: 1.6.1 Reporter: Stas Levin Priority: Minor Hi, It seems that quite a few InputStreams currently leave the start and stop methods empty. I was hoping to hear your thoughts on: 1. Were there any particular reasons to leave these methods empty? 2. Do the stop/start semantics of InputStreams aim to support pause-resume use cases, or is stopping a one-way ticket? A pause-resume capability could be really useful for cases where one wishes to load new offline data for the streaming app to run on top of, without restarting the entire app. Thanks a lot, Stas
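The distinction the question raises (a terminal stop/start lifecycle versus a reversible pause/resume one) can be sketched abstractly. The class below is a hypothetical toy illustration, not Spark's InputDStream API:

```python
class ToyReceiver:
    """Hypothetical receiver contrasting the two lifecycles discussed above:
    stop() as a one-way ticket vs. pause()/resume() as a reversible state."""

    def __init__(self):
        self.state = "created"

    def start(self):
        if self.state == "stopped":
            # one-way semantics: once stopped, the receiver is done
            raise RuntimeError("stop() is terminal; a stopped receiver cannot restart")
        self.state = "running"

    def pause(self):
        if self.state == "running":
            self.state = "paused"   # reversible: new offline data could be loaded here

    def resume(self):
        if self.state == "paused":
            self.state = "running"

    def stop(self):
        self.state = "stopped"      # terminal

r = ToyReceiver()
r.start(); r.pause(); r.resume()
print(r.state)  # running
r.stop()
```

Under pause-resume semantics the app could swap in fresh offline data while paused; under terminal stop semantics the whole app must be restarted, which is exactly the cost the reporter wants to avoid.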
[jira] [Commented] (SPARK-14898) MultivariateGaussian could use Cholesky in calculateCovarianceConstants
[ https://issues.apache.org/jira/browse/SPARK-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276632#comment-15276632 ] Miao Wang commented on SPARK-14898: --- I agree. I tried to understand what I should do in this JIRA. It seems that we don't have to change anything in terms of using Cholesky. Thanks! > MultivariateGaussian could use Cholesky in calculateCovarianceConstants > --- > > Key: SPARK-14898 > URL: https://issues.apache.org/jira/browse/SPARK-14898 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml.stat.distribution.MultivariateGaussian, > calculateCovarianceConstants uses SVD. It might be more efficient to use > Cholesky. We should check other numerical libraries and see if we should > switch to Cholesky. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
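For intuition, here is a pure-Python sketch (not Spark code) of why Cholesky is attractive for this computation, under the assumption that the covariance matrix is strictly positive definite; handling the singular case via a pseudo-determinant is one reason an SVD is used today. The log-determinant that a routine like calculateCovarianceConstants needs falls straight out of the Cholesky factor's diagonal:

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor L of a symmetric
    positive-definite matrix a, so that a = L * L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

# Made-up 2x2 covariance matrix with determinant 4*3 - 2*2 = 8.
cov = [[4.0, 2.0], [2.0, 3.0]]
l = cholesky(cov)

# log|Sigma| = 2 * sum(log L_ii): one triangular factorization,
# typically cheaper in practice than a full SVD of the same matrix.
log_det = 2.0 * sum(math.log(l[i][i]) for i in range(len(l)))
print(round(log_det, 6))  # 2.079442 == log(8)
```

The same factor also gives triangular solves for the Mahalanobis term, so both constants come from one decomposition.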
[jira] [Closed] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN closed SPARK-15122. -- Verified successfully in 0508 build. Thanks! > TPC-DS Qury 41 fails with The correlated scalar subquery can only contain > equality predicates > - > > Key: SPARK-15122 > URL: https://issues.apache.org/jira/browse/SPARK-15122 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Assignee: Herman van Hovell >Priority: Critical > Fix For: 2.0.0 > > > The official TPC-DS query 41 fails with the following error: > {noformat} > Error in query: The correlated scalar subquery can only contain equality > predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) > && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) > || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra > large || (((i_category#36 = Women) && ((i_color#41 = brown) || > (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && > ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && > ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || > (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || > (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = > cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 > = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = > i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || > (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && > ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = > Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = > Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) > || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = > frosted))) && (((i_units#42 = Each) || 
(i_units#42 = Tbl)) && ((i_size#39 = > petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 > = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = > Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large; > {noformat} > The output plans showed the following errors > {noformat} > == Parsed Logical Plan == > 'GlobalLimit 100 > +- 'LocalLimit 100 >+- 'Sort ['i_product_name ASC], true > +- 'Distinct > +- 'Project ['i_product_name] > +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + > 40))) && (scalar-subquery#1 [] > 0)) >: +- 'SubqueryAlias scalar-subquery#1 [] >: +- 'Project ['count(1) AS item_cnt#0] >:+- 'Filter ((('i_manufact = 'i1.i_manufact) && > ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && > ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = > extra large || ((('i_category = Women) && (('i_color = brown) || > ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && > (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && > (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || > ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || > ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && > ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size > = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = > Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = > Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra > large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = > papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || > ('i_size = small) || 'i_category = Men) && (('i_color = orange) || > ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && > (('i_size = petite) || ('i_size = large || ((('i_category = Men) && > (('i_color = forest) || ('i_color 
= ghost))) && ((('i_units = Lb) || > ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large >: +- 'UnresolvedRelation `item`, None >+- 'UnresolvedRelation `item`, Some(i1) > == Analyzed Logical Plan == > i_product_name: string > GlobalLimit 100 > +- LocalLimit 100 >+- Sort [i_product_name#24 ASC], true > +- Distinct > +- Project [i_product_name#24] > +- Filter (((i_manufact_id#16L >= cast(738 as bigint)) && >
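For readers unfamiliar with the construct, the failing query's shape is a correlated scalar subquery: a subquery that references a column of the outer row and returns a single value. A minimal stand-alone illustration (using Python's sqlite3 and made-up data, where Spark's equality-predicate restriction does not apply) looks like:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item (i_manufact INT, i_color TEXT);
INSERT INTO item VALUES (1, 'powder'), (1, 'khaki'), (2, 'forest');
""")

# Correlated scalar subquery: for each outer row i1, count the rows
# sharing its i_manufact (the correlated *equality* predicate, which
# Spark 2.0 required) combined with disjunctions on other columns
# (the part that, at scale, triggered the reported analysis error).
rows = con.execute("""
SELECT i1.i_manufact,
       (SELECT COUNT(*) FROM item
        WHERE i_manufact = i1.i_manufact
          AND (i_color = 'powder' OR i_color = 'khaki')) AS item_cnt
FROM item i1
""").fetchall()
print(sorted(rows))  # [(1, 2), (1, 2), (2, 0)]
```

Query 41 follows this pattern with the subquery's count compared against 0 in the outer filter; the analyzer error above came from how the large OR/AND predicate interacted with the correlated equality condition.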
[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276600#comment-15276600 ] JESSE CHEN commented on SPARK-15122: works great! now all 99 queries pass! nicely done!
[jira] [Updated] (SPARK-15154) LongHashedRelation test fails on Big Endian platform
[ https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Robbins updated SPARK-15154: - Priority: Minor (was: Major) Summary: LongHashedRelation test fails on Big Endian platform (was: LongHashedRelation fails on Big Endian platform) > LongHashedRelation test fails on Big Endian platform > > > Key: SPARK-15154 > URL: https://issues.apache.org/jira/browse/SPARK-15154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Pete Robbins >Priority: Minor > Labels: big-endian > > NPE in > org.apache.spark.sql.execution.joins.HashedRelationSuite.LongToUnsafeRowMap > Error Message > java.lang.NullPointerException was thrown. > Stacktrace > java.lang.NullPointerException > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(HashedRelationSuite.scala:121) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply$mcV$sp(HashedRelationSuite.scala:119) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) > at > 
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526) > at > org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29) > at org.scalatest.Suite$class.run(Suite.scala:1421) > at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:29) > at
[jira] [Commented] (SPARK-15154) LongHashedRelation fails on Big Endian platform
[ https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276552#comment-15276552 ] Apache Spark commented on SPARK-15154: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/13009
[jira] [Assigned] (SPARK-15154) LongHashedRelation fails on Big Endian platform
[ https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15154: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-15154) LongHashedRelation fails on Big Endian platform
[ https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15154: Assignee: Apache Spark
[jira] [Commented] (SPARK-15154) LongHashedRelation fails on Big Endian platform
[ https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276543#comment-15276543 ] Pete Robbins commented on SPARK-15154: -- I'm convinced the test is invalid. The creation of LongHashedRelation is guarded by {code} if (key.length == 1 && key.head.dataType == LongType) { LongHashedRelation(input, key, sizeEstimate, mm) } {code} In this failing test the key dataType is IntegerType I'll submit a PR to fix the tests > LongHashedRelation fails on Big Endian platform > --- > > Key: SPARK-15154 > URL: https://issues.apache.org/jira/browse/SPARK-15154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Pete Robbins > Labels: big-endian > > NPE in > org.apache.spark.sql.execution.joins.HashedRelationSuite.LongToUnsafeRowMap > Error Message > java.lang.NullPointerException was thrown. > Stacktrace > java.lang.NullPointerException > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(HashedRelationSuite.scala:121) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply$mcV$sp(HashedRelationSuite.scala:119) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112) > at > org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57) > at > 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29) > at 
org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526) > at > org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29) > at
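The guard quoted in the comment above only admits a single LongType key, while the failing test feeds it an IntegerType key. A rough Python sketch (using the standard `struct` module as a stand-in for Spark's row-buffer reads; none of this is Spark code) shows why reading a 4-byte integer through an 8-byte long code path can pass by accident on little-endian machines and fail on big-endian ones:

```python
import struct

# A 4-byte integer value followed by 4 bytes of padding, as it might sit
# in a row buffer. An 8-byte read over these bytes recovers 42 only under
# little-endian byte order; under big-endian the value bytes land in the
# high-order positions and the result is an unrelated huge number.
buf = struct.pack("<i", 42) + b"\x00\x00\x00\x00"

little = struct.unpack("<q", buf)[0]  # little-endian 8-byte read
big = struct.unpack(">q", buf)[0]     # big-endian 8-byte read

print(little, big)  # little == 42; big == 42 << 56
```

This is consistent with the comment's diagnosis: the test, not the platform, is at fault, because it exercises the long-keyed path with a non-long key.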
[jira] [Commented] (SPARK-14813) ML 2.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276497#comment-15276497 ] Yanbo Liang commented on SPARK-14813: - [~holdenk] I'm not working on regression, please go ahead. I will open JIRAs if I start my work under this topic. > ML 2.0 QA: API: Python API coverage > --- > > Key: SPARK-14813 > URL: https://issues.apache.org/jira/browse/SPARK-14813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: holdenk > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below as "requires") for this list of > to-do items. > UPDATE: This only needs to cover spark.ml since spark.mllib is going into > maintenance mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14459) SQL partitioning must match existing tables, but is not checked.
[ https://issues.apache.org/jira/browse/SPARK-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276493#comment-15276493 ] Ryan Blue commented on SPARK-14459: --- Thank you [~lian cheng]! > SQL partitioning must match existing tables, but is not checked. > > > Key: SPARK-14459 > URL: https://issues.apache.org/jira/browse/SPARK-14459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.0 > > > Writing into partitioned Hive tables has unexpected results because the > table's partitioning is not detected and applied during the analysis phase. > For example, if I have two tables, {{source}} and {{partitioned}}, with the > same column types: > {code} > CREATE TABLE source (id bigint, data string, part string); > CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part > string); > // copy from source to partitioned > sqlContext.table("source").write.insertInto("partitioned") > {code} > Copying from {{source}} to {{partitioned}} succeeds, but results in 0 rows. > This works if I explicitly partition by adding > {{...write.partitionBy("part").insertInto(...)}}. This work-around isn't > obvious and is prone to error because the {{partitionBy}} must match the > table's partitioning, though it is not checked. > I think when relations are resolved, the partitioning should be checked and > updated if it isn't set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
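The fix the ticket proposes is an analysis-time check that the write's partitioning matches the target table's. A minimal sketch of that check in plain Python (the helper name and shapes are invented for illustration; this is not Spark's analyzer API):

```python
def check_partitioning(table_partition_cols, write_partition_cols):
    # Reject a write whose partition columns do not match the table's
    # declared partitioning, instead of silently writing 0 rows.
    if list(write_partition_cols) != list(table_partition_cols):
        raise ValueError(
            "Partitioning mismatch: table is partitioned by %r, "
            "but the write specified %r"
            % (list(table_partition_cols), list(write_partition_cols)))

check_partitioning(["part"], ["part"])  # matching case: no error
try:
    check_partitioning(["part"], [])    # the silent 0-rows case from the report
except ValueError as e:
    print(e)
```

The second call models the unchecked `insertInto("partitioned")` without `partitionBy("part")`; surfacing an error here is the behaviour the ticket asks for.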
[jira] [Commented] (SPARK-15192) RowEncoder needs to verify nullability in a more explicit way
[ https://issues.apache.org/jira/browse/SPARK-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276478#comment-15276478 ] Apache Spark commented on SPARK-15192: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13008 > RowEncoder needs to verify nullability in a more explicit way > - > > Key: SPARK-15192 > URL: https://issues.apache.org/jira/browse/SPARK-15192 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > When we create a Dataset from an RDD of rows with a specific schema, if the > nullability of a value does not match the nullability defined in the schema, > we will throw an exception that is not easy to understand. > It will be good to verify the nullability in a more explicit way. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schema = new StructType().add("a", StringType, false).add("b", > StringType, false) > val rdd = sc.parallelize(Row(null, "123") :: Row("234", null) :: Nil) > spark.createDataFrame(rdd, schema).show > {code} > {noformat} > java.lang.RuntimeException: Error while decoding: > java.lang.NullPointerException > createexternalrow(if (isnull(input[0, string])) null else input[0, > string].toString, if (isnull(input[1, string])) null else input[1, > string].toString, StructField(a,StringType,false), > StructField(b,StringType,false)) > :- if (isnull(input[0, string])) null else input[0, string].toString > : :- isnull(input[0, string]) > : : +- input[0, string] > : :- null > : +- input[0, string].toString > : +- input[0, string] > +- if (isnull(input[1, string])) null else input[1, string].toString >:- isnull(input[1, string]) >: +- input[1, string] >:- null >+- input[1, string].toString > +- input[1, string] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:244) > at > 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2119) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2407) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2118) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2125) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1859) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2437) > at org.apache.spark.sql.Dataset.head(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2075) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) > at org.apache.spark.sql.Dataset.show(Dataset.scala:530) > at org.apache.spark.sql.Dataset.show(Dataset.scala:490) > at org.apache.spark.sql.Dataset.show(Dataset.scala:499) > ... 
50 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241) > ... 72 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15192) RowEncoder needs to verify nullability in a more explicit way
[ https://issues.apache.org/jira/browse/SPARK-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15192: Assignee: (was: Apache Spark) > RowEncoder needs to verify nullability in a more explicit way > - > > Key: SPARK-15192 > URL: https://issues.apache.org/jira/browse/SPARK-15192 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > When we create a Dataset from an RDD of rows with a specific schema, if the > nullability of a value does not match the nullability defined in the schema, > we will throw an exception that is not easy to understand. > It will be good to verify the nullability in a more explicit way. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schema = new StructType().add("a", StringType, false).add("b", > StringType, false) > val rdd = sc.parallelize(Row(null, "123") :: Row("234", null) :: Nil) > spark.createDataFrame(rdd, schema).show > {code} > {noformat} > java.lang.RuntimeException: Error while decoding: > java.lang.NullPointerException > createexternalrow(if (isnull(input[0, string])) null else input[0, > string].toString, if (isnull(input[1, string])) null else input[1, > string].toString, StructField(a,StringType,false), > StructField(b,StringType,false)) > :- if (isnull(input[0, string])) null else input[0, string].toString > : :- isnull(input[0, string]) > : : +- input[0, string] > : :- null > : +- input[0, string].toString > : +- input[0, string] > +- if (isnull(input[1, string])) null else input[1, string].toString >:- isnull(input[1, string]) >: +- input[1, string] >:- null >+- input[1, string].toString > +- input[1, string] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:244) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2119) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2407) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2118) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2125) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1859) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2437) > at org.apache.spark.sql.Dataset.head(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2075) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) > at org.apache.spark.sql.Dataset.show(Dataset.scala:530) > at org.apache.spark.sql.Dataset.show(Dataset.scala:490) > at org.apache.spark.sql.Dataset.show(Dataset.scala:499) > ... 50 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241) > ... 
72 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15192) RowEncoder needs to verify nullability in a more explicit way
[ https://issues.apache.org/jira/browse/SPARK-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15192: Assignee: Apache Spark > RowEncoder needs to verify nullability in a more explicit way > - > > Key: SPARK-15192 > URL: https://issues.apache.org/jira/browse/SPARK-15192 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > When we create a Dataset from an RDD of rows with a specific schema, if the > nullability of a value does not match the nullability defined in the schema, > we will throw an exception that is not easy to understand. > It will be good to verify the nullability in a more explicit way. > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > val schema = new StructType().add("a", StringType, false).add("b", > StringType, false) > val rdd = sc.parallelize(Row(null, "123") :: Row("234", null) :: Nil) > spark.createDataFrame(rdd, schema).show > {code} > {noformat} > java.lang.RuntimeException: Error while decoding: > java.lang.NullPointerException > createexternalrow(if (isnull(input[0, string])) null else input[0, > string].toString, if (isnull(input[1, string])) null else input[1, > string].toString, StructField(a,StringType,false), > StructField(b,StringType,false)) > :- if (isnull(input[0, string])) null else input[0, string].toString > : :- isnull(input[0, string]) > : : +- input[0, string] > : :- null > : +- input[0, string].toString > : +- input[0, string] > +- if (isnull(input[1, string])) null else input[1, string].toString >:- isnull(input[1, string]) >: +- input[1, string] >:- null >+- input[1, string].toString > +- input[1, string] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:244) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2119) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2407) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2118) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2125) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1859) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2437) > at org.apache.spark.sql.Dataset.head(Dataset.scala:1858) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2075) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) > at org.apache.spark.sql.Dataset.show(Dataset.scala:530) > at org.apache.spark.sql.Dataset.show(Dataset.scala:490) > at org.apache.spark.sql.Dataset.show(Dataset.scala:499) > ... 50 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241) > ... 
72 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
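The explicit verification requested above can be sketched in plain Python: walk the row against the schema's nullability flags and fail with a readable message before any decoding happens. The function and schema shape here are stand-ins for illustration, not Spark's RowEncoder API:

```python
def verify_nullability(schema, row):
    # schema: list of (column_name, nullable) pairs, mimicking a StructType.
    # Raise a descriptive error instead of letting a NullPointerException
    # surface from deep inside generated decoding code.
    for (name, nullable), value in zip(schema, row):
        if value is None and not nullable:
            raise ValueError(
                "Column %r is declared non-nullable but the row "
                "contains a null value" % name)

schema = [("a", False), ("b", False)]
verify_nullability(schema, ("234", "123"))   # valid row: passes silently
try:
    verify_nullability(schema, (None, "123"))  # the case from the report
except ValueError as e:
    print(e)
```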
[jira] [Assigned] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15226: Assignee: Apache Spark > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15226: Assignee: (was: Apache Spark) > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276390#comment-15276390 ] Apache Spark commented on SPARK-15226: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/13007 > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15226) CSV file data-line with newline at first line load error
Weichen Xu created SPARK-15226: -- Summary: CSV file data-line with newline at first line load error Key: SPARK-15226 URL: https://issues.apache.org/jira/browse/SPARK-15226 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 2.1.0 Reporter: Weichen Xu CSV file such as: --- v1,v2,"v 3",v4,v5 a,b,c,d,e --- it contains two row,first row : v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is legal) second row: a,b,c,d,e then in spark-shell run commands like: val sqlContext = new org.apache.spark.sql.SQLContext(sc); var reader = sqlContext.read var df = reader.csv("path/to/csvfile") df.collect then we find the load data is wrong, the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
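The parse the report expects is standard CSV quoting: a newline inside a quoted field belongs to the field, not to the record separator. Python's `csv` module (shown here only to illustrate the expected result, not Spark's reader) handles the example from the ticket correctly:

```python
import csv
import io

# The ticket's example: the first record's third field contains a newline
# inside quotes, so the file holds two records of five fields each.
data = 'v1,v2,"v\n3",v4,v5\na,b,c,d,e\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['v1', 'v2', 'v\n3', 'v4', 'v5'], ['a', 'b', 'c', 'd', 'e']]
```

A reader that splits on raw newlines before quote-aware parsing sees three truncated columns for the first record, which matches the wrong output described above.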
[jira] [Updated] (SPARK-15188) PySpark NaiveBayes is missing Thresholds param
[ https://issues.apache.org/jira/browse/SPARK-15188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15188: --- Assignee: holdenk > PySpark NaiveBayes is missing Thresholds param > -- > > Key: SPARK-15188 > URL: https://issues.apache.org/jira/browse/SPARK-15188 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > > NaiveBayes in Python is missing thresholds param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15188) PySpark NaiveBayes is missing Thresholds param
[ https://issues.apache.org/jira/browse/SPARK-15188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15188: --- Summary: PySpark NaiveBayes is missing Thresholds param (was: NaiveBayes is missing Thresholds param) > PySpark NaiveBayes is missing Thresholds param > -- > > Key: SPARK-15188 > URL: https://issues.apache.org/jira/browse/SPARK-15188 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > NaiveBayes in Python is missing thresholds param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15212) CSV file reader when read file with first line schema do not filter blank in schema column name
[ https://issues.apache.org/jira/browse/SPARK-15212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276275#comment-15276275 ] Weichen Xu commented on SPARK-15212: en... but it may still cause problems; for example, the csv file header may contain a ` character, such as: col`1,col2,... So it is better to add a check that each column name read from the file is legal. > CSV file reader when read file with first line schema do not filter blank in > schema column name > --- > > Key: SPARK-15212 > URL: https://issues.apache.org/jira/browse/SPARK-15212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > for example, run the following code in spark-shell, > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > reader.option("header", true) > var df = reader.csv("file:///diskext/tdata/spark/d1.csv") > when the csv data file contains: > -- > col1, col2,col3,col4,col5 > 1997,Ford,E350,"ac, abs, moon",3000.00 > > > the first line contains schema, the col2 has a blank before it, > then the generated DataFrame's schema column name contains the blank. > This may cause potential problem for example > df.select("col2") > can't find the column, must use > df.select(" col2") > and if register the dataframe as a table, then do query, can't select col2. > df.registerTempTable("tab1"); > sqlContext.sql("select col2 from tab1"); //will fail > must add a column name validate when load csv file with schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
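Both points in this thread (trimming the stray blank and rejecting names that would need escaping) can be combined into one header-sanitising step. A hypothetical sketch in plain Python; the helper and its rules are invented for illustration and are not Spark's CSV reader:

```python
def sanitize_header(raw_cols):
    # Trim surrounding whitespace from each header cell, then reject
    # names that are empty or contain characters that would require
    # escaping in SQL (space or backtick).
    cleaned = [c.strip() for c in raw_cols]
    for c in cleaned:
        if not c or "`" in c or " " in c:
            raise ValueError("illegal column name: %r" % c)
    return cleaned

print(sanitize_header(["col1", " col2", "col3"]))  # ' col2' becomes 'col2'
try:
    sanitize_header(["col`1", "col2"])  # the backtick case from the comment
except ValueError as e:
    print(e)
```

With the leading blank trimmed, `df.select("col2")` and `select col2 from tab1` in the quoted example would resolve as the user expects.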
[jira] [Commented] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276190#comment-15276190 ] Apache Spark commented on SPARK-15223: -- User 'ashishawasthi' has created a pull request for this issue: https://github.com/apache/spark/pull/13004 > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Priority: Trivial > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15223: -- Priority: Trivial (was: Major) Not Major, and too Trivial even for a JIRA > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Priority: Trivial > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
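Beyond fixing the documentation, a common pattern for surviving a configuration rename like this is a deprecation table consulted at lookup time. A minimal sketch; the helper is invented for illustration and is not Spark's `SparkConf` API:

```python
# Map each retired configuration key to its current name.
DEPRECATED_KEYS = {
    "spark.executor.logs.rolling.size.maxBytes":
        "spark.executor.logs.rolling.maxSize",
}

def get_conf(conf, key):
    # Redirect a deprecated key to its current name before the lookup,
    # so configs written against old docs keep working.
    return conf.get(DEPRECATED_KEYS.get(key, key))

conf = {"spark.executor.logs.rolling.maxSize": "134217728"}
print(get_conf(conf, "spark.executor.logs.rolling.size.maxBytes"))
print(get_conf(conf, "spark.executor.logs.rolling.maxSize"))
```

Both lookups resolve to the same stored value, whether the caller uses the old or the new key.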
[jira] [Assigned] (SPARK-15225) Replace SQLContext with SparkSession in Encoder documentation
[ https://issues.apache.org/jira/browse/SPARK-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15225: Assignee: (was: Apache Spark) > Replace SQLContext with SparkSession in Encoder documentation > - > > Key: SPARK-15225 > URL: https://issues.apache.org/jira/browse/SPARK-15225 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Liang-Chi Hsieh >Priority: Minor > > Encoder's doc mentions sqlContext.implicits._. We should use > sparkSession.implicits._ instead now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15225) Replace SQLContext with SparkSession in Encoder documentation
[ https://issues.apache.org/jira/browse/SPARK-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15225: Assignee: Apache Spark > Replace SQLContext with SparkSession in Encoder documentation > - > > Key: SPARK-15225 > URL: https://issues.apache.org/jira/browse/SPARK-15225 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor > > Encoder's doc mentions sqlContext.implicits._. We should use > sparkSession.implicits._ instead now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15225) Replace SQLContext with SparkSession in Encoder documentation
[ https://issues.apache.org/jira/browse/SPARK-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276159#comment-15276159 ] Apache Spark commented on SPARK-15225: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13002 > Replace SQLContext with SparkSession in Encoder documentation > - > > Key: SPARK-15225 > URL: https://issues.apache.org/jira/browse/SPARK-15225 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Liang-Chi Hsieh >Priority: Minor > > Encoder's doc mentions sqlContext.implicits._. We should use > sparkSession.implicits._ instead now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15225) Replace SQLContext with SparkSession in Encoder documentation
Liang-Chi Hsieh created SPARK-15225: --- Summary: Replace SQLContext with SparkSession in Encoder documentation Key: SPARK-15225 URL: https://issues.apache.org/jira/browse/SPARK-15225 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Liang-Chi Hsieh Priority: Minor Encoder's doc mentions sqlContext.implicits._. We should use sparkSession.implicits._ instead now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15223: Assignee: (was: Apache Spark) > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276156#comment-15276156 ] Apache Spark commented on SPARK-15223: -- User 'philipphoffmann' has created a pull request for this issue: https://github.com/apache/spark/pull/13001 > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
[ https://issues.apache.org/jira/browse/SPARK-15223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15223: Assignee: Apache Spark > spark.executor.logs.rolling.maxSize wrongly referred to as > spark.executor.logs.rolling.size.maxBytes > > > Key: SPARK-15223 > URL: https://issues.apache.org/jira/browse/SPARK-15223 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Assignee: Apache Spark > > The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was > changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is > however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15224) Can not delete jar and list jar in spark Thrift server
[ https://issues.apache.org/jira/browse/SPARK-15224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-15224: - Description: When you try to delete a jar and exec delete jar or list jar in your Beeline client, it throws an exception:
delete jar;
Error: org.apache.spark.sql.AnalysisException: line 1:7 missing FROM at 'jars' near 'jars' line 1:12 missing EOF at 'myudfs' near 'jars'; (state=,code=0)
list jar;
Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'list' 'jars' ''; line 1 pos 0 (state=,code=0)
{code:title=funnlog.log|borderStyle=solid}
16/05/09 17:26:52 INFO thriftserver.SparkExecuteStatementOperation: Running query 'list jar' with 1da09765-efb4-42dc-8890-3defca40f89d
16/05/09 17:26:52 INFO parse.ParseDriver: Parsing command: list jar
NoViableAltException(26@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1071)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
	at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
	at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
	at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
	at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
	at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
	at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
	at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
	at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
	at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
	at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
	at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
	at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
	at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
	at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
	at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
	at org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
	at org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:293)
	at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:240)
	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:239)
	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:282)
	at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65)
	at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
	at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
	at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
	at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
	at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
	at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
	at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
	at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
	at
[jira] [Created] (SPARK-15223) spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes
Philipp Hoffmann created SPARK-15223: Summary: spark.executor.logs.rolling.maxSize wrongly referred to as spark.executor.logs.rolling.size.maxBytes Key: SPARK-15223 URL: https://issues.apache.org/jira/browse/SPARK-15223 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.6.1 Reporter: Philipp Hoffmann The configuration setting {{spark.executor.logs.rolling.size.maxBytes}} was changed to {{spark.executor.logs.rolling.maxSize}} in 1.4 or so. There is however still a reference in the documentation using the old name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
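Editor's note: the fix is documentation-only, but for users hit by the rename: on Spark 1.4+ the old key is silently ignored and only the new name takes effect. A minimal {{spark-defaults.conf}} fragment (values are illustrative):

```properties
# Roll executor logs by size. The old key
# spark.executor.logs.rolling.size.maxBytes no longer works on Spark 1.4+.
spark.executor.logs.rolling.strategy          size
spark.executor.logs.rolling.maxSize           134217728
spark.executor.logs.rolling.maxRetainedFiles  5
```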
[jira] [Created] (SPARK-15224) Can not delete jar and list jar in spark Thrift server
poseidon created SPARK-15224: Summary: Can not delete jar and list jar in spark Thrift server Key: SPARK-15224 URL: https://issues.apache.org/jira/browse/SPARK-15224 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Environment: spark 1.6.1 hive 1.2.1 hdfs 2.7.1 Reporter: poseidon When you try to delete a jar and exec delete jar or list jar in your Beeline client, it throws an exception: delete jar; Error: org.apache.spark.sql.AnalysisException: line 1:7 missing FROM at 'jars' near 'jars' line 1:12 missing EOF at 'myudfs' near 'jars'; (state=,code=0) list jar; Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'list' 'jars' ''; line 1 pos 0 (state=,code=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15192) RowEncoder needs to verify nullability in a more explicit way
[ https://issues.apache.org/jira/browse/SPARK-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15192: --- Description: When we create a Dataset from an RDD of rows with a specific schema, if the nullability of a value does not match the nullability defined in the schema, we will throw an exception that is not easy to understand. It will be good to verify the nullability in a more explicit way.
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = new StructType().add("a", StringType, false).add("b", StringType, false)
val rdd = sc.parallelize(Row(null, "123") :: Row("234", null) :: Nil)
spark.createDataFrame(rdd, schema).show
{code}
{noformat}
java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException
createexternalrow(if (isnull(input[0, string])) null else input[0, string].toString, if (isnull(input[1, string])) null else input[1, string].toString, StructField(a,StringType,false), StructField(b,StringType,false))
:- if (isnull(input[0, string])) null else input[0, string].toString
:  :- isnull(input[0, string])
:  :  +- input[0, string]
:  :- null
:  +- input[0, string].toString
:     +- input[0, string]
+- if (isnull(input[1, string])) null else input[1, string].toString
   :- isnull(input[1, string])
   :  +- input[1, string]
   :- null
   +- input[1, string].toString
      +- input[1, string]

	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:244)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$13.apply(Dataset.scala:2119)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2119)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2407)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2118)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2125)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1859)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1858)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2437)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:1858)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2075)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:530)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:490)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:499)
	... 50 elided
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:241)
	... 72 more
{noformat}
was: When we create a Dataset from an RDD of rows with a specific schema, if the nullability of a value does not match the nullability defined in the schema, we will throw an exception that is not easy to understand. It will be good to verify the nullability in a more explicit way.
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = new StructType().add("a", StringType, false).add("b", StringType, false)
val rdd = sc.parallelize(Row(null, "123") :: Row("234", null) :: Nil)
spark.createDataFrame(rdd, schema).show

java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException
createexternalrow(if (isnull(input[0, string])) null else input[0, string].toString, if (isnull(input[1, string])) null else input[1, string].toString, StructField(a,StringType,false), StructField(b,StringType,false))
:- if (isnull(input[0, string])) null else input[0, string].toString
:  :- isnull(input[0, string])
:  :  +- input[0, string]
:  :- null
:  +- input[0, string].toString
:     +- input[0, string]
+- if (isnull(input[1, string])) null else input[1, string].toString
   :- isnull(input[1, string])
   :  +- input[1, string]
   :- null
   +- input[1, string].toString
      +- input[1, string]
	at
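Editor's note: until the encoder produces a friendlier error, callers can verify nullability up front themselves. A hedged sketch (the helper name {{verifyNullability}} is ours, not a Spark API, and this needs a Spark SQL dependency to compile):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Illustrative pre-check: fail fast with a readable message instead of the
// NullPointerException buried inside the generated decoder.
def verifyNullability(row: Row, schema: StructType): Unit =
  schema.fields.zipWithIndex.foreach { case (field, i) =>
    if (!field.nullable && row.isNullAt(i))
      throw new IllegalArgumentException(
        s"Column '${field.name}' is declared non-nullable but position $i of row $row is null")
  }
```

Running such a check over the RDD before `createDataFrame` would surface the bad row directly, which is the "more explicit" verification the issue asks for.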
[jira] [Updated] (SPARK-14459) SQL partitioning must match existing tables, but is not checked.
[ https://issues.apache.org/jira/browse/SPARK-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14459: --- Assignee: Ryan Blue > SQL partitioning must match existing tables, but is not checked. > > > Key: SPARK-14459 > URL: https://issues.apache.org/jira/browse/SPARK-14459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.0 > > > Writing into partitioned Hive tables has unexpected results because the > table's partitioning is not detected and applied during the analysis phase. > For example, if I have two tables, {{source}} and {{partitioned}}, with the > same column types: > {code} > CREATE TABLE source (id bigint, data string, part string); > CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part > string); > // copy from source to partitioned > sqlContext.table("source").write.insertInto("partitioned") > {code} > Copying from {{source}} to {{partitioned}} succeeds, but results in 0 rows. > This works if I explicitly partition by adding > {{...write.partitionBy("part").insertInto(...)}}. This work-around isn't > obvious and is prone to error because the {{partitionBy}} must match the > table's partitioning, though it is not checked. > I think when relations are resolved, the partitioning should be checked and > updated if it isn't set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14459) SQL partitioning must match existing tables, but is not checked.
[ https://issues.apache.org/jira/browse/SPARK-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-14459. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12239 [https://github.com/apache/spark/pull/12239] > SQL partitioning must match existing tables, but is not checked. > > > Key: SPARK-14459 > URL: https://issues.apache.org/jira/browse/SPARK-14459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.0 > > > Writing into partitioned Hive tables has unexpected results because the > table's partitioning is not detected and applied during the analysis phase. > For example, if I have two tables, {{source}} and {{partitioned}}, with the > same column types: > {code} > CREATE TABLE source (id bigint, data string, part string); > CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part > string); > // copy from source to partitioned > sqlContext.table("source").write.insertInto("partitioned") > {code} > Copying from {{source}} to {{partitioned}} succeeds, but results in 0 rows. > This works if I explicitly partition by adding > {{...write.partitionBy("part").insertInto(...)}}. This work-around isn't > obvious and is prone to error because the {{partitionBy}} must match the > table's partitioning, though it is not checked. > I think when relations are resolved, the partitioning should be checked and > updated if it isn't set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14459) SQL partitioning must match existing tables, but is not checked.
[ https://issues.apache.org/jira/browse/SPARK-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-14459: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > SQL partitioning must match existing tables, but is not checked. > > > Key: SPARK-14459 > URL: https://issues.apache.org/jira/browse/SPARK-14459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue > Fix For: 2.0.0 > > > Writing into partitioned Hive tables has unexpected results because the > table's partitioning is not detected and applied during the analysis phase. > For example, if I have two tables, {{source}} and {{partitioned}}, with the > same column types: > {code} > CREATE TABLE source (id bigint, data string, part string); > CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (part > string); > // copy from source to partitioned > sqlContext.table("source").write.insertInto("partitioned") > {code} > Copying from {{source}} to {{partitioned}} succeeds, but results in 0 rows. > This works if I explicitly partition by adding > {{...write.partitionBy("part").insertInto(...)}}. This work-around isn't > obvious and is prone to error because the {{partitionBy}} must match the > table's partitioning, though it is not checked. > I think when relations are resolved, the partitioning should be checked and > updated if it isn't set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
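Editor's note: the work-around described above, spelled out; the column name passed to `partitionBy` must exactly match the Hive table's partition column, and (as the issue notes) this is not checked:

```scala
// Work-around prior to the fix: declare the partitioning explicitly.
// "part" must match the partition column of the target table `partitioned`.
sqlContext.table("source")
  .write
  .partitionBy("part")
  .insertInto("partitioned")
```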
[jira] [Commented] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'
[ https://issues.apache.org/jira/browse/SPARK-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276129#comment-15276129 ] Viacheslav Saevskiy commented on SPARK-15218: - it's a Java-specific issue and I didn't find a way to escape ':' in the classpath. What Mesos version do you use? It's possible that this issue was fixed in Mesos 0.18: https://issues.apache.org/jira/browse/MESOS-1128 > Error: Could not find or load main class org.apache.spark.launcher.Main when > run from a directory containing colon ':' > -- > > Key: SPARK-15218 > URL: https://issues.apache.org/jira/browse/SPARK-15218 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Adam Cecile > Labels: mesos > > {noformat} > mkdir /tmp/qwe:rtz > cd /tmp/qwe:rtz > wget > http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz > tar xvzf spark-1.6.1-bin-without-hadoop.tgz > cd spark-1.6.1-bin-without-hadoop/ > bin/spark-submit > {noformat} > Returns "Error: Could not find or load main class > org.apache.spark.launcher.Main". > That would not be such an issue if the Mesos executor did not have a colon in the > generated paths. It means without hacking (defining a relative SPARK_HOME path > by myself) there's no way to run a Spark job inside a Mesos job container... > Best regards, Adam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15172) Warning message should explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15172. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12948 [https://github.com/apache/spark/pull/12948] > Warning message should explicitly tell user initial coefficients is ignored > if its size doesn't match expected size in LogisticRegression > - > > Key: SPARK-15172 > URL: https://issues.apache.org/jira/browse/SPARK-15172 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: ding >Assignee: ding >Priority: Trivial > Fix For: 2.1.0 > > > In the ML LogisticRegression code, if the size of the initial coefficients > doesn't match the expected size, the initial coefficients are ignored. We > should explicitly tell the user this. Besides, logging the size of the initial > coefficients is more straightforward than logging their values when a size > mismatch happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15222) SparkR ML examples update in 2.0
[ https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15222: Assignee: (was: Apache Spark) > SparkR ML examples update in 2.0 > > > Key: SPARK-15222 > URL: https://issues.apache.org/jira/browse/SPARK-15222 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Priority: Minor > > Update example code in examples/src/main/r/ml.R to reflect the new algorithms. > * spark.glm and glm > * spark.survreg > * spark.naiveBayes > * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15222) SparkR ML examples update in 2.0
[ https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276114#comment-15276114 ] Apache Spark commented on SPARK-15222: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/13000 > SparkR ML examples update in 2.0 > > > Key: SPARK-15222 > URL: https://issues.apache.org/jira/browse/SPARK-15222 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Priority: Minor > > Update example code in examples/src/main/r/ml.R to reflect the new algorithms. > * spark.glm and glm > * spark.survreg > * spark.naiveBayes > * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15222) SparkR ML examples update in 2.0
[ https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15222: Assignee: Apache Spark > SparkR ML examples update in 2.0 > > > Key: SPARK-15222 > URL: https://issues.apache.org/jira/browse/SPARK-15222 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Update example code in examples/src/main/r/ml.R to reflect the new algorithms. > * spark.glm and glm > * spark.survreg > * spark.naiveBayes > * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276109#comment-15276109 ] Xin Hao edited comment on SPARK-4452 at 5/9/16 8:36 AM: Since this is an old issue which has impacted Spark since 1.1.0, can the patch be merged to Spark 1.6.X ? This will be very helpful for Spark 1.6.X users. Thanks. was (Author: xhao1): Since this is an old issue which has impacted Spark since 1.1.0, can the patch be merged to Spark 1.6.X ? Thanks. > Shuffle data structures can starve others on the same thread for memory > > > Key: SPARK-4452 > URL: https://issues.apache.org/jira/browse/SPARK-4452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tianshuo Deng >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > When an Aggregator is used with ExternalSorter in a task, Spark will create > many small files and could cause a "too many open files" error during merging. > Currently, ShuffleMemoryManager does not work well when there are 2 spillable > objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used > by Aggregator) in this case. Here is an example: Due to the usage of map-side > aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may > ask for as much memory as it can, which is totalMem/numberOfThreads. Then later > on when ExternalSorter is created in the same thread, the > ShuffleMemoryManager could refuse to allocate more memory to it, since the > memory is already given to the previously requested > object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling > small files (due to the lack of memory). > I'm currently working on a PR to address these two issues. It will include > the following changes: > 1. The ShuffleMemoryManager should not only track the memory usage for each > thread, but also the object that holds the memory > 2. The ShuffleMemoryManager should be able to trigger the spilling of a > spillable object. In this way, if a new object in a thread is requesting > memory, the old occupant could be evicted/spilled. Previously the spillable > objects triggered spilling by themselves. So one may not trigger spilling even > if another object in the same thread needs more memory. After this change the > ShuffleMemoryManager can trigger the spilling of an object if it needs to. > 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously > ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled > after the iterator is returned. This should be changed so that even after the > iterator is returned, the ShuffleMemoryManager can still spill it. > Currently, I have a working branch in progress: > https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made > change 3 and have a prototype of changes 1 and 2 to evict spillables from the > memory manager, still in progress. I will send a PR when it's done. > Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276109#comment-15276109 ] Xin Hao commented on SPARK-4452: Since this is an old issue which has impacted Spark since 1.1.0, can the patch be merged to Spark 1.6.X ? Thanks. > Shuffle data structures can starve others on the same thread for memory > > > Key: SPARK-4452 > URL: https://issues.apache.org/jira/browse/SPARK-4452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tianshuo Deng >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > When an Aggregator is used with ExternalSorter in a task, Spark will create > many small files and could cause a "too many open files" error during merging. > Currently, ShuffleMemoryManager does not work well when there are 2 spillable > objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used > by Aggregator) in this case. Here is an example: Due to the usage of map-side > aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may > ask for as much memory as it can, which is totalMem/numberOfThreads. Then later > on when ExternalSorter is created in the same thread, the > ShuffleMemoryManager could refuse to allocate more memory to it, since the > memory is already given to the previously requested > object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling > small files (due to the lack of memory). > I'm currently working on a PR to address these two issues. It will include > the following changes: > 1. The ShuffleMemoryManager should not only track the memory usage for each > thread, but also the object that holds the memory > 2. The ShuffleMemoryManager should be able to trigger the spilling of a > spillable object. In this way, if a new object in a thread is requesting > memory, the old occupant could be evicted/spilled. Previously the spillable > objects triggered spilling by themselves. So one may not trigger spilling even > if another object in the same thread needs more memory. After this change the > ShuffleMemoryManager can trigger the spilling of an object if it needs to. > 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously > ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled > after the iterator is returned. This should be changed so that even after the > iterator is returned, the ShuffleMemoryManager can still spill it. > Currently, I have a working branch in progress: > https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made > change 3 and have a prototype of changes 1 and 2 to evict spillables from the > memory manager, still in progress. I will send a PR when it's done. > Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
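Editor's note: proposed changes 1 and 2 above (track the owning consumer, not just the thread, and let the manager force a spill) can be illustrated with a toy bookkeeping sketch. This is our own illustration, not Spark's actual ShuffleMemoryManager, and it simplifies spilling to dropping the largest sibling consumer's accounting:

```scala
// Toy model of per-(thread, consumer) memory accounting. On pressure it
// "spills" (forgets) the largest OTHER consumer on the same thread, instead
// of starving the new requester. Single-eviction only, for brevity.
final class ToyMemoryManager(totalBytes: Long) {
  private var used = Map.empty[(Long, String), Long] // (threadId, consumer) -> bytes
  private def totalUsed: Long = used.values.sum

  /** Returns how many bytes were actually granted. */
  def acquire(threadId: Long, consumer: String, bytes: Long): Long = synchronized {
    if (totalUsed + bytes > totalBytes) {
      val victims = used.filter { case ((t, c), _) => t == threadId && c != consumer }
      if (victims.nonEmpty) {
        used -= victims.maxBy(_._2)._1 // manager-triggered spill of the biggest sibling
      }
    }
    val grant = math.min(bytes, totalBytes - totalUsed)
    if (grant > 0)
      used = used.updated((threadId, consumer),
        used.getOrElse((threadId, consumer), 0L) + grant)
    grant
  }

  def usedBy(threadId: Long, consumer: String): Long =
    used.getOrElse((threadId, consumer), 0L)
}
```

In this model an ExternalAppendOnlyMap that grabbed the whole budget would be spilled when the ExternalSorter on the same thread asks for memory, rather than the sorter being repeatedly refused.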
[jira] [Created] (SPARK-15222) SparkR ML examples update in 2.0
Yanbo Liang created SPARK-15222: --- Summary: SparkR ML examples update in 2.0 Key: SPARK-15222 URL: https://issues.apache.org/jira/browse/SPARK-15222 Project: Spark Issue Type: Improvement Components: ML, SparkR Reporter: Yanbo Liang Priority: Minor Update example code in examples/src/main/r/ml.R to reflect the new algorithms. * spark.glm and glm * spark.survreg * spark.naiveBayes * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15159) Remove usage of HiveContext in SparkR.
[ https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276100#comment-15276100 ] Vijay Parmar commented on SPARK-15159: -- 1. I looked at the source code at https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R and found that "sparkRHivesc" appears on lines 193 and 194; other than that, it is not mentioned anywhere else in the code. I was a bit confused about whether this is the only change that needs to be made, or whether something else is needed as well. 2. I didn't feel the need for any change in the SparkR unit tests. Please let me know your opinion or suggestions so that I can proceed further on this. > Remove usage of HiveContext in SparkR. > -- > > Key: SPARK-15159 > URL: https://issues.apache.org/jira/browse/SPARK-15159 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > HiveContext is to be deprecated in 2.0. Replace it with > SparkSession.withHiveSupport in SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14898) MultivariateGaussian could use Cholesky in calculateCovarianceConstants
[ https://issues.apache.org/jira/browse/SPARK-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276090#comment-15276090 ] Sean Owen commented on SPARK-14898: --- I don't think the comment means that the SVD is used. It's noting that it could be used, but in this case, the desired product reduces to a simpler operation, an eigendecomposition. [~josephkb] I think this is perhaps out of date? > MultivariateGaussian could use Cholesky in calculateCovarianceConstants > --- > > Key: SPARK-14898 > URL: https://issues.apache.org/jira/browse/SPARK-14898 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml.stat.distribution.MultivariateGaussian, > calculateCovarianceConstants uses SVD. It might be more efficient to use > Cholesky. We should check other numerical libraries and see if we should > switch to Cholesky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
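To see why Cholesky is attractive for calculateCovarianceConstants: the Gaussian density needs log|Sigma| (and a way to apply the inverse), and for a symmetric positive-definite Sigma a Cholesky factorization Sigma = L L^T yields the log-determinant almost for free as 2 * sum(log L[i][i]). A small self-contained sketch (plain Python, not Spark's code):

```python
import math

# Cholesky factorization of a small symmetric positive-definite matrix,
# then the log-determinant from the diagonal of the factor. This is the
# quantity a multivariate Gaussian density needs; an SVD or eigen-
# decomposition computes it too, but at higher cost.

def cholesky(a):
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

sigma = [[4.0, 2.0],
         [2.0, 3.0]]
L = cholesky(sigma)
log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))
# det(sigma) = 4*3 - 2*2 = 8, so log_det should equal log(8)
print(abs(log_det - math.log(8.0)) < 1e-12)  # -> True
```

The caveat Sean's comment hints at is that Cholesky requires strict positive definiteness, while an eigendecomposition can tolerate (and pseudo-invert) a singular covariance, which may be why the existing code avoids it.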
[jira] [Resolved] (SPARK-15136) Linkify ML PyDoc
[ https://issues.apache.org/jira/browse/SPARK-15136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15136. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12918 [https://github.com/apache/spark/pull/12918] > Linkify ML PyDoc > > > Key: SPARK-15136 > URL: https://issues.apache.org/jira/browse/SPARK-15136 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Priority: Minor > Fix For: 2.0.0 > > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15136) Linkify ML PyDoc
[ https://issues.apache.org/jira/browse/SPARK-15136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15136: -- Assignee: holdenk > Linkify ML PyDoc > > > Key: SPARK-15136 > URL: https://issues.apache.org/jira/browse/SPARK-15136 > Project: Spark > Issue Type: Improvement >Reporter: holdenk >Assignee: holdenk >Priority: Minor > Fix For: 2.0.0 > > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15221. --- Resolution: Not A Problem Target Version/s: (was: 1.6.1) Exactly, this is the problem: Caused by: java.sql.SQLException: Directory /home/metastore_db cannot be created. You probably don't have permission to create that dir, but it's also probably not where you meant it to be created. You'd have to determine why you're trying to write into /home, but that's not a Spark issue per se. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first > error: not found: value sqlContext when starting Spark 1.6.1 > > > Key: SPARK-15221 > URL: https://issues.apache.org/jira/browse/SPARK-15221 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor >Reporter: Vijay Parmar >Priority: Blocker > Labels: build, newbie > > When I start Spark (version 1.6.1), at the very end I am getting the > following error message: > :16: error: not found: value sqlContext > import sqlContext.implicits._ > ^ > :16: error: not found: value sqlContext > import sqlContext.sql > I have gone through some content on the web about editing the /.bashrc file > and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. > Also tried editing the /etc/hosts file with :- > $ sudo vi /etc/hosts > ... > 127.0.0.1 > ... > but still the issue persists. Is it the issue with the build or something > else? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
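The diagnosis above can be checked outside Spark: with the default embedded Derby metastore, metastore_db is created under the directory the shell was started from, so the question is simply whether that working directory is writable. A minimal diagnostic sketch (plain Python; the helper name is hypothetical):

```python
import os
import tempfile

def can_create_metastore_db(workdir):
    # Derby needs to create the directory <workdir>/metastore_db, which
    # requires write and execute permission on workdir itself. If this
    # returns False, spark-shell started from workdir will fail with
    # "Directory .../metastore_db cannot be created", and sqlContext
    # never gets defined.
    return os.access(workdir, os.W_OK | os.X_OK)

writable = tempfile.mkdtemp()              # stand-in for a user-owned dir
print(can_create_metastore_db(writable))   # -> True
```

The practical fix implied by the resolution is simply to start spark-shell from a directory the user can write to, rather than from /home.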
[jira] [Resolved] (SPARK-13064) api/v1/application/jobs/attempt lacks "attempId" field for spark-shell
[ https://issues.apache.org/jira/browse/SPARK-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13064. --- Resolution: Won't Fix > api/v1/application/jobs/attempt lacks "attempId" field for spark-shell > -- > > Key: SPARK-13064 > URL: https://issues.apache.org/jira/browse/SPARK-13064 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Reporter: Zhuo Liu >Priority: Minor > > For any application launches with spark-shell will not have attemptId field > in their rest API. From the REST API point of view, we might want to force an > Id for it, i.e., "1". > {code} > { > "id" : "application_1453789230389_377545", > "name" : "PySparkShell", > "attempts" : [ { > "startTime" : "2016-01-28T02:17:11.035GMT", > "endTime" : "2016-01-28T02:30:01.355GMT", > "lastUpdated" : "2016-01-28T02:30:01.516GMT", > "duration" : 770320, > "sparkUser" : "huyng", > "completed" : true > } ] > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276071#comment-15276071 ] Vijay Parmar commented on SPARK-15221: -- Well it's a whole lot that gets generated before the one I posted but still sharing some of the part :- Caused by: java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details. at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.createDatabase(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection40.(Unknown Source) at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source) at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source) at org.apache.derby.jdbc.Driver20.connect(Unknown Source) at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:208) at org.apache.commons.dbcp.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:78) at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582) at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148) at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106) at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501) at org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) ... 104 more Caused by: java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) ... 131 more Caused by: java.sql.SQLException: Directory /home/metastore_db cannot be created. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) ... 128 more Caused by: ERROR XBM0H: Directory /home/metastore_db cannot be created. 
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.services.monitor.StorageFactoryService$10.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.services.monitor.StorageFactoryService.createServiceRoot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown Source) :16: error: not found: value sqlContext import sqlContext.implicits._ ^ :16: error: not found: value sqlContext import sqlContext.sql > error: not found: value sqlContext when starting Spark 1.6.1 >
[jira] [Commented] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276064#comment-15276064 ] Sean Owen commented on SPARK-15221: --- This virtually always happens when an earlier error occurred. Check farther up the console output. > error: not found: value sqlContext when starting Spark 1.6.1 > > > Key: SPARK-15221 > URL: https://issues.apache.org/jira/browse/SPARK-15221 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor >Reporter: Vijay Parmar >Priority: Blocker > Labels: build, newbie > > When I start Spark (version 1.6.1), at the very end I am getting the > following error message: > :16: error: not found: value sqlContext > import sqlContext.implicits._ > ^ > :16: error: not found: value sqlContext > import sqlContext.sql > I have gone through some content on the web about editing the /.bashrc file > and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. > Also tried editing the /etc/hosts file with :- > $ sudo vi /etc/hosts > ... > 127.0.0.1 > ... > but still the issue persists. Is it the issue with the build or something > else? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay Parmar updated SPARK-15221: - Priority: Blocker (was: Minor) > error: not found: value sqlContext when starting Spark 1.6.1 > > > Key: SPARK-15221 > URL: https://issues.apache.org/jira/browse/SPARK-15221 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor >Reporter: Vijay Parmar >Priority: Blocker > Labels: build, newbie > > When I start Spark (version 1.6.1), at the very end I am getting the > following error message: > :16: error: not found: value sqlContext > import sqlContext.implicits._ > ^ > :16: error: not found: value sqlContext > import sqlContext.sql > I have gone through some content on the web about editing the /.bashrc file > and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. > Also tried editing the /etc/hosts file with :- > $ sudo vi /etc/hosts > ... > 127.0.0.1 > ... > but still the issue persists. Is it the issue with the build or something > else? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay Parmar updated SPARK-15221: - Description: When I start Spark (version 1.6.1), at the very end I am getting the following error message: :16: error: not found: value sqlContext import sqlContext.implicits._ ^ :16: error: not found: value sqlContext import sqlContext.sql I have gone through some content on the web about editing the /.bashrc file and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. Also tried editing the /etc/hosts file with :- $ sudo vi /etc/hosts ... 127.0.0.1 ... but still the issue persists. Is it the issue with the build or something else? was: When I start Spark (version 1.6.1), at the very end I am getting the following error message: :16: error: not found: value sqlContext import sqlContext.implicits._ ^ :16: error: not found: value sqlContext import sqlContext.sql I have gone through some content on the web about editing the /.bashrc file and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. Also tried editing the /etc/hosts file with :- $ sudo vi /etc/hosts ... 127.0.0.1 ... but still the issue persists. Is it the issue with the build or something else? > error: not found: value sqlContext when starting Spark 1.6.1 > > > Key: SPARK-15221 > URL: https://issues.apache.org/jira/browse/SPARK-15221 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor >Reporter: Vijay Parmar >Priority: Minor > Labels: build, newbie > > When I start Spark (version 1.6.1), at the very end I am getting the > following error message: > :16: error: not found: value sqlContext > import sqlContext.implicits._ > ^ > :16: error: not found: value sqlContext > import sqlContext.sql > I have gone through some content on the web about editing the /.bashrc file > and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. 
> Also tried editing the /etc/hosts file with :- > $ sudo vi /etc/hosts > ... > 127.0.0.1 > ... > but still the issue persists. Is it the issue with the build or something > else? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15221) error: not found: value sqlContext when starting Spark 1.6.1
Vijay Parmar created SPARK-15221: Summary: error: not found: value sqlContext when starting Spark 1.6.1 Key: SPARK-15221 URL: https://issues.apache.org/jira/browse/SPARK-15221 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Environment: Ubuntu 14.0.4, 8 GB RAM, 1 Processor Reporter: Vijay Parmar Priority: Minor When I start Spark (version 1.6.1), at the very end I am getting the following error message: :16: error: not found: value sqlContext import sqlContext.implicits._ ^ :16: error: not found: value sqlContext import sqlContext.sql I have gone through some content on the web about editing the /.bashrc file and including the "SPARK_LOCAL_IP=127.0.0.1" under SPARK variables. Also tried editing the /etc/hosts file with :- $ sudo vi /etc/hosts ... 127.0.0.1 ... but still the issue persists. Is it the issue with the build or something else? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization
[ https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276055#comment-15276055 ] Yi Zhou commented on SPARK-15219: - Posted the core physical plan > [Spark SQL] it don't support to detect runtime temporary table for enabling > broadcast hash join optimization > > > Key: SPARK-15219 > URL: https://issues.apache.org/jira/browse/SPARK-15219 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yi Zhou > > We observed an interesting thing about broadcast Hash join( similar to Map > Join in Hive) when comparing the implementation by Hive on MR engine. The > blew query is a multi-way join operation based on 3 tables including > product_reviews, 2 run-time temporary result tables(fsr and fwr) from > ‘select’ query operation and also there is a two-way join(1 table and 1 > run-time temporary table) in both 'fsr' and 'fwr',which cause slower > performance than Hive on MR. We investigated the difference between Spark SQL > and Hive on MR engine and found that there are total of 5 map join tasks with > tuned map join parameters in Hive on MR but there are only 2 broadcast hash > join tasks in Spark SQL even if we set a larger threshold(e.g.,1GB) for > broadcast hash join. From our investigation, it seems that if there is > run-time temporary table in join operation in Spark SQL engine it will not > detect such table for enabling broadcast hash join optimization. 
> Core SQL snippet: > {code} > INSERT INTO TABLE q19_spark_sql_power_test_0_result > SELECT * > FROM > ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in > select clause is not allowed > SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS > ( > item_sk, > review_sentence, > sentiment, > sentiment_word > ) > FROM product_reviews pr, > ( > --store returns in week ending given date > SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty > FROM store_returns sr, > ( > -- within the week ending a given date > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', > '2004-12-20' ) > ) sr_dateFilter > WHERE sr.sr_returned_date_sk = d_date_sk > GROUP BY sr_item_sk --across all store and web channels > HAVING sr_item_qty > 0 > ) fsr, > ( > --web returns in week ending given date > SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty > FROM web_returns wr, > ( > -- within the week ending a given date > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', > '2004-12-20' ) > ) wr_dateFilter > WHERE wr.wr_returned_date_sk = d_date_sk > GROUP BY wr_item_sk --across all store and web channels > HAVING wr_item_qty > 0 > ) fwr > WHERE fsr.sr_item_sk = fwr.wr_item_sk > AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items > -- equivalent across all store and web channels (within a tolerance of +/- > 10%) > AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1 > )extractedSentiments > WHERE sentiment= 'NEG' --if there are any major negative reviews. 
> ORDER BY item_sk,review_sentence,sentiment,sentiment_word > ; > {code} > Physical Plan: > {code} > == Physical Plan == > InsertIntoHiveTable MetastoreRelation bigbench_3t_sparksql, > q19_spark_sql_run_query_0_result, None, Map(), false, false > +- ConvertToSafe >+- Sort [item_sk#537L ASC,review_sentence#538 ASC,sentiment#539 > ASC,sentiment_word#540 ASC], true, 0 > +- ConvertToUnsafe > +- Exchange rangepartitioning(item_sk#537L ASC,review_sentence#538 > ASC,sentiment#539 ASC,sentiment_word#540 ASC,200), None > +- ConvertToSafe >+- Project > [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540] > +- Filter (sentiment#539 = NEG) > +- !Generate > HiveGenericUDTF#io.bigdatabenchmark.v1.queries.q10.SentimentUDF(pr_item_sk#363L,pr_review_content#366), > false, false, > [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540] > +- ConvertToSafe >+- Project [pr_item_sk#363L,pr_review_content#366] > +- Filter (abs((cast((sr_item_qty#356L - > wr_item_qty#357L) as double) / (cast((sr_item_qty#356L + wr_item_qty#357L) as > double) / 2.0))) <= 0.1) > +- SortMergeJoin [sr_item_sk#369L], > [wr_item_sk#445L] > :- Sort
[jira] [Updated] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization
[ https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-15219: Description: We observed an interesting thing about broadcast Hash join( similar to Map Join in Hive) when comparing the implementation by Hive on MR engine. The blew query is a multi-way join operation based on 3 tables including product_reviews, 2 run-time temporary result tables(fsr and fwr) from ‘select’ query operation and also there is a two-way join(1 table and 1 run-time temporary table) in both 'fsr' and 'fwr',which cause slower performance than Hive on MR. We investigated the difference between Spark SQL and Hive on MR engine and found that there are total of 5 map join tasks with tuned map join parameters in Hive on MR but there are only 2 broadcast hash join tasks in Spark SQL even if we set a larger threshold(e.g.,1GB) for broadcast hash join. From our investigation, it seems that if there is run-time temporary table in join operation in Spark SQL engine it will not detect such table for enabling broadcast hash join optimization. 
Core SQL snippet: {code} INSERT INTO TABLE q19_spark_sql_power_test_0_result SELECT * FROM ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in select clause is not allowed SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS ( item_sk, review_sentence, sentiment, sentiment_word ) FROM product_reviews pr, ( --store returns in week ending given date SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty FROM store_returns sr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) sr_dateFilter WHERE sr.sr_returned_date_sk = d_date_sk GROUP BY sr_item_sk --across all store and web channels HAVING sr_item_qty > 0 ) fsr, ( --web returns in week ending given date SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty FROM web_returns wr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) wr_dateFilter WHERE wr.wr_returned_date_sk = d_date_sk GROUP BY wr_item_sk --across all store and web channels HAVING wr_item_qty > 0 ) fwr WHERE fsr.sr_item_sk = fwr.wr_item_sk AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items -- equivalent across all store and web channels (within a tolerance of +/- 10%) AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1 )extractedSentiments WHERE sentiment= 'NEG' --if there are any major negative reviews. 
ORDER BY item_sk,review_sentence,sentiment,sentiment_word ; {code} Physical Plan: {code} == Physical Plan == InsertIntoHiveTable MetastoreRelation bigbench_3t_sparksql, q19_spark_sql_run_query_0_result, None, Map(), false, false +- ConvertToSafe +- Sort [item_sk#537L ASC,review_sentence#538 ASC,sentiment#539 ASC,sentiment_word#540 ASC], true, 0 +- ConvertToUnsafe +- Exchange rangepartitioning(item_sk#537L ASC,review_sentence#538 ASC,sentiment#539 ASC,sentiment_word#540 ASC,200), None +- ConvertToSafe +- Project [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540] +- Filter (sentiment#539 = NEG) +- !Generate HiveGenericUDTF#io.bigdatabenchmark.v1.queries.q10.SentimentUDF(pr_item_sk#363L,pr_review_content#366), false, false, [item_sk#537L,review_sentence#538,sentiment#539,sentiment_word#540] +- ConvertToSafe +- Project [pr_item_sk#363L,pr_review_content#366] +- Filter (abs((cast((sr_item_qty#356L - wr_item_qty#357L) as double) / (cast((sr_item_qty#356L + wr_item_qty#357L) as double) / 2.0))) <= 0.1) +- SortMergeJoin [sr_item_sk#369L], [wr_item_sk#445L] :- Sort [sr_item_sk#369L ASC], false, 0 : +- Project [pr_item_sk#363L,sr_item_qty#356L,pr_review_content#366,sr_item_sk#369L] : +- SortMergeJoin [pr_item_sk#363L], [sr_item_sk#369L] ::- Sort [pr_item_sk#363L ASC], false, 0 :: +- TungstenExchange hashpartitioning(pr_item_sk#363L,200), None :: +- ConvertToUnsafe ::+- HiveTableScan [pr_item_sk#363L,pr_review_content#366], MetastoreRelation bigbench_3t_sparksql, product_reviews, Some(pr)
[jira] [Commented] (SPARK-15219) [Spark SQL] it don't support to detect runtime temporary table for enabling broadcast hash join optimization
[ https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276042#comment-15276042 ] Herman van Hovell commented on SPARK-15219: --- [~jameszhouyi] Could you also post the query plan? Use either {{explain extended ...}} in SQL or {{df.explain(true)}} using dataframes. > [Spark SQL] it don't support to detect runtime temporary table for enabling > broadcast hash join optimization > > > Key: SPARK-15219 > URL: https://issues.apache.org/jira/browse/SPARK-15219 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yi Zhou > > We observed an interesting thing about broadcast Hash join( similar to Map > Join in Hive) when comparing the implementation by Hive on MR engine. The > blew query is a multi-way join operation based on 3 tables including > product_reviews, 2 run-time temporary result tables(fsr and fwr) from > ‘select’ query operation and also there is a two-way join(1 table and 1 > run-time temporary table) in both 'fsr' and 'fwr',which cause slower > performance than Hive on MR. We investigated the difference between Spark SQL > and Hive on MR engine and found that there are total of 5 map join tasks with > tuned map join parameters in Hive on MR but there are only 2 broadcast hash > join tasks in Spark SQL even if we set a larger threshold(e.g.,1GB) for > broadcast hash join. From our investigation, it seems that if there is > run-time temporary table in join operation in Spark SQL engine it will not > detect such table for enabling broadcast hash join optimization. 
> Core SQL snippet: > {code} > INSERT INTO TABLE q19_spark_sql_power_test_0_result > SELECT * > FROM > ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in > select clause is not allowed > SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS > ( > item_sk, > review_sentence, > sentiment, > sentiment_word > ) > FROM product_reviews pr, > ( > --store returns in week ending given date > SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty > FROM store_returns sr, > ( > -- within the week ending a given date > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', > '2004-12-20' ) > ) sr_dateFilter > WHERE sr.sr_returned_date_sk = d_date_sk > GROUP BY sr_item_sk --across all store and web channels > HAVING sr_item_qty > 0 > ) fsr, > ( > --web returns in week ending given date > SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty > FROM web_returns wr, > ( > -- within the week ending a given date > SELECT d1.d_date_sk > FROM date_dim d1, date_dim d2 > WHERE d1.d_week_seq = d2.d_week_seq > AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', > '2004-12-20' ) > ) wr_dateFilter > WHERE wr.wr_returned_date_sk = d_date_sk > GROUP BY wr_item_sk --across all store and web channels > HAVING wr_item_qty > 0 > ) fwr > WHERE fsr.sr_item_sk = fwr.wr_item_sk > AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items > -- equivalent across all store and web channels (within a tolerance of +/- > 10%) > AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1 > )extractedSentiments > WHERE sentiment= 'NEG' --if there are any major negative reviews. > ORDER BY item_sk,review_sentence,sentiment,sentiment_word > ; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
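A plausible reading of the behavior reported above is that size-based join selection needs a size estimate for the build side, and a derived (runtime) subquery result like fsr or fwr may carry no statistics, so it can never pass the broadcast threshold check regardless of how large the threshold is set. A toy model of that decision (hypothetical names; Spark's actual planner logic differs in detail):

```python
# Toy size-based join planning: broadcast hash join is chosen only when a
# side's estimated size is KNOWN and below the threshold. A derived
# relation without statistics defaults to "unknown", so the planner falls
# back to sort-merge join even if the actual result is tiny.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # e.g. 10 MB

class Relation:
    def __init__(self, name, size_bytes=None):
        self.name = name
        self.size_bytes = size_bytes  # None = no statistics available

def choose_join(left, right):
    for side in (left, right):
        if side.size_bytes is not None and side.size_bytes <= BROADCAST_THRESHOLD:
            return f"BroadcastHashJoin(build={side.name})"
    return "SortMergeJoin"

base = Relation("date_dim", size_bytes=2 * 1024 * 1024)   # catalog stats known
derived = Relation("fsr", size_bytes=None)                 # aggregate subquery
big = Relation("store_returns", size_bytes=50 * 1024**3)

print(choose_join(big, base))      # -> BroadcastHashJoin(build=date_dim)
print(choose_join(derived, big))   # -> SortMergeJoin
```

This matches the observation in the report: joins against catalog tables with statistics become broadcast hash joins, while joins involving the runtime temporary results stay as SortMergeJoin in the physical plan.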
[jira] [Updated] (SPARK-15211) Select features column from LibSVMRelation causes failure
[ https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15211: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Description: It will cause failure when trying to load data with LibSVMRelation and select features column: {code} val df2 = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") df: org.apache.spark.sql.DataFrame = [label: double, features: vector] scala> df2.select("features").show java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of class java.lang.Byte) createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true)) ... {code} was: It will cause failure when trying to load data with LibSVMRelation and select features column: {code} val df2 = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") df: org.apache.spark.sql.DataFrame = [label: double, features: vector] scala> df2.select("features").show java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of class java.lang.Byte) createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true)) ... 
{code} > Select features column from LibSVMRelation causes failure > - > > Key: SPARK-15211 > URL: https://issues.apache.org/jira/browse/SPARK-15211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > It will cause failure when trying to load data with LibSVMRelation and select > features column: > {code} > val df2 = > spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> df2.select("features").show > java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of > class java.lang.Byte) > createexternalrow(if (isnull(input[0, vector])) null else newInstance(class > org.apache.spark.mllib.linalg.VectorUDT).deserialize, > StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true)) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15211) Select features column from LibSVMRelation causes failure
[ https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-15211. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12986 [https://github.com/apache/spark/pull/12986] > Select features column from LibSVMRelation causes failure > - > > Key: SPARK-15211 > URL: https://issues.apache.org/jira/browse/SPARK-15211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > It will cause failure when trying to load data with LibSVMRelation and select > features column: > {code} > val df2 = > spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> df2.select("features").show > java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of > class java.lang.Byte) > createexternalrow(if (isnull(input[0, vector])) null else newInstance(class > org.apache.spark.mllib.linalg.VectorUDT).deserialize, > StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true)) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15211) Select features column from LibSVMRelation causes failure
[ https://issues.apache.org/jira/browse/SPARK-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15211: --- Assignee: Liang-Chi Hsieh > Select features column from LibSVMRelation causes failure > - > > Key: SPARK-15211 > URL: https://issues.apache.org/jira/browse/SPARK-15211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > It will cause failure when trying to load data with LibSVMRelation and select > features column: > {code} > val df2 = > spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> df2.select("features").show > java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of > class java.lang.Byte) > createexternalrow(if (isnull(input[0, vector])) null else newInstance(class > org.apache.spark.mllib.linalg.VectorUDT).deserialize, > StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true)) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265805#comment-15265805 ] Sandeep Singh edited comment on SPARK-928 at 5/9/16 7:02 AM: - [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe Kryo and our current implementation, and then we can have a spark.kryo.useUnsafe flag as Matei has mentioned. {code:title=Benchmarking results|borderStyle=solid}
Benchmark Kryo Unsafe vs safe Serialization:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
basicTypes: Int unsafe:true                          160 /  178         98.5          10.1       1.0X
basicTypes: Long unsafe:true                         210 /  218         74.9          13.4       0.8X
basicTypes: Float unsafe:true                        203 /  213         77.5          12.9       0.8X
basicTypes: Double unsafe:true                       226 /  235         69.5          14.4       0.7X
Array: Int unsafe:true                              1087 / 1101         14.5          69.1       0.1X
Array: Long unsafe:true                             2758 / 2844          5.7         175.4       0.1X
Array: Float unsafe:true                            1511 / 1552         10.4          96.1       0.1X
Array: Double unsafe:true                           2942 / 2972          5.3         187.0       0.1X
Map of string->Double unsafe:true                   2645 / 2739          5.9         168.2       0.1X
basicTypes: Int unsafe:false                         211 /  218         74.7          13.4       0.8X
basicTypes: Long unsafe:false                        247 /  253         63.6          15.7       0.6X
basicTypes: Float unsafe:false                       211 /  216         74.5          13.4       0.8X
basicTypes: Double unsafe:false                      227 /  233         69.2          14.4       0.7X
Array: Int unsafe:false                             3012 / 3032          5.2         191.5       0.1X
Array: Long unsafe:false                            4463 / 4515          3.5         283.8       0.0X
Array: Float unsafe:false                           2788 / 2868          5.6         177.2       0.1X
Array: Double unsafe:false                          3558 / 3752          4.4         226.2       0.0X
Map of string->Double unsafe:false                  2806 / 2933          5.6         178.4       0.1X
{code} You can find the code for benchmarking here (https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2); it is very ugly right now, but I will improve it (add more benchmarks) before creating a PR. was (Author: techaddict): [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe kryo and our current impl. and then we can have a spark.kryo.useUnsafe flag as Matei has mentioned. 
{code:title=Benchmarking results|borderStyle=solid}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
Benchmark Kryo Unsafe vs safe Serialization:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
basicTypes: Int unsafe:false                           2 /    4       8988.0           0.1       1.0X
basicTypes: Long unsafe:false                          1 /    1      13981.3           0.1       1.6X
basicTypes: Float unsafe:false                         1 /    1      14460.6           0.1       1.6X
basicTypes: Double unsafe:false                        1 /    1      15876.9           0.1       1.8X
Array: Int unsafe:false                               33 /   44        474.8           2.1       0.1X
Array: Long unsafe:false                              18 /   25        888.6           1.1       0.1X
Array: Float unsafe:false                             10 /   16       1627.4           0.6       0.2X
Array: Double unsafe:false                            10 /   13       1523.1           0.7       0.2X
Map of string->Double unsafe:false                   413 /  447         38.1          26.3       0.0X
basicTypes: Int unsafe:true                            1 /    1      16402.6           0.1       1.8X
basicTypes: Long unsafe:true                           1 /    1      19732.1           0.1       2.2X
basicTypes: Float unsafe:true                          1 /    1      19752.9           0.1       2.2X
basicTypes: Double unsafe:true                         1 /    1      23111.4           0.0       2.6X
Array: Int unsafe:true                                 7 /    8       2239.9           0.4       0.2X
Array: Long unsafe:true                                8 /    9       2000.1           0.5       0.2X
Array: Float unsafe:true
[jira] [Created] (SPARK-15220) Add hyperlink to "running application" and "completed application"
Mao, Wei created SPARK-15220: Summary: Add hyperlink to "running application" and "completed application" Key: SPARK-15220 URL: https://issues.apache.org/jira/browse/SPARK-15220 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Mao, Wei Priority: Minor Add hyperlinks to "running applications" and "completed applications" so users can jump to the application tables directly. In my environment, I set up 1000+ workers and it is painful to scroll down past the worker list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15219) [Spark SQL] It doesn't detect runtime temporary tables for enabling broadcast hash join optimization
[ https://issues.apache.org/jira/browse/SPARK-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-15219: Description: We observed an interesting behavior of broadcast hash join (similar to map join in Hive) when comparing against the Hive on MR engine. The query below is a multi-way join over 3 tables: product_reviews and 2 run-time temporary result tables (fsr and fwr) produced by SELECT subqueries; there is also a two-way join (1 table and 1 run-time temporary table) inside both 'fsr' and 'fwr', which causes slower performance than Hive on MR. We investigated the difference between Spark SQL and the Hive on MR engine and found that Hive on MR runs a total of 5 map join tasks with tuned map join parameters, while Spark SQL runs only 2 broadcast hash join tasks even when we set a larger threshold (e.g., 1 GB) for broadcast hash join. From our investigation, it seems that when a run-time temporary table participates in a join, the Spark SQL engine does not detect it as a candidate for the broadcast hash join optimization. 
Core SQL snippet: {code} INSERT INTO TABLE q19_spark_sql_power_test_0_result SELECT * FROM ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in select clause is not allowed SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS ( item_sk, review_sentence, sentiment, sentiment_word ) FROM product_reviews pr, ( --store returns in week ending given date SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty FROM store_returns sr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) sr_dateFilter WHERE sr.sr_returned_date_sk = d_date_sk GROUP BY sr_item_sk --across all store and web channels HAVING sr_item_qty > 0 ) fsr, ( --web returns in week ending given date SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty FROM web_returns wr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) wr_dateFilter WHERE wr.wr_returned_date_sk = d_date_sk GROUP BY wr_item_sk --across all store and web channels HAVING wr_item_qty > 0 ) fwr WHERE fsr.sr_item_sk = fwr.wr_item_sk AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items -- equivalent across all store and web channels (within a tolerance of +/- 10%) AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1 )extractedSentiments WHERE sentiment= 'NEG' --if there are any major negative reviews. ORDER BY item_sk,review_sentence,sentiment,sentiment_word ; {code} was: We observed an interesting thing about broadcast Hash join( similar to Map Join in Hive) when comparing the implementation by Hive on MR engine. 
The blew query is a multi-way join operation based on 3 tables including product_reviews, 2 run-time temporary result tables(fsr and fwr) from ‘select’ query operation and also there is a two-way join(1 table and 1 run-time temporary table) in both 'fsr' and 'fwr'. We investigated the difference between Spark SQL and Hive on MR engine and found that there are total of 5 map join tasks with tuned map join parameters in Hive on MR but there are only 2 broadcast hash join tasks in Spark SQL even if we set a larger threshold(e.g.,1GB) for broadcast hash join. From our investigation, it seems that if there is run-time temporary table in join operation in Spark SQL engine it will not detect such table for enabling broadcast hash join optimization. Core SQL snippet: {code} INSERT INTO TABLE q19_spark_sql_power_test_0_result SELECT * FROM ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in select clause is not allowed SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS ( item_sk, review_sentence, sentiment, sentiment_word ) FROM product_reviews pr, ( --store returns in week ending given date SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty FROM store_returns sr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) sr_dateFilter WHERE sr.sr_returned_date_sk = d_date_sk GROUP BY sr_item_sk --across all store and web channels HAVING sr_item_qty > 0 ) fsr, ( --web returns in week ending given date SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty FROM web_returns wr, ( -- within the
[jira] [Created] (SPARK-15219) [Spark SQL] It doesn't detect runtime temporary tables for enabling broadcast hash join optimization
Yi Zhou created SPARK-15219: --- Summary: [Spark SQL] It doesn't detect runtime temporary tables for enabling broadcast hash join optimization Key: SPARK-15219 URL: https://issues.apache.org/jira/browse/SPARK-15219 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yi Zhou We observed an interesting behavior of broadcast hash join (similar to map join in Hive) when comparing against the Hive on MR engine. The query below is a multi-way join over 3 tables: product_reviews and 2 run-time temporary result tables (fsr and fwr) produced by SELECT subqueries; there is also a two-way join (1 table and 1 run-time temporary table) inside both 'fsr' and 'fwr'. We investigated the difference between Spark SQL and the Hive on MR engine and found that Hive on MR runs a total of 5 map join tasks with tuned map join parameters, while Spark SQL runs only 2 broadcast hash join tasks even when we set a larger threshold (e.g., 1 GB) for broadcast hash join. From our investigation, it seems that when a run-time temporary table participates in a join, the Spark SQL engine does not detect it as a candidate for the broadcast hash join optimization. 
Core SQL snippet: {code} INSERT INTO TABLE q19_spark_sql_power_test_0_result SELECT * FROM ( --wrap in additional FROM(), because Sorting/distribute by with UDTF in select clause is not allowed SELECT extract_sentiment(pr.pr_item_sk, pr.pr_review_content) AS ( item_sk, review_sentence, sentiment, sentiment_word ) FROM product_reviews pr, ( --store returns in week ending given date SELECT sr_item_sk, SUM(sr_return_quantity) sr_item_qty FROM store_returns sr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) sr_dateFilter WHERE sr.sr_returned_date_sk = d_date_sk GROUP BY sr_item_sk --across all store and web channels HAVING sr_item_qty > 0 ) fsr, ( --web returns in week ending given date SELECT wr_item_sk, SUM(wr_return_quantity) wr_item_qty FROM web_returns wr, ( -- within the week ending a given date SELECT d1.d_date_sk FROM date_dim d1, date_dim d2 WHERE d1.d_week_seq = d2.d_week_seq AND d2.d_date IN ( '2004-03-8' ,'2004-08-02' ,'2004-11-15', '2004-12-20' ) ) wr_dateFilter WHERE wr.wr_returned_date_sk = d_date_sk GROUP BY wr_item_sk --across all store and web channels HAVING wr_item_qty > 0 ) fwr WHERE fsr.sr_item_sk = fwr.wr_item_sk AND pr.pr_item_sk = fsr.sr_item_sk --extract product_reviews for found items -- equivalent across all store and web channels (within a tolerance of +/- 10%) AND abs( (sr_item_qty-wr_item_qty)/ ((sr_item_qty+wr_item_qty)/2)) <= 0.1 )extractedSentiments WHERE sentiment= 'NEG' --if there are any major negative reviews. ORDER BY item_sk,review_sentence,sentiment,sentiment_word ; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'
[ https://issues.apache.org/jira/browse/SPARK-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Cecile updated SPARK-15218: Description: {noformat}
mkdir /tmp/qwe:rtz
cd /tmp/qwe:rtz
wget http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat} Returns "Error: Could not find or load main class org.apache.spark.launcher.Main". That would not be such an issue if the Mesos executor did not put colons in the generated paths. It means that without hacking (defining a relative SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job container... Best regards, Adam. was: {noformat}
mkdir /tmp/qwe:rtz
cd /tmp/qwe:rtz
wget http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat} Returns "Error: Could not find or load main class org.apache.spark.launcher.Main". That would not be such an issue if the Mesos executor did not put colons in the generated paths. It means that without hacking (defining a relative SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job container... Best regards, Adam. 
> Error: Could not find or load main class org.apache.spark.launcher.Main when > run from a directory containing colon ':' > -- > > Key: SPARK-15218 > URL: https://issues.apache.org/jira/browse/SPARK-15218 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Adam Cecile > Labels: mesos > > {noformat} > mkdir /tmp/qwe:rtz > cd /tmp/qwe:rtz > wget > http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz > tar xvzf spark-1.6.1-bin-without-hadoop.tgz > cd spark-1.6.1-bin-without-hadoop/ > bin/spark-submit > {noformat} > Returns "Error: Could not find or load main class > org.apache.spark.launcher.Main". > That would not be such an issue if the Mesos executor did not put colons in > the generated paths. It means that without hacking (defining a relative > SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job > container... > Best regards, Adam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'
Adam Cecile created SPARK-15218: --- Summary: Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':' Key: SPARK-15218 URL: https://issues.apache.org/jira/browse/SPARK-15218 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell, Spark Submit Affects Versions: 1.6.1 Reporter: Adam Cecile {noformat}
mkdir /tmp/qwe:rtz
cd /tmp/qwe:rtz
wget http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz
tar xvzf spark-1.6.1-bin-without-hadoop.tgz
cd spark-1.6.1-bin-without-hadoop/
bin/spark-submit
{noformat} Returns "Error: Could not find or load main class org.apache.spark.launcher.Main". That would not be such an issue if the Mesos executor did not put colons in the generated paths. It means that without hacking (defining a relative SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job container... Best regards, Adam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
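The failure above follows from how the JVM command line is assembled: on POSIX systems ':' is the classpath separator, so an installation path that itself contains a colon gets split into bogus classpath entries. A minimal illustrative Java sketch (not Spark's actual launcher code; the jar path below is hypothetical):

```java
import java.util.Arrays;

// Illustration only: on POSIX systems the JVM splits the -cp argument on ':'
// (java.io.File.pathSeparator), so an install path containing a colon is
// broken into two bogus classpath entries.
public class ColonClasspath {
    public static void main(String[] args) {
        // Hypothetical classpath built from an install dir named "/tmp/qwe:rtz".
        String classpath = "/tmp/qwe:rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar";

        // ":" is File.pathSeparator on Linux/macOS.
        String[] entries = classpath.split(":");
        System.out.println(Arrays.toString(entries));
        // -> [/tmp/qwe, rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar]
        // Neither entry exists on disk, hence the JVM reports:
        // "Could not find or load main class org.apache.spark.launcher.Main"
    }
}
```

This also explains the workaround mentioned in the report: a relative SPARK_HOME avoids the colon-containing absolute prefix, so the classpath entries stay intact.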
[jira] [Commented] (SPARK-14057) SQL timestamps do not respect time zones
[ https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275987#comment-15275987 ] Vijay Parmar commented on SPARK-14057: -- I have a few suggestions to make here after looking into the issue and consulting Google and other sources: 1. We can make use of the built-in java.time package, which is available in Java 8 and higher versions (http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/Date.java). In short, a new instance is created to represent the adjusted time zone, whereas the java.util.Date class ignores the time zone in most cases. We could try implementing this package. 2. This code snippet can be handy (note that ZoneId requires region/city identifiers such as "Europe/London"): {code}
ZoneId zoneLondon = ZoneId.of("Europe/London");
ZonedDateTime nowLondon = ZonedDateTime.now(zoneLondon);
ZoneId zoneSingapore = ZoneId.of("Asia/Singapore");
ZonedDateTime nowSingapore = nowLondon.withZoneSameInstant(zoneSingapore);
ZonedDateTime nowUTC = nowLondon.withZoneSameInstant(ZoneOffset.UTC);
{code} 3. We also need to look into the SQL-side code to understand how the time is captured and stored once it is received from that end. I will keep looking into the issue and will update you; meanwhile, I await your comment(s) on my suggestions. > SQL timestamps do not respect time zones > - > > Key: SPARK-14057 > URL: https://issues.apache.org/jira/browse/SPARK-14057 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Andrew Davidson >Priority: Minor > > We have timestamp data. The timestamp data is UTC; however, when we load the > data into Spark data frames, the system assumes the timestamps are in the > local time zone. This causes problems for our data scientists. Often they > pull data from our data center onto their local Macs. The data centers run > UTC; their computers are typically in PST or EST. 
> This causes a lot of errors in their analysis. > It is possible to hack around this problem. > A complete description of this issue can be found in the following mail > message: > https://www.mail-archive.com/user@spark.apache.org/msg48121.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
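To make the java.time suggestion in the comment above concrete, here is a small runnable sketch (the zone IDs and sample instant are illustrative, not taken from Spark): a single UTC instant rendered in different zones remains the same point in time, which is the behavior the reporter expects from timestamp handling.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// One instant, rendered in different zones without changing the point in time.
public class TimestampZones {
    public static void main(String[] args) {
        // A fixed UTC instant, standing in for a timestamp read from the data center.
        Instant instant = Instant.parse("2016-03-22T10:00:00Z");

        ZonedDateTime utc = instant.atZone(ZoneOffset.UTC);
        ZonedDateTime pacific = instant.atZone(ZoneId.of("America/Los_Angeles"));

        // Same instant, different wall-clock rendering (PDT is UTC-7 on this date).
        System.out.println(utc);      // 2016-03-22T10:00Z
        System.out.println(pacific);  // 2016-03-22T03:00-07:00[America/Los_Angeles]
        System.out.println(utc.toInstant().equals(pacific.toInstant())); // true
    }
}
```

The data scientists' problem is the reverse direction: loading UTC data on a PST machine silently reinterprets the wall-clock values, whereas `withZoneSameInstant`-style conversion keeps the underlying instant fixed.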