[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description:
*Problem*
Nowadays, users have both SSDs and HDDs. SSDs have great performance, but small capacity. HDDs have good capacity, but are 2-3x slower than SSDs. How can we get the best of both?
*Solution*
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when the SSDs' usable space is less than some threshold, it gets blocks from HDDs. In our implementation, we actually go further: we support building a hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
*Performance*
1. In the best case, our solution performs the same as all SSDs.
2. In the worst case, when all data are spilled to HDDs, there is no performance regression.
3. Compared with all HDDs, the hierarchy store improves performance by more than *_1.86x_* (it could be higher; the CPU reaches its bottleneck in our test environment).
4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because we support both RDD cache and shuffle with no extra inter-process communication.
*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all remaining storage forms the last layer.
2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}
Then restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer.
> Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai
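The threshold policy described in the Usage section above (fill the fastest tier until its usable space drops below the configured threshold, then fall through to the next tier) can be sketched roughly as follows. This is an illustrative sketch only, not Spark's actual block-allocation code; the tier names, thresholds, and `usable_space` helper are assumptions for the example.

```python
# Illustrative sketch of the threshold-based tier selection described in the
# issue; not Spark's actual implementation. Tiers are ordered fastest-first,
# and each non-final tier has a reserved-space threshold (e.g. "nvm 50GB").
def pick_tier(tiers, usable_space):
    """tiers: list of (name, threshold_bytes); threshold None = last layer.
    usable_space: function mapping a tier name to its usable bytes."""
    for name, threshold in tiers:
        if threshold is None:
            return name  # last layer: always accepts new blocks
        if usable_space(name) >= threshold:
            return name  # enough headroom left on this faster tier
    return tiers[-1][0]

GB = 1024 ** 3
tiers = [("nvm", 50 * GB), ("ssd", 80 * GB), ("hdd", None)]

# nvm still has 60GB usable, above its 50GB threshold -> allocate from nvm
space = {"nvm": 60 * GB, "ssd": 200 * GB, "hdd": 4000 * GB}
print(pick_tier(tiers, space.get))  # nvm

# nvm below 50GB and ssd below 80GB -> fall through to the last layer
space = {"nvm": 10 * GB, "ssd": 70 * GB, "hdd": 4000 * GB}
print(pick_tier(tiers, space.get))  # hdd
```

This matches the configured example: blocks go to nvm until less than 50GB is usable there, then to ssd until less than 80GB is usable, then to the remaining disks.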
[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna
[ https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048260#comment-15048260 ] yahsuan, chang commented on SPARK-12231: the following code doesn't throw an exception in my environment
{code}
import pyspark
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.range(10)
df1 = df.withColumn('a', df['id'] * 2)
df1.write.parquet('./data')
df2 = sqlc.read.parquet('./data')
print df2.dropna().count()
{code}
I use spark-1.5.2-bin-hadoop2.6/bin/spark-submit to execute the code. Thanks for your quick reply.
> Failed to generate predicate Error when using dropna > > > Key: SPARK-12231 > URL: https://issues.apache.org/jira/browse/SPARK-12231 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2 > Environment: python version: 2.7.9 > os: ubuntu 14.04 >Reporter: yahsuan, chang > > code to reproduce the error:
> # write.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> # read.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: a#0L
> ...
> If the data are written without partitionBy, the error does not happen
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048286#comment-15048286 ] Reynold Xin commented on SPARK-12225: - Why is this necessary? It seems difficult to design an API for this, and it is trivial for users to do it themselves. > Support adding or replacing multiple columns at once in DataFrame API > - > > Key: SPARK-12225 > URL: https://issues.apache.org/jira/browse/SPARK-12225 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.2 >Reporter: Sun Rui > > Currently, the withColumn() method of DataFrame supports adding or replacing only a single column. It would be convenient to support adding or replacing multiple columns at once. > Also, withColumnRenamed() supports renaming only a single column. It would also be convenient to support renaming multiple columns at once.
[jira] [Commented] (SPARK-12225) Support adding or replacing multiple columns at once in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048432#comment-15048432 ] Sun Rui commented on SPARK-12225: - [~r...@databricks.com] This comes from supporting the mutate() function in SparkR. mutate() is a function in the popular dplyr data-manipulation package, which allows adding and replacing multiple columns in a single call. The implementation can't be as simple as calling withColumn() multiple times, because each call to withColumn() returns a new DataFrame, while the remaining columns to be added or replaced are Columns referring to the old DataFrame. So the implementation is not trivial; see https://github.com/apache/spark/pull/10220. A multiple-column version of withColumn would simplify SparkR's implementation. Another motivation for this JIRA is that we just added drop() for multiple columns, so from an API-parity point of view it would be good to have a multiple-column version of withColumn. > Support adding or replacing multiple columns at once in DataFrame API > - > > Key: SPARK-12225 > URL: https://issues.apache.org/jira/browse/SPARK-12225 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.2 >Reporter: Sun Rui
[jira] [Closed] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.
[ https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon closed SPARK-11621. Resolution: Fixed Fix Version/s: 1.6.0 > ORC filter pushdown not working properly after new unhandled filter interface. > -- > > Key: SPARK-11621 > URL: https://issues.apache.org/jira/browse/SPARK-11621 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hyukjin Kwon > Fix For: 1.6.0 > > > After the new interface for getting rid of predicate-pushed-down filters that are already processed at the datasource level (https://github.com/apache/spark/pull/9399), filters are no longer pushed down for ORC. > This is because at {{DataSourceStrategy}}, all the filters are treated as unhandled filters. > Also, since ORC cannot filter exactly record by record but instead produces rough results, the filters for ORC should not go to unhandled filters.
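The property the issue mentions, that ORC cannot filter "fully record by record" but only produces rough results, is characteristic of chunk-level pushdown: the source can skip whole stripes using statistics, but surviving chunks may still contain non-matching rows, so the engine must re-apply the exact predicate. The sketch below is a generic illustration of that split, not Spark's actual `DataSourceStrategy` or ORC reader code; all names in it are assumptions.

```python
# Generic sketch of coarse predicate pushdown plus residual evaluation (not
# Spark code). The "source" skips whole chunks using min/max statistics, the
# way ORC can skip stripes, and the "engine" then filters row by row.
def scan_with_pushdown(chunks, lo, hi):
    """chunks: list of row-value lists; keep rows with lo <= row <= hi."""
    kept = []
    for chunk in chunks:
        # Coarse, pushed-down filtering: skip any chunk whose [min, max]
        # range cannot possibly overlap the predicate.
        if min(chunk) > hi or max(chunk) < lo:
            continue
        # Residual, exact filtering: a surviving chunk may still hold rows
        # outside the range, so each row is checked individually.
        kept.extend(r for r in chunk if lo <= r <= hi)
    return kept

chunks = [[1, 2, 3], [10, 20, 30], [100, 200]]
print(scan_with_pushdown(chunks, 15, 120))  # [20, 30, 100]
```

The first chunk is skipped entirely by the coarse check, while the other two are scanned and trimmed row by row; pushdown saves I/O even though it cannot produce the exact answer on its own.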
[jira] [Commented] (SPARK-8338) Ganglia fails to start
[ https://issues.apache.org/jira/browse/SPARK-8338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048294#comment-15048294 ] Michael Reilly commented on SPARK-8338: --- This issue appears to have resurfaced in 1.5.2. I'm creating a cluster in EC2 with spark-1.5.2-bin-hadoop2.6. I issue a command like this:
./spark-ec2 --key-pair=spark-sentiment --identity-file=/home/ec2-user/xxx.pem --spark-version=1.5.2 --region=eu-west-1 --zone=eu-west-1b --user-data=java8.sh launch spark-cluster-test --vpc-id=myvpc --subnet-id=mysubnet
(java8.sh just installs Java 8 on the cluster nodes)
After about ten minutes it finishes. The last few lines of output are:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-52-30-193-79.eu-west-1.compute.amazonaws.com
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-52-30-193-79.eu-west-1.compute.amazonaws.com closed.
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED]
[timing] ganglia setup: 00h 00m 02s
Connection to ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com closed.
Spark standalone cluster started at http://ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com:8080
Ganglia started at http://ec2-52-31-xxx-xxx.eu-west-1.compute.amazonaws.com:5080/ganglia
Done!
> Ganglia fails to start > -- > > Key: SPARK-8338 > URL: https://issues.apache.org/jira/browse/SPARK-8338 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.4.0 >Reporter: Vladimir Vladimirov >Assignee: Vladimir Vladimirov >Priority: Minor > Fix For: 1.5.0 > > > Exception > {code} > Starting httpd: httpd: Syntax error on line 154 of > /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so > into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object > file: No such file or directory > {code}
[jira] [Assigned] (SPARK-12238) s/Advanced sources/External Sources in docs.
[ https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12238: Assignee: (was: Apache Spark) > s/Advanced sources/External Sources in docs. > > > Key: SPARK-12238 > URL: https://issues.apache.org/jira/browse/SPARK-12238 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Reporter: Prashant Sharma > > While reading the docs, I felt that "external sources" (instead of "Advanced sources") seemed more appropriate, as they belong outside the streaming core project.
[jira] [Commented] (SPARK-12238) s/Advanced sources/External Sources in docs.
[ https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048365#comment-15048365 ] Apache Spark commented on SPARK-12238: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/10223 > s/Advanced sources/External Sources in docs. > > > Key: SPARK-12238 > URL: https://issues.apache.org/jira/browse/SPARK-12238 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Reporter: Prashant Sharma
[jira] [Created] (SPARK-12237) Unsupported message RpcMessage causes message retries
Jacek Laskowski created SPARK-12237: --- Summary: Unsupported message RpcMessage causes message retries Key: SPARK-12237 URL: https://issues.apache.org/jira/browse/SPARK-12237 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Jacek Laskowski When an unsupported message is sent to an endpoint, Spark throws {{org.apache.spark.SparkException}} and retries sending the message. It should *not* since the message is unsupported. {code} WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] in 1 attempts org.apache.spark.SparkException: Unsupported message RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275) from localhost:51137 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) at org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] in 2 attempts org.apache.spark.SparkException: Unsupported message RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a) from localhost:51137 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) at org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) WARN NettyRpcEndpointRef: Error sending message [message = RetrieveSparkProps] in 3 attempts org.apache.spark.SparkException: Unsupported message RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7) from localhost:51137 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) at org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Exception in thread "main" java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672) at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: org.apache.spark.SparkException: Error sending message [message = RetrieveSparkProps] at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) at
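The fix Jacek suggests amounts to classifying errors before retrying: a transient send failure may be worth another attempt, but a permanent condition like an unsupported message type can never succeed and should fail fast. A generic sketch of that idea follows; it is illustrative only, not Spark's actual `askWithRetry` implementation, and all class and function names in it are assumptions.

```python
# Generic retry sketch (not Spark's askWithRetry). Transient errors are
# retried up to max_attempts; a permanent error such as "unsupported
# message" is re-raised immediately instead of burning through attempts.
class UnsupportedMessage(Exception):
    pass  # permanent: the endpoint will never accept this message

class TransientSendError(Exception):
    pass  # e.g. a dropped connection: retrying may succeed

def ask_with_retry(send, message, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except UnsupportedMessage:
            raise  # fail fast: no retry can fix an unsupported message
        except TransientSendError:
            if attempt == max_attempts:
                raise  # transient error persisted through all attempts

attempts = []
def send(message):
    attempts.append(message)
    raise UnsupportedMessage(message)

try:
    ask_with_retry(send, "RetrieveSparkProps")
except UnsupportedMessage:
    pass
print(len(attempts))  # 1: the unsupported message was sent only once
```

Under this scheme the "Error sending message ... in 1/2/3 attempts" pattern from the log above would not occur for an unsupported message; only one attempt is made.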
[jira] [Closed] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu closed SPARK-12233. --- Resolution: Not A Problem > Cannot specify a data frame column during join > -- > > Key: SPARK-12233 > URL: https://issues.apache.org/jira/browse/SPARK-12233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Fengdong Yu >Priority: Minor > > {code} > sqlContext.udf.register("lowercase", (s: String) =>{ > if (null == s) "" else s.toLowerCase > }) > > sqlContext.udf.register("substr", (s: String) =>{ > if (null == s) "" > else { > val index = s.indexOf("@") > if (index < 0) s else s.toLowerCase.substring(index + 1)} > }) > > sqlContext.read.orc("/data/test/test.data") > .registerTempTable("testTable") > > val extracted = > sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, > lowercase(family_name) AS > family_name, > substr(email_address) AS domain, > lowercase(email_address) AS emailaddr, > experience > > FROM testTable > WHERE email_address != '' > """) > .distinct > > val count = > extracted.groupBy("given_name", "family_name", "domain") >.count > > count.where(count("count") > 1) > .drop(count("count")) > .join(extracted, Seq("given_name", "family_name", "domain")) > {code} > {color:red} .select(count("given_name"), count("family_name"), > extracted("emailaddr")) {color} > Red Font should be: > {color:red} select("given_name", "family_name", "emailaddr") {color} > > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 > missing from > given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 > in operator !Project [given_name#522,family_name#523,emailaddr#525]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55) > at $iwC$$iwC$$iwC$$iwC.(:57) > at $iwC$$iwC$$iwC.(:59) > at $iwC$$iwC.(:61) > at $iwC.(:63) > at (:65) > at .(:69) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at
[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated SPARK-12233:
Comment: was deleted (was: I am using : select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join should close now. Thanks)

> Cannot specify a data frame column during join
> ----------------------------------------------
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048312#comment-15048312 ] Alok Singh commented on SPARK-12172:
+1 Sun Rui. Having the RDD API in SparkR:
1) lets people do low-level operations if they want;
2) lets users run custom mappers and reducers (for example, one may decide to write a custom ML routine in SparkR using DataFrames and RDDs rather than modifying MLlib);
3) lets someone who wants to enhance SparkR for their internal usage or for their customers (where it probably doesn't make sense for the SparkR community to add the feature) build a custom R package on top of the SparkR package.
Just some thoughts. Thanks, Alok

> Consider removing SparkR internal RDD APIs
> ------------------------------------------
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Reporter: Felix Cheung

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048283#comment-15048283 ] Apache Spark commented on SPARK-11965: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10222 > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Update the user guide for RFormula to cover feature interactions
[jira] [Assigned] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11965: Assignee: (was: Apache Spark) > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Update the user guide for RFormula to cover feature interactions
[jira] [Assigned] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11965: Assignee: Apache Spark > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Update the user guide for RFormula to cover feature interactions
[jira] [Assigned] (SPARK-12238) s/Advanced sources/External Sources in docs.
[ https://issues.apache.org/jira/browse/SPARK-12238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12238: Assignee: Apache Spark > s/Advanced sources/External Sources in docs. > > > Key: SPARK-12238 > URL: https://issues.apache.org/jira/browse/SPARK-12238 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Reporter: Prashant Sharma >Assignee: Apache Spark > > While reading the docs, I felt that "external sources" (instead of "advanced sources") read more appropriately, since these sources live outside the streaming core project.
[jira] [Created] (SPARK-12239) SparkR - Not distributing SparkR module in YARN
Sebastian YEPES FERNANDEZ created SPARK-12239: - Summary: SparkR - Not distributing SparkR module in YARN Key: SPARK-12239 URL: https://issues.apache.org/jira/browse/SPARK-12239 Project: Spark Issue Type: Bug Components: SparkR, YARN Affects Versions: 1.5.2, 1.5.3 Reporter: Sebastian YEPES FERNANDEZ Priority: Critical
Hello,
I am trying to use SparkR in a YARN environment and have run into the following problem: everything works correctly when using bin/sparkR, but the same jobs fail when SparkR is launched directly from R.
I have tracked down the cause: when SparkR is launched through R, the "SparkR" module is not distributed to the worker nodes. I have tried to work around the issue with the setting "spark.yarn.dist.archives", but that does not help either: it deploys the file/extracted folder under a name ending in ".zip", while the workers look for a folder named "sparkr".
Is there currently any way to make this work?
{code}
# spark-defaults.conf
spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip

# R
library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
sc <- sparkR.init(appName="SparkR", master="yarn-client", sparkEnvir=list(spark.executor.instances="1"))
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)

15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
{code}
Container stderr:
{code}
15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.7 KB, free 530.0 MB)
15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R': No such file or directory
15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.net.SocketTimeoutException: Accept timed out
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
{code}
Worker node that ran the container:
{code}
# ls -la /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
total 71M
drwx--x--- 3 yarn hadoop 4.0K Dec 9 09:04 .
drwx--x--- 7 yarn hadoop 4.0K Dec 9 09:04 ..
-rw-r--r-- 1 yarn hadoop  110 Dec 9 09:03 container_tokens
-rw-r--r-- 1 yarn hadoop   12 Dec 9 09:03 .container_tokens.crc
-rwx------ 1 yarn hadoop  736 Dec 9 09:03 default_container_executor_session.sh
-rw-r--r-- 1 yarn hadoop   16 Dec 9 09:03 .default_container_executor_session.sh.crc
-rwx------ 1 yarn hadoop  790 Dec 9 09:03 default_container_executor.sh
-rw-r--r-- 1 yarn hadoop   16 Dec 9 09:03 .default_container_executor.sh.crc
-rwxr-xr-x 1 yarn hadoop  61K Dec 9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
-rwxr-xr-x 1 yarn hadoop 317K Dec 9 09:04 kafka-clients-0.8.2.2.jar
-rwx------ 1 yarn hadoop 6.0K Dec 9 09:03 launch_container.sh
-rw-r--r-- 1 yarn hadoop   56 Dec 9 09:03 .launch_container.sh.crc
-rwxr-xr-x 1 yarn hadoop 2.2M Dec 9 09:04 spark-cassandra-connector_2.10-1.5.0-M3.jar
-rwxr-xr-x 1 yarn hadoop 7.1M Dec 9 09:04 spark-csv-assembly-1.3.0.jar
lrwxrwxrwx 1 yarn hadoop  119 Dec 9 09:03 __spark__.jar -> /hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
lrwxrwxrwx 1 yarn hadoop   84 Dec 9 09:03 sparkr.zip -> /hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip
-rwxr-xr-x 1 yarn hadoop 1.8M Dec 9 09:04 spark-streaming_2.10-1.5.3-SNAPSHOT.jar
-rwxr-xr-x 1 yarn hadoop  11M Dec 9 09:04 spark-streaming-kafka-assembly_2.10-1.5.3-SNAPSHOT.jar
-rwxr-xr-x 1 yarn hadoop  48M Dec 9 09:04 sparkts-0.1.0-SNAPSHOT-jar-with-dependencies.jar
drwx--x--- 2 yarn hadoop   46 Dec 9 09:04 tmp
{code}
*Working case:*
{code}
# sparkR --master yarn-client --num-executors 1
df <- createDataFrame(sqlContext, faithful)
head(df)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
{code}
Worker node that ran the container:
{code}
# ls -la
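For the naming mismatch described above (the archive localizes as "sparkr.zip" while the workers look for a "sparkr" folder), a workaround worth trying (an assumption on my part, not verified against 1.5.x) is YARN's fragment-alias syntax for distributed-cache entries: a `#name` suffix on the archive URI asks YARN to link the localized archive under `name` in the container working directory.

```
# spark-defaults.conf -- hedged sketch: the '#sparkr' fragment asks YARN to
# expose the extracted archive as 'sparkr' instead of 'sparkr.zip'
spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip#sparkr
```

If the fragment survives into the localized link name, the container would then contain the `sparkr/SparkR/worker/daemon.R` path the executor looks for.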
[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048256#comment-15048256 ] Xiao Li commented on SPARK-12218:
That is what I did in my environment.
{code}
val df1 = Seq(1, 2, 3).map(i => (i, i, i.toString)).toDF("int1", "int2", "str")
df1.where("not (int1=0 and int2=0)").explain(true)
df1.where("not(int1=0) or not(int2=0)").explain(true)
{code}
Their physical plans are exactly the same.
{code}
== Parsed Logical Plan ==
'Filter NOT (('int1 = 0) && ('int2 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Analyzed Logical Plan ==
int1: int, int2: int, str: string
Filter NOT ((int1#3 = 0) && (int2#4 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Optimized Logical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Physical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalTableScan [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]
{code}
{code}
== Parsed Logical Plan ==
'Filter (NOT ('int1 = 0) || NOT ('int2 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Analyzed Logical Plan ==
int1: int, int2: int, str: string
Filter (NOT (int1#3 = 0) || NOT (int2#4 = 0))
+- Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Optimized Logical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalRelation [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]

== Physical Plan ==
Project [_1#0 AS int1#3,_2#1 AS int2#4,_3#2 AS str#5]
+- Filter (NOT (_1#0 = 0) || NOT (_2#1 = 0))
   +- LocalTableScan [_1#0,_2#1,_3#2], [[1,1,1],[2,2,2],[3,3,3]]
{code}

> Boolean logic in sql does not work: "not (A and B)" is not the same as "(not A) or (not B)"
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-12218
> URL: https://issues.apache.org/jira/browse/SPARK-12218
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: Irakli Machabeli
> Priority: Blocker
>
> Two identical queries produce different results
> In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").count()
> Out[2]: 18
> In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff')))").count()
> Out[3]: 28
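For reference, the rewrite visible in the optimized plans above is De Morgan's law: NOT (a AND b) is equivalent to (NOT a) OR (NOT b). A minimal, Spark-free check of the equivalence over all boolean inputs, which supports Xiao Li's point that if the two queries in the report really disagree, the cause lies elsewhere rather than in the rewrite rule itself:

```python
# Exhaustively verify De Morgan's law, the rewrite the optimizer applied
# above: NOT (a AND b)  ==  (NOT a) OR (NOT b)
from itertools import product

def lhs(a, b):
    return not (a and b)

def rhs(a, b):
    return (not a) or (not b)

mismatches = [(a, b) for a, b in product([False, True], repeat=2)
              if lhs(a, b) != rhs(a, b)]
print(mismatches)  # -> [] : the two predicates agree on every input
```

The equivalence also holds under SQL's three-valued logic with NULLs (Kleene logic), so NULL handling does not invalidate the rewrite either.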
[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048262#comment-15048262 ] Fengdong Yu commented on SPARK-12233:
I am using:
select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr"))
to hit the error. But you are right, I cannot use the old DF to select the element after the join. Should close now. Thanks

> Cannot specify a data frame column during join
> ----------------------------------------------
>
> Key: SPARK-12233
> URL: https://issues.apache.org/jira/browse/SPARK-12233
[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048264#comment-15048264 ] Fengdong Yu commented on SPARK-12233: - I am using : select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join should close now. Thanks > Cannot specify a data frame column during join > -- > > Key: SPARK-12233 > URL: https://issues.apache.org/jira/browse/SPARK-12233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Fengdong Yu >Priority: Minor > > {code} > sqlContext.udf.register("lowercase", (s: String) =>{ > if (null == s) "" else s.toLowerCase > }) > > sqlContext.udf.register("substr", (s: String) =>{ > if (null == s) "" > else { > val index = s.indexOf("@") > if (index < 0) s else s.toLowerCase.substring(index + 1)} > }) > > sqlContext.read.orc("/data/test/test.data") > .registerTempTable("testTable") > > val extracted = > sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, > lowercase(family_name) AS > family_name, > substr(email_address) AS domain, > lowercase(email_address) AS emailaddr, > experience > > FROM testTable > WHERE email_address != '' > """) > .distinct > > val count = > extracted.groupBy("given_name", "family_name", "domain") >.count > > count.where(count("count") > 1) > .drop(count("count")) > .join(extracted, Seq("given_name", "family_name", "domain")) > {code} > {color:red} .select(count("given_name"), count("family_name"), > extracted("emailaddr")) {color} > Red Font should be: > {color:red} select("given_name", "family_name", "emailaddr") {color} > > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 > missing from > given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 > in operator !Project [given_name#522,family_name#523,emailaddr#525]; > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55) > at $iwC$$iwC$$iwC$$iwC.(:57) > at $iwC$$iwC$$iwC.(:59) > at $iwC$$iwC.(:61) > at $iwC.(:63) > at (:65) > at .(:69) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at >
[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated SPARK-12233: Comment: was deleted (was: I am using : select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join should close now. Thanks) > Cannot specify a data frame column during join > -- > > Key: SPARK-12233 > URL: https://issues.apache.org/jira/browse/SPARK-12233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Fengdong Yu >Priority: Minor > > {code} > sqlContext.udf.register("lowercase", (s: String) =>{ > if (null == s) "" else s.toLowerCase > }) > > sqlContext.udf.register("substr", (s: String) =>{ > if (null == s) "" > else { > val index = s.indexOf("@") > if (index < 0) s else s.toLowerCase.substring(index + 1)} > }) > > sqlContext.read.orc("/data/test/test.data") > .registerTempTable("testTable") > > val extracted = > sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, > lowercase(family_name) AS > family_name, > substr(email_address) AS domain, > lowercase(email_address) AS emailaddr, > experience > > FROM testTable > WHERE email_address != '' > """) > .distinct > > val count = > extracted.groupBy("given_name", "family_name", "domain") >.count > > count.where(count("count") > 1) > .drop(count("count")) > .join(extracted, Seq("given_name", "family_name", "domain")) > {code} > {color:red} .select(count("given_name"), count("family_name"), > extracted("emailaddr")) {color} > Red Font should be: > {color:red} select("given_name", "family_name", "emailaddr") {color} > > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 > missing from > given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 > in operator !Project [given_name#522,family_name#523,emailaddr#525]; > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55) > at $iwC$$iwC$$iwC$$iwC.(:57) > at $iwC$$iwC$$iwC.(:59) > at $iwC$$iwC.(:61) > at $iwC.(:63) > at (:65) > at .(:69) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at >
[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048265#comment-15048265 ] Fengdong Yu commented on SPARK-12233: - I am using : select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join should close now. Thanks > Cannot specify a data frame column during join > -- > > Key: SPARK-12233 > URL: https://issues.apache.org/jira/browse/SPARK-12233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Fengdong Yu >Priority: Minor > > {code} > sqlContext.udf.register("lowercase", (s: String) =>{ > if (null == s) "" else s.toLowerCase > }) > > sqlContext.udf.register("substr", (s: String) =>{ > if (null == s) "" > else { > val index = s.indexOf("@") > if (index < 0) s else s.toLowerCase.substring(index + 1)} > }) > > sqlContext.read.orc("/data/test/test.data") > .registerTempTable("testTable") > > val extracted = > sqlContext.sql(""" SELECT lowercase(given_name) AS given_name, > lowercase(family_name) AS > family_name, > substr(email_address) AS domain, > lowercase(email_address) AS emailaddr, > experience > > FROM testTable > WHERE email_address != '' > """) > .distinct > > val count = > extracted.groupBy("given_name", "family_name", "domain") >.count > > count.where(count("count") > 1) > .drop(count("count")) > .join(extracted, Seq("given_name", "family_name", "domain")) > {code} > {color:red} .select(count("given_name"), count("family_name"), > extracted("emailaddr")) {color} > Red Font should be: > {color:red} select("given_name", "family_name", "emailaddr") {color} > > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) emailaddr#525 > missing from > given_name#522,domain#524,url#517,family_name#523,emailaddr#532,experience#490 > in operator !Project [given_name#522,family_name#523,emailaddr#525]; > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:55) > at $iwC$$iwC$$iwC$$iwC.(:57) > at $iwC$$iwC$$iwC.(:59) > at $iwC$$iwC.(:61) > at $iwC.(:63) > at (:65) > at .(:69) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at >
[jira] [Issue Comment Deleted] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated SPARK-12233: Comment: was deleted (was: I am using : select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join should close now. Thanks)
[jira] [Commented] (SPARK-12232) Consider exporting read.table in R
[ https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048261#comment-15048261 ] Sun Rui commented on SPARK-12232: - +1 [~yanboliang] Since we already have table(), we can change it to an S4 generic function, change the original table() function into an S4 method on the "jobj" class, and add a new table() S4 method on "DataFrame". If backward compatibility is not a concern, we may consider renaming table() to something like tableToDF(). But since table() is not so common in the R base package, I find no strong motivation for it. > Consider exporting read.table in R > -- > > Key: SPARK-12232 > URL: https://issues.apache.org/jira/browse/SPARK-12232 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > > Since we have read.df, read.json, read.parquet (some in pending PRs), we have > table() and we should consider having read.table() for consistency and > R-likeness. > However, this conflicts with utils::read.table, which returns an R data.frame. > It seems neither table() nor read.table() is desirable in this case. > table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html > read.table: > https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12233) Cannot specify a data frame column during join
[ https://issues.apache.org/jira/browse/SPARK-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048263#comment-15048263 ] Fengdong Yu commented on SPARK-12233: - I am using: select(df4("int"), df4("str1"), df4("str2"), df3("emailaddr")) to hit the error. But you are right. I cannot use the old DF to select the element after the join. This should be closed now. Thanks
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: *Problem*: Nowadays, users have both SSDs and HDDs. SSDs have great performance, but low capacity. HDDs have good capacity, but their performance is x2-x3 lower than SSDs'. How can we get the best of both? *Solution*: Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it gets blocks from the SSDs first, and when the SSDs' usable space is less than some threshold, it gets blocks from the HDDs. In our implementation, we actually go further: we support building a hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.). *Performance*: 1. In the best case, our solution performs the same as all SSDs. 2. In the worst case, e.g. when all data are spilled to HDDs, there is no performance regression. 3. Compared with all HDDs, the hierarchy store improves performance by more than x1.86 (it could be higher; the CPU reaches a bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store is still x1.3 faster, because we support both RDD cache and shuffle, with no extra inter-process communication. was: Problem: Nowadays, users have both SSDs and HDDs. SSDs have great performance, but capacity is low. HDDs have good capacity, but x2-x3 lower than SSDs. How can we get both good? Solution: Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when SSD’s useable space is less than some threshold, getting blocks from HDDs. In our implementation, we actually go further. We support a way to build any level hierarchy store access all storage medias (NVM, SSD, HDD etc.). Performance: 1. At the best case, our solution performs the same as all SSDs. At the worst case, like all data are spilled to HDDs, no performance regression. 2. Compared with all HDDs, hierarchy store improves more than x1.86 (it could be higher, CPU reaches bottleneck in our test environment). 3. Compared with Tachyon, our hierarchy store still x1.3 faster. Because we support both RDD cache and shuffle and no extra inter process communication. > Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai
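The fall-through allocation policy described in the Solution section above (allocate from the fastest layer while its usable space stays above a reserved threshold, otherwise drop to the next layer) can be sketched roughly as follows. This is an illustrative sketch only, not the actual SPARK-12196 patch; the layer names and thresholds are taken from the "nvm 50GB, ssd 80GB" example configuration used elsewhere in this thread.

```python
# Illustrative sketch of the hierarchy-store allocation policy described in
# SPARK-12196: pick the first layer whose usable space is still at or above
# its reserved threshold; the remaining dirs form the fallback last layer.
# This is NOT the actual patch, just the policy as described in the issue.

def pick_layer(layers, usable_space):
    """layers: ordered list of (name, threshold_bytes) pairs.
    usable_space: dict mapping layer name -> free bytes right now."""
    for name, threshold in layers:
        if usable_space.get(name, 0) >= threshold:
            return name
    return "rest"  # all remaining local dirs form the last layer

GB = 1024 ** 3
layers = [("nvm", 50 * GB), ("ssd", 80 * GB)]  # from "nvm 50GB, ssd 80GB"

# nvm still has 60GB free -> allocate from nvm
print(pick_layer(layers, {"nvm": 60 * GB, "ssd": 200 * GB}))  # nvm
# nvm below its 50GB threshold -> fall through to ssd
print(pick_layer(layers, {"nvm": 10 * GB, "ssd": 200 * GB}))  # ssd
# both below threshold -> last layer (plain HDDs)
print(pick_layer(layers, {"nvm": 10 * GB, "ssd": 20 * GB}))   # rest
```

This matches the behavior quoted in the Usage notes: "When nvm's usable space is less than 50GB, it starts to allocate from ssd."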
[jira] [Created] (SPARK-12238) s/Advanced sources/External Sources in docs.
Prashant Sharma created SPARK-12238: --- Summary: s/Advanced sources/External Sources in docs. Key: SPARK-12238 URL: https://issues.apache.org/jira/browse/SPARK-12238 Project: Spark Issue Type: Improvement Components: Documentation, Streaming Reporter: Prashant Sharma While reading the docs, I felt that "external sources" (instead of "Advanced sources") read more appropriately, as these sources belong outside the streaming core project.
[jira] [Commented] (SPARK-12179) Spark SQL get different result with the same code
[ https://issues.apache.org/jira/browse/SPARK-12179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048496#comment-15048496 ] Tao Li commented on SPARK-12179: I tried using Spark's internal Row_Number() UDF, but it still has the problem.
{code}
select $DATE as date, 'main' as type, host, rfhost, rfpv
from (
  select Row_Number() OVER (partition by host ORDER BY host, rfpv desc) r, host, rfhost, rfpv
  from (
    select delhost(t0.host) as host, delhost(t0.rfhost) as rfhost, sum(t0.rfpv) as rfpv
    from (
      select h.host as host, i.rfhost as rfhost, i.rfpv as rfpv
      from (
        select parse_url(ur,'HOST') as host, count(1) as pv
        from custom.web_sogourank_orc_zlib
        where logdate>=$starttime and logdate<=$endtime
        group by parse_url(ur,'HOST')
        order by pv desc limit 1
      ) h
      left outer join (
        select parse_url(ur,'HOST') as host, parse_url(rf,'HOST') as rfhost, count(*) as rfpv
        from custom.web_sogourank_orc_zlib
        where logdate>=$starttime and logdate<=$endtime
        group by parse_url(ur,'HOST'), parse_url(rf,'HOST')
      ) i on h.host = i.host
    ) t0
    group by delhost(t0.host), delhost(t0.rfhost)
    distribute by host sort by host, rfpv desc
  ) t1
) t2
where r<=10
{code}
> Spark SQL get different result with the same code > - > > Key: SPARK-12179 > URL: https://issues.apache.org/jira/browse/SPARK-12179 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, > 1.5.2, 1.5.3 > Environment: hadoop version: 2.5.0-cdh5.3.2 > spark version: 1.5.3 > run mode: yarn-client >Reporter: Tao Li >Priority: Critical > > I run the SQL in yarn-client mode, but get a different result each time. > As you can see in the example, I get a different shuffle write with the same > shuffle read in two jobs running the same code. > Some of my Spark apps run well, but some always hit this problem, and I have hit it > on Spark 1.3, 1.4 and 1.5. > Can you give me some suggestions about the possible causes, or how I can figure > out the problem? > 1. First Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.8 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54934 > 2. Second Run > Details for Stage 9 (Attempt 0) > Total Time Across All Tasks: 5.6 min > Shuffle Read: 24.4 MB / 205399 > Shuffle Write: 6.8 MB / 54905
[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Comment: was deleted (was: Sorry, I have to delete this PR because of my wrong github operation, I will send a new one ASAP.)
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: *Problem* Nowadays, users have both SSDs and HDDs. SSDs have great performance, but small capacity. HDDs have good capacity, but their performance is x2-x3 lower than SSDs'. How can we get the best of both? *Solution* Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it gets blocks from the SSDs first, and when the SSDs' usable space is less than some threshold, it gets blocks from the HDDs. In our implementation, we actually go further: we support building a hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.). *Performance* 1. In the best case, our solution performs the same as all SSDs. 2. In the worst case, e.g. when all data are spilled to HDDs, there is no performance regression. 3. Compared with all HDDs, the hierarchy store improves performance by more than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because we support both RDD cache and shuffle, with no extra inter-process communication. *Usage* 1. Configure spark.hierarchyStore. {code} spark.hierarchyStore='nvm 50GB,ssd 80GB' {code} This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all the rest form the last layer. 2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs. {code} spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer. was: *Problem* Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, but x2-x3 lower than SSDs. How can we get both good? *Solution* Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when SSD’s useable space is less than some threshold, getting blocks from HDDs. In our implementation, we actually go further. We support a way to build any level hierarchy store access all storage medias (NVM, SSD, HDD etc.). *Performance* 1. At the best case, our solution performs the same as all SSDs. 2. At the worst case, like all data are spilled to HDDs, no performance regression. 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it could be higher, CPU reaches bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we support both RDD cache and shuffle and no extra inter process communication. *Usage* 1. Configure spark.hierarchyStore. {code} -Dspark.hierarchyStore=nvm 50GB,ssd 80GB {code} It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all the rest form the last layer. 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir or yarn.nodemanager.local-dirs. {code} -Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After then, restart your Spark application, it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer.
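For illustration, a "nvm 50GB,ssd 80GB"-style spark.hierarchyStore value like the one configured above is a comma-separated list of <name> <threshold> pairs. Below is a minimal sketch of how such a string could be parsed, assuming that format; the real patch may parse it differently, and `parse_hierarchy_store` is a hypothetical helper, not an actual Spark API.

```python
import re

# Hypothetical parser for a "nvm 50GB, ssd 80GB"-style setting as described
# in SPARK-12196. The real patch may parse this differently; this sketch only
# illustrates the layer-name/threshold structure of the value.
_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}

def parse_hierarchy_store(value):
    """Return an ordered list of (layer_name, threshold_in_bytes)."""
    layers = []
    for entry in value.split(","):
        name, size = entry.strip().split()
        m = re.fullmatch(r"(\d+)(KB|MB|GB|TB)", size.upper())
        if not m:
            raise ValueError(f"bad size in hierarchy store entry: {entry!r}")
        layers.append((name, int(m.group(1)) * _UNITS[m.group(2)]))
    return layers

print(parse_hierarchy_store("nvm 50GB, ssd 80GB"))
# [('nvm', 53687091200), ('ssd', 85899345920)]
```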
[jira] [Commented] (SPARK-11630) ClosureCleaner incorrectly warns for class based closures
[ https://issues.apache.org/jira/browse/SPARK-11630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048518#comment-15048518 ] W.H. commented on SPARK-11630: -- +1 > ClosureCleaner incorrectly warns for class based closures > - > > Key: SPARK-11630 > URL: https://issues.apache.org/jira/browse/SPARK-11630 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Frens Jan Rumph >Priority: Trivial > > Spark's `ClosureCleaner` utility seems to check whether a function is an > anonymous function: [ClosureCleaner.scala on line > 49|https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L49] > If not, it warns the user. > However, I'm using some class based functions. Something along the lines of: > {code} > trait FromUnreadRow[T] extends (UnreadRow => T) with Serializable > object ToPlainRow extends FromUnreadRow[PlainRow] { > override def apply(row: UnreadRow): PlainRow = ??? > } > {code} > This works just fine. I can't really see that the warning is actually useful > in this case. I appreciate checking for common 'mistakes', but in my case a > user might be alarmed unnecessarily. > Anything that can be done about this? Anything I can do?
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: *Problem* Nowadays, users have both SSDs and HDDs. SSDs have great performance, but small capacity. HDDs have good capacity, but their performance is x2-x3 lower than SSDs'. How can we get the best of both? *Solution* Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it gets blocks from the SSDs first, and when the SSDs' usable space is less than some threshold, it gets blocks from the HDDs. In our implementation, we actually go further: we support building a hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.). *Performance* 1. In the best case, our solution performs the same as all SSDs. 2. In the worst case, e.g. when all data are spilled to HDDs, there is no performance regression. 3. Compared with all HDDs, the hierarchy store improves performance by more than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because we support both RDD cache and shuffle, with no extra inter-process communication. *Usage* 1. In spark-defaults.conf, configure spark.hierarchyStore. {code} -Dspark.hierarchyStore=nvm 50GB,ssd 80GB {code} This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all the rest form the last layer. 2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs. {code} -Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer. 
was: *Problem* Nowadays, users have both SSDs and HDDs. SSDs have great performance, but capacity is small. HDDs have good capacity, but x2-x3 lower than SSDs. How can we get both good? *Solution* Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when SSD’s useable space is less than some threshold, getting blocks from HDDs. In our implementation, we actually go further. We support a way to build any level hierarchy store access all storage medias (NVM, SSD, HDD etc.). *Performance* 1. At the best case, our solution performs the same as all SSDs. 2. At the worst case, like all data are spilled to HDDs, no performance regression. 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it could be higher, CPU reaches bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we support both RDD cache and shuffle and no extra inter process communication. *Usage* 1. In spark-default.xml, configure spark.hierarchyStore. {code} spark.hierarchyStore nvm 50GB, ssd 80GB {code} It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all the rest form the last layer. 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir or yarn.nodemanager.local-dirs. {code} -Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After then, restart your Spark application, it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer. 
[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636 ] Adam Roberts edited comment on SPARK-9858 at 12/9/15 1:07 PM: -- Thanks for the prompt reply. rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer, within the asKeyValueIterator method. I experimented with the Exchange class; the same problems are observed using the SparkSqlSerializer, suggesting the UnsafeRowSerializer is probably fine. I agree with your second comment: I think the code within org.apache.spark.unsafe.Platform is OK, or we'd be hitting problems elsewhere. It'll be useful to determine how the values in the assertions can be determined programmatically. I think the partitioning algorithm itself is working as expected, but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged, and I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns). Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes. I'm wondering if there's an extra factor we should take into account when determining the indices regardless of platform.
> Introduce an ExchangeCoordinator to estimate the number of post-shuffle > partitions. > --- > > Key: SPARK-9858 > URL: https://issues.apache.org/jira/browse/SPARK-9858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
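A simplified sketch of the size-based coalescing under discussion may help frame the question about hard-coded assertion indices. This is an assumption about the algorithm's general shape (the function name is hypothetical), not the actual ExchangeCoordinator code: merge adjacent pre-shuffle partitions until adding the next one would exceed a target size.

```python
# Hypothetical sketch of size-based post-shuffle partition coalescing
# (not the actual ExchangeCoordinator code): walk the per-partition
# shuffle byte sizes and start a new output partition whenever adding
# the next pre-shuffle partition would exceed the target size.

def estimate_partition_start_indices(partition_bytes, target_size):
    start_indices = [0]
    current = partition_bytes[0] if partition_bytes else 0
    for i in range(1, len(partition_bytes)):
        if current + partition_bytes[i] > target_size:
            start_indices.append(i)   # close the current output partition
            current = partition_bytes[i]
        else:
            current += partition_bytes[i]
    return start_indices

# With a 100-byte target, slightly larger per-partition sizes on a
# different platform can shift the boundaries, which would break
# assertions on hard-coded indices even though the algorithm is correct.
print(estimate_partition_start_indices([40, 40, 40, 40], 100))  # [0, 2]
print(estimate_partition_start_indices([55, 55, 55, 55], 100))  # [0, 1, 2, 3]
```

This is consistent with the observation above: if stages simply produce more bytes on some platforms, the computed indices move while rdd.count and the DataFrame contents stay identical.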
[jira] [Commented] (SPARK-12237) Unsupported message RpcMessage causes message retries
[ https://issues.apache.org/jira/browse/SPARK-12237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048651#comment-15048651 ] Nan Zhu commented on SPARK-12237: - may I ask how you found this issue? It seems that Master received a RetrieveSparkProps message, which is supposed to be transmitted only between Executor and Driver > Unsupported message RpcMessage causes message retries > - > > Key: SPARK-12237 > URL: https://issues.apache.org/jira/browse/SPARK-12237 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski > > When an unsupported message is sent to an endpoint, Spark throws > {{org.apache.spark.SparkException}} and retries sending the message. It > should *not*, since the message is unsupported. > {code} > WARN NettyRpcEndpointRef: Error sending message [message = > RetrieveSparkProps] in 1 attempts > org.apache.spark.SparkException: Unsupported message > RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@c0a6275) > from localhost:51137 > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) > at > org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > WARN NettyRpcEndpointRef: Error sending message [message = > RetrieveSparkProps] 
in 2 attempts > org.apache.spark.SparkException: Unsupported message > RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@73a76a5a) > from localhost:51137 > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) > at > org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > WARN NettyRpcEndpointRef: Error sending message [message = > RetrieveSparkProps] in 3 attempts > org.apache.spark.SparkException: Unsupported message > RpcMessage(localhost:51137,RetrieveSparkProps,org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@670bfda7) > from localhost:51137 > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:104) > at > org.apache.spark.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:373) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151) > at >
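The fix being requested above — failing fast on a permanently unsupported message instead of retrying it — can be illustrated with a small sketch. This is hypothetical code, not Spark's RPC implementation: a retry loop should only retry failures that can plausibly succeed on a later attempt.

```python
# Illustrative sketch (not Spark's actual RPC code): a send-with-retry
# loop should distinguish transient failures, which are worth retrying,
# from permanent ones such as an endpoint rejecting an unsupported
# message type, which can never succeed on retry.

class UnsupportedMessageError(Exception):
    """Permanent failure: the endpoint does not handle this message type."""

class TransientNetworkError(Exception):
    """Temporary failure: worth retrying."""

def send_with_retry(send, message, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message)
        except UnsupportedMessageError:
            # Retrying cannot succeed; fail fast instead of burning attempts.
            raise
        except TransientNetworkError:
            if attempt == max_attempts:
                raise

attempts = []

def rejecting_endpoint(message):
    # Model an endpoint (like the Master above) that does not handle
    # this message type at all.
    attempts.append(message)
    raise UnsupportedMessageError("Unsupported message %r" % message)

try:
    send_with_retry(rejecting_endpoint, "RetrieveSparkProps")
except UnsupportedMessageError:
    pass

# The permanent error surfaced after a single attempt, not three.
print(len(attempts))  # 1
```

In the log quoted above, the same unsupported-message exception is instead raised and retried three times before the caller finally fails.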
[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048574#comment-15048574 ] yucai commented on SPARK-12196: --- Sorry, I had to delete this PR because of a wrong GitHub operation; I will send a new one ASAP. > Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai > > *Problem* > Nowadays, users have both SSDs and HDDs. > SSDs have great performance, but their capacity is small. HDDs have good capacity, > but are 2-3x slower than SSDs. > How can we get the best of both? > *Solution* > Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup > storage. > When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it > gets blocks from SSDs first, and when the SSDs' usable space is less than some > threshold, it gets blocks from HDDs. > In our implementation, we actually go further: we support building a > hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.). > *Performance* > 1. In the best case, our solution performs the same as all-SSD. > 2. In the worst case, e.g. when all data are spilled to HDDs, there is no performance > regression. > 3. Compared with all-HDD, the hierarchy store improves performance by more than *_1.86x_* (it > could be higher; the CPU reached the bottleneck in our test environment). > 4. Compared with Tachyon, our hierarchy store is still *_1.3x_* faster, because > we support both RDD cache and shuffle, with no extra inter-process > communication. > *Usage* > 1. In spark-defaults.conf, configure spark.hierarchyStore. 
> {code} > spark.hierarchyStore nvm 50GB, ssd 80GB > {code} > This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all > the rest form the last layer. > 2. Configure the "nvm" and "ssd" locations in the local dirs, i.e. spark.local.dir > or yarn.nodemanager.local-dirs. > {code} > spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others > {code} > Then restart your Spark application; it will allocate blocks from nvm > first. > When nvm's usable space is less than 50GB, it starts to allocate from ssd. > When ssd's usable space is less than 80GB, it starts to allocate from the > last layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
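The threshold rule in the description above can be sketched as follows. This is an illustrative model with hypothetical names, not the actual patch: walk the configured layers in order and pick the first whose usable space is still above its reserved threshold; the final catch-all layer has no threshold.

```python
# Hypothetical sketch of the tier-selection rule described in the issue
# (not the actual patch): given layers parsed from
# "spark.hierarchyStore = nvm 50GB, ssd 80GB", allocate a block from the
# first layer whose usable space exceeds its threshold, falling back to
# the last (threshold-free) layer.

GB = 1 << 30

# (name, reserved threshold in bytes); the final layer collects all
# remaining local dirs and is always eligible.
LAYERS = [("nvm", 50 * GB), ("ssd", 80 * GB), ("hdd", 0)]

def pick_layer(usable_space):
    """usable_space maps layer name -> free bytes across that layer's dirs."""
    for name, threshold in LAYERS:
        if usable_space.get(name, 0) > threshold:
            return name
    return LAYERS[-1][0]  # fall back to the last layer

# nvm still has headroom: blocks go to nvm.
print(pick_layer({"nvm": 60 * GB, "ssd": 200 * GB, "hdd": 900 * GB}))  # nvm
# nvm below 50GB: fall through to ssd.
print(pick_layer({"nvm": 40 * GB, "ssd": 200 * GB, "hdd": 900 * GB}))  # ssd
# nvm and ssd both below threshold: use the last layer.
print(pick_layer({"nvm": 40 * GB, "ssd": 70 * GB, "hdd": 900 * GB}))   # hdd
```

This also shows why the worst case degrades gracefully: once the faster layers drop below their thresholds, allocation simply proceeds from the last layer, matching all-HDD behavior.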
[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Comment: was deleted (was: Sorry, I have to delete this PR because of my wrong github operation, I will send a new one ASAP.)
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: only the *Usage* section changed; it now reads:

*Usage*
1. Configure spark.hierarchyStore.
{code}
-Dspark.hierarchyStore=nvm 50GB,ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all remaining directories form the last layer.
2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}
After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space drops below 50GB, it starts allocating from ssd. When ssd's usable space drops below 80GB, it starts allocating from the last layer.
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048609#comment-15048609 ] Pere Ferrera Bertran commented on SPARK-3461: - What's the status of this? We provided a hack on top of the current API for Java users, in case it's useful for people hitting this issue: http://www.datasalt.com/2015/12/a-scalable-groupbykey-and-secondary-sort-for-java-spark/

> Support external groupByKey using repartitionAndSortWithinPartitions
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Sandy Ryza
> Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external group-by operator pretty easily. We'd just have to wrap the existing iterator exposed by SPARK-2978 with a lookahead iterator that detects the group boundaries. Also, we'd have to override the cache() operator to cache the parent RDD so that if this object is cached it doesn't wind through the iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write jobs in terms of groupByKey.
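The lookahead-iterator idea described above — once records within a partition are sorted by key, group boundaries can be detected while streaming, without buffering a whole group's partition in memory — can be illustrated in plain Python. This is a sketch of the concept only, not Spark code; `streaming_group_by_key` and the sample partition are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def streaming_group_by_key(sorted_records):
    """Yield (key, values-iterator) pairs from an iterable of (key, value)
    records that is already sorted by key within a partition.

    groupby does exactly the lookahead described in the issue: it watches
    for the key to change and closes the current group, so nothing beyond
    the current record needs to be held in memory.
    """
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        yield key, (v for _, v in group)

# A partition after repartitionAndSortWithinPartitions-style sorting:
partition = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
for key, values in streaming_group_by_key(partition):
    print(key, list(values))
# a [1, 2]
# b [3]
# c [4, 5]
```

Note the usual `groupby` caveat applies: each group's values must be consumed before advancing to the next key, which matches the streaming access pattern the issue proposes.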
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: only the *Usage* section changed; it now reads:

*Usage*
1. In spark-defaults.conf, configure spark.hierarchyStore.
{code}
spark.hierarchyStore nvm 50GB, ssd 80GB
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all remaining directories form the last layer.
2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs.
{code}
-Dspark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}
After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space drops below 50GB, it starts allocating from ssd. When ssd's usable space drops below 80GB, it starts allocating from the last layer.
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: only the *Usage* section changed; it now reads:

*Usage*
1. Configure spark.storage.hierarchyStore.
{code}
spark.hierarchyStore='nvm 50GB,ssd 80GB'
{code}
This builds a 3-layer hierarchy store: the 1st layer is "nvm", the 2nd is "ssd", and all remaining directories form the last layer.
2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir or yarn.nodemanager.local-dirs.
{code}
spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
{code}
After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space drops below 50GB, it starts allocating from ssd. When ssd's usable space drops below 80GB, it starts allocating from the last layer.
[jira] [Commented] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli commented on SPARK-12218: -- I'm afraid I don't really know what that means, "plan by explain(true)" Shall I type it in repl? [ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047928#comment-15047928 ] Xiao Li commented on SPARK-12218: - Could you provide the plan by explain(true)? [~imachabeli] Thanks! "(not A) or (not B)" PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").count() not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff')))").count() -- This message was sent by Atlassian JIRA (v6.3.4#6332) > Boolean logic in sql does not work "not (A and B)" is not the same as "(not > A) or (not B)" > > > Key: SPARK-12218 > URL: https://issues.apache.org/jira/browse/SPARK-12218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Irakli Machabeli >Priority: Blocker > > Two identical queries produce different results > In [2]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( > PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff'))").count() > Out[2]: 18 > In [3]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and ( > not(PaymentsReceived=0) or not (ExplicitRoll in ('PreviouslyPaidOff', > 'PreviouslyChargedOff')))").count() > Out[3]: 28 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
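The two WHERE clauses in the report are De Morgan rewrites of each other, so their counts must match; 18 vs. 28 is the bug. The expected equivalence can be checked in pure Python, with hypothetical in-memory rows standing in for the prp_enh1 parquet data (no Spark needed):

```python
# Hypothetical rows standing in for the prp_enh1 parquet data.
rows = [
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 10.0, "ExplicitRoll": "PreviouslyPaidOff"},
    {"LoanID": 62231, "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
    {"LoanID": 1,     "PaymentsReceived": 0.0,  "ExplicitRoll": "Current"},
]

ROLLS = ("PreviouslyPaidOff", "PreviouslyChargedOff")

def not_a_and_b(r):
    # LoanID=62231 and not (A and B)
    return r["LoanID"] == 62231 and not (
        r["PaymentsReceived"] == 0 and r["ExplicitRoll"] in ROLLS)

def not_a_or_not_b(r):
    # LoanID=62231 and ((not A) or (not B)) -- De Morgan expansion of the above
    return r["LoanID"] == 62231 and (
        not r["PaymentsReceived"] == 0 or r["ExplicitRoll"] not in ROLLS)

count1 = sum(1 for r in rows if not_a_and_b(r))
count2 = sum(1 for r in rows if not_a_or_not_b(r))
assert count1 == count2  # SPARK-12218: Spark 1.5.2 returned different counts
```

Since the Python semantics agree, the discrepancy has to come from how Spark SQL plans or evaluates the un-expanded NOT, which is why the explain(True) output was requested.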
[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636 ] Adam Roberts commented on SPARK-9858: - Thanks for the prompt reply. rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer within the asKeyValueIterator method. I experimented with the Exchange class; the same problems are observed using the SparkSqlSerializer, suggesting the UnsafeRowSerializer is probably fine. I agree with your second comment: I think the code within org.apache.spark.unsafe.Platform is OK or we'd be hitting problems elsewhere. It'll be useful to determine how the values in the assertions can be determined programmatically; I think the partitioning algorithm itself is working as expected, but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged, and I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns). Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes. > Introduce an ExchangeCoordinator to estimate the number of post-shuffle > partitions. > --- > > Key: SPARK-9858 > URL: https://issues.apache.org/jira/browse/SPARK-9858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048642#comment-15048642 ] Apache Spark commented on SPARK-12196: -- User 'yucai' has created a pull request for this issue: https://github.com/apache/spark/pull/10225 > Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai > > *Problem* > Nowadays, users have both SSDs and HDDs. > SSDs have great performance, but capacity is small. HDDs have good capacity, > but are x2-x3 slower than SSDs. > How can we get the best of both? > *Solution* > Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup > storage. > When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it > gets blocks from SSDs first, and when the SSDs' usable space is less than some > threshold, it falls back to HDDs. > In our implementation, we actually go further: we support building a hierarchy > store of any number of levels across all storage media (NVM, SSD, HDD, etc.). > *Performance* > 1. In the best case, our solution performs the same as all-SSDs. > 2. In the worst case, e.g. when all data spills to HDDs, there is no performance > regression. > 3. Compared with all-HDDs, the hierarchy store improves performance by more than *_x1.86_* (it > could be higher; the CPU reaches a bottleneck in our test environment). > 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because > we support both RDD cache and shuffle with no extra inter-process > communication. > *Usage* > 1. Configure spark.hierarchyStore. > {code} > spark.hierarchyStore='nvm 50GB,ssd 80GB' > {code} > It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all > the rest form the last layer. > 2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir > or yarn.nodemanager.local-dirs. 
> {code} > spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others > {code} > After that, restart your Spark application; it will allocate blocks from nvm > first. > When nvm's usable space is less than 50GB, it starts to allocate from ssd. > When ssd's usable space is less than 80GB, it starts to allocate from the > last layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
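The allocation policy described above (fill the fastest tier until its usable space drops below its threshold, then fall through to the next) can be sketched as follows. This is an illustration only, not the actual patch: the tier paths, thresholds, and the injected free-space probe are hypothetical.

```python
# Hypothetical tiers parsed from "nvm 50GB, ssd 80GB": every named tier keeps a
# reserve threshold; the unnamed last layer catches everything else.
TIERS = [
    ("/mnt/nvm1", 50 * 1024**3),   # keep at least 50 GB usable on nvm
    ("/mnt/ssd1", 80 * 1024**3),   # keep at least 80 GB usable on ssd
    ("/mnt/disk1", 0),             # last layer: no reserve
]

def pick_dir(tiers, free_bytes):
    """Return the first tier whose usable space exceeds its threshold.

    free_bytes(path) is injected so the sketch is testable; a real
    implementation would use shutil.disk_usage(path).free.
    """
    for path, threshold in tiers:
        if free_bytes(path) > threshold:
            return path
    return tiers[-1][0]  # fall back to the last layer regardless

# Example: nvm is nearly full, so allocation falls through to ssd.
fake_free = {"/mnt/nvm1": 10 * 1024**3, "/mnt/ssd1": 200 * 1024**3, "/mnt/disk1": 10**12}
print(pick_dir(TIERS, fake_free.__getitem__))  # -> /mnt/ssd1
```

The same loop generalizes to any number of levels, which is what the "any level hierarchy store" claim in the description amounts to.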
[jira] [Commented] (SPARK-12229) How to Perform spark submit of application written in scala from Node js
[ https://issues.apache.org/jira/browse/SPARK-12229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048663#comment-15048663 ] Nan Zhu commented on SPARK-12229: - https://github.com/spark-jobserver/spark-jobserver might be a good reference BTW, I think you can forward this question to user list to get more feedback. Most of the discussions here are just for bug/feature tracking, etc. If you are comfortable about this, would you mind closing the issue? > How to Perform spark submit of application written in scala from Node js > > > Key: SPARK-12229 > URL: https://issues.apache.org/jira/browse/SPARK-12229 > Project: Spark > Issue Type: Question > Components: Spark Core, Spark Submit >Affects Versions: 1.3.1 > Environment: Linux 14.4 >Reporter: himanshu singhal > Labels: newbie > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > I am having an spark core application written in scala and now i am > developing the front end in node js but having a question how can i make the > spark submit to run the application from the node js -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12240) FileNotFoundException: (Too many open files) when using multiple groupby on DataFrames
Shubhanshu Mishra created SPARK-12240: - Summary: FileNotFoundException: (Too many open files) when using multiple groupby on DataFrames Key: SPARK-12240 URL: https://issues.apache.org/jira/browse/SPARK-12240 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Environment: Debian 3.2.68-1+deb7u6 x86_64 GNU/Linux Reporter: Shubhanshu Mishra Whenever I try to do multiple groupings using DataFrames, my job crashes with a FileNotFoundException and the message "too many open files". I can do these groupings using RDDs easily, but when I use the DataFrame operations I see these issues. The code I am running: ``` df_t = df.filter(df['max_cum_rank'] == 0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas() ``` In [151]: df_t = df.filter(df['max_cum_rank'] == 0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas() [Stage 27:=>(415 + 1) / 416]15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675 java.io.FileNotFoundException: /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675 (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174) at org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e java.io.FileNotFoundException: /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e (Too many open files) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174) at org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d java.io.FileNotFoundException: /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d (Too many open files) at 
java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174) at org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at
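The "(Too many open files)" errors above mean the executor process hit its file-descriptor limit; each in-flight shuffle temp file holds a descriptor open, so a wide shuffle exhausts a small limit quickly. A common first check, before tuning the job itself, is to inspect and raise the soft limit inherited by the executors. A sketch using Python's stdlib resource module (the 65536 target is illustrative):

```python
import resource

# Per-process open-file limit inherited by anything launched from this process;
# a 1024 soft limit (a common default) is easily exhausted by shuffle temp files.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Raise the soft limit up to the hard limit (going beyond the hard limit needs
# root or an /etc/security/limits.conf entry; 65536 is an illustrative target).
target = min(65536, hard) if hard != resource.RLIM_INFINITY else 65536
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))
```

Reducing the number of shuffle partitions (spark.sql.shuffle.partitions) also lowers the number of simultaneously open temp files, which matches the reporter's later observation on SHUFFLE_PARTITIONS.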
[jira] [Issue Comment Deleted] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Comment: was deleted (was: Sorry, I have to delete this PR because of my wrong github operation, I will send a new one ASAP.) > Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai > > *Problem* > Nowadays, users have both SSDs and HDDs. > SSDs have great performance, but capacity is small. HDDs have good capacity, > but are x2-x3 slower than SSDs. > How can we get the best of both? > *Solution* > Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup > storage. > When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it > gets blocks from SSDs first, and when the SSDs' usable space is less than some > threshold, it falls back to HDDs. > In our implementation, we actually go further: we support building a hierarchy > store of any number of levels across all storage media (NVM, SSD, HDD, etc.). > *Performance* > 1. In the best case, our solution performs the same as all-SSDs. > 2. In the worst case, e.g. when all data spills to HDDs, there is no performance > regression. > 3. Compared with all-HDDs, the hierarchy store improves performance by more than *_x1.86_* (it > could be higher; the CPU reaches a bottleneck in our test environment). > 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because > we support both RDD cache and shuffle with no extra inter-process > communication. > *Usage* > 1. In spark-default.xml, configure spark.hierarchyStore. > {code} > spark.hierarchyStore nvm 50GB, ssd 80GB > {code} > It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all > the rest form the last layer. > 2. Configure the "nvm" and "ssd" locations in the local dirs, e.g. spark.local.dir > or yarn.nodemanager.local-dirs. 
> {code} > spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others > {code} > After that, restart your Spark application; it will allocate blocks from nvm > first. > When nvm's usable space is less than 50GB, it starts to allocate from ssd. > When ssd's usable space is less than 80GB, it starts to allocate from the > last layer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:28 PM: --- Below is the explain plan. To make it clear, query that contains not (A and B) : {code} and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff')) {code} produces wrong results, and query that is already expanded as (not A) or (not B) produces correct output. By the way I saw in explain plan cast(0 as double)) so I tried to change 0 => 0.0 but no difference. Physical plan looks similar: {code} 'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff))) {code} Explain plan results: {code} In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) {code} {noformat} == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:40 PM: --- Below is the explain plan. To make it clear, query that contains not (A and B) : {code} and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff')) {code} produces wrong results, and query that is already expanded as (not A) or (not B) produces correct output. Physical plan look like this: {code} wrong results-- Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = 0.0) && ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff))) correct results -- Filter ((LoanID#8803 = 62231) && (NOT (PaymentsReceived#8816 = 0.0) || NOT ExplicitRoll#8826 IN (PreviouslyPaidOff,PreviouslyChargedOff))) {code} Explain plan results: {code} Wrong: In [15]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0.0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) {code} {noformat} == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0.0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8625,MnthRate#8626,ObservationMonth#8627,CycleCounter#8628,LoanID#8629,Loankey#8630,OriginationDate#8631,OriginationQuarter#8632,LoanAmount#8633,Term#8634,LenderRate#8635,ProsperRating#8636,ScheduledMonthlyPaymentAmount#8637,ChargeoffMonth#8638,ChargeoffAmount#8639,CompletedMonth#8640,MonthOfLastPayment#8641,PaymentsReceived#8642,CollectionFees#8643,PrincipalPaid#8644,InterestPaid#8645,LateFees#8646,ServicingFees#8647,RecoveryPayments#8648,RecoveryPrin#8649,DaysPastDue#8650,PriorMonthDPD#8651,ExplicitRoll#8652,SummaryRoll#8653,CumulPrin#8654,EOMPrin#8655,ScheduledPrinRemaining#8656,ScheduledCumulPrin#8657,ScheduledPeriodicPrin#8658,BOMPrin#8659,ListingNumber#8660,DebtSaleMonth#8661,GrossCashFromDebtSale#8662,DebtSaleFee#8663,NetCashToInvestorsFromDebtSale#8664,OZVintage#8665] ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = cast(0.0 as double)) && ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8625,MnthRate#8626,ObservationMonth#8627,CycleCounter#8628,LoanID#8629,Loankey#8630,OriginationDate#8631,OriginationQuarter#8632,LoanAmount#8633,Term#8634,LenderRate#8635,ProsperRating#8636,ScheduledMonthlyPaymentAmount#8637,ChargeoffMonth#8638,ChargeoffAmount#8639,CompletedMonth#8640,MonthOfLastPayment#8641,PaymentsReceived#8642,CollectionFees#8643,PrincipalPaid#8644,InterestPaid#8645,LateFees#8646,ServicingFees#8647,RecoveryPayments#8648,RecoveryPrin#8649,DaysPastDue#8650,PriorMonthDPD#8651,ExplicitRoll#8652,SummaryRoll#8653,CumulPrin#8654,EOMPrin#8655,ScheduledPrinRemaining#8656,ScheduledCumulPrin#8657,ScheduledPeriodicPrin#8658,BOMPrin#8659,ListingNumber#8660,DebtSaleMonth#8661,GrossCashFromDebtSale#8662,DebtSaleFee#8663,NetCashToInvestorsFromDebtSale#8664,OZVintage#8665] ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8629 = 62231) && NOT ((PaymentsReceived#8642 = 0.0) && ExplicitRoll#8652 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048802#comment-15048802 ] Reynold Xin commented on SPARK-3461: I'm going to close this one since this API is now available in the new Dataset API (SPARK-). > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
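The mechanism Patrick describes (sort within partitions, then detect group boundaries with a lookahead over the resulting iterator) can be sketched without Spark: after a repartitionAndSortWithinPartitions-style sort, each key's values form a contiguous run, and a thin wrapper (here Python's itertools.groupby, standing in for the proposed lookahead iterator) detects the boundaries lazily:

```python
from itertools import groupby
from operator import itemgetter

# Stand-in for one partition's output after repartitionAndSortWithinPartitions:
# records arrive sorted by key, so each key's group is a contiguous run.
sorted_partition = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5), ("c", 6)]

def grouped(stream):
    """Lazily yield (key, iterator-of-values) pairs from a key-sorted stream.

    Like the proposed external groupByKey, only the group currently being
    consumed needs to fit in memory; the stream is never materialized whole.
    """
    for key, run in groupby(stream, key=itemgetter(0)):
        yield key, (value for _, value in run)

result = {k: list(vs) for k, vs in grouped(iter(sorted_partition))}
print(result)  # -> {'a': [1, 2], 'b': [3, 4, 5], 'c': [6]}
```

This is why beginner-friendly groupByKey jobs can be made external "pretty easily" once SPARK-2978's sorted iterator exists: the grouping layer on top is only boundary detection.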
[jira] [Updated] (SPARK-12217) Document invalid handling for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12217: -- Target Version/s: (was: 1.6.0) Fix Version/s: (was: 1.6.1) (was: 2.0.0) [~BenFradet] please don't set fix or target version > Document invalid handling for StringIndexer > --- > > Key: SPARK-12217 > URL: https://issues.apache.org/jira/browse/SPARK-12217 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Benjamin Fradet >Priority: Minor > > Documentation is needed regarding the handling of invalid labels in > StringIndexer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12219) Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly
[ https://issues.apache.org/jira/browse/SPARK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048807#comment-15048807 ] Sean Owen commented on SPARK-12219: --- Can you provide more detail? what exactly is the error and how are you building? it compiles vs 2.11 at the moment, or did so recently > Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly > - > > Key: SPARK-12219 > URL: https://issues.apache.org/jira/browse/SPARK-12219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 >Reporter: Rodrigo Boavida > > I've tried with no success to build Spark on Scala 2.11.7. I'm getting build > errors using sbt due to the issues found in the below thread in July of this > year. > https://mail-archives.apache.org/mod_mbox/spark-dev/201507.mbox/%3CCA+3qhFSJGmZToGmBU1=ivy7kr6eb7k8t6dpz+ibkstihryw...@mail.gmail.com%3E > Seems some minor fixes are needed to make the Scala 2.11 compiler happy. > I needed to build with SBT as per suggested on below thread to get over some > apparent maven shader plugin because which changed some classes when I change > to akka 2.4.0. > https://groups.google.com/forum/#!topic/akka-user/iai6whR6-xU > I've set this bug to Major priority assuming that the Spark community wants > to keep fully supporting SBT builds, including the Scala 2.11 compatibility. > Tnks, > Rod -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12207) "org.datanucleus" is already registered
[ https://issues.apache.org/jira/browse/SPARK-12207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12207. --- Resolution: Invalid I don't think is appropriate as a JIRA; you're not narrowing down the problem to an actionable issue. It looks like a problem in your env. > "org.datanucleus" is already registered > --- > > Key: SPARK-12207 > URL: https://issues.apache.org/jira/browse/SPARK-12207 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows 7 64 bit > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > winutils ls \tmp\hive -> > drwxrwxrwx 1 PC\Stefan BloomBear-SSD\None 0 Dec 8 2015 \tmp\hive >Reporter: stefan > > I read the response to an identical issue: > https://issues.apache.org/jira/browse/SPARK-11142 > and then did search the mailing list archives here: > http://apache-spark-user-list.1001560.n3.nabble.com/ > but got no help > http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=search_page=1=%22org.datanucleus%22+is+already+registered=0 > There were only 6 posts over all the dates but no resolution. > My apache spark folder has 3 jar files in the lib folder > datanucleus-api-jdo-3.2.6.jar > datanucleus-core-3.2.10.jar > datanucleus-rdbms-3.2.9.jar > and these are compiled -> mostly unreadable > There are no other files with the word 'datanucleus' on my pc. > Maven's pom.xml file has been implicated > http://stackoverflow.com/questions/877949/conflicting-versions-of-datanucleus-enhancer-in-a-maven-google-app-engine-projec > but I do not have maven on this windows box. 
> This post: > http://www.worldofdb2.com/profiles/blogs/using-spark-s-interactive-scala-shell-for-accessing-db2-data > suggests downloading > DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) and setting > SPARK_CLASSPATH=c:\db2jcc.jar > but the mailing list archives say SPARK_CLASSPATH is deprecated > https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3C01a901d0547c$a23ba480$e6b2ed80$@innowireless.com%3E > Could I have a pointer to a resolution to this problem? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12031) Integer overflow when doing sampling.
[ https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12031. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10023 [https://github.com/apache/spark/pull/10023] > Integer overflow when do sampling. > -- > > Key: SPARK-12031 > URL: https://issues.apache.org/jira/browse/SPARK-12031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: uncleGen >Priority: Critical > Fix For: 2.0.0, 1.6.1 > > > In my case, some partitions contain too much items. When do range partition, > exception thrown as: > {code} > java.lang.IllegalArgumentException: n must be positive > at java.util.Random.nextInt(Random.java:300) > at > org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
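The stack trace above ends in `java.util.Random.nextInt(n)`, which requires a strictly positive bound; when a per-partition item count exceeds `Integer.MAX_VALUE`, narrowing it to `int` wraps to a negative value and produces exactly this `IllegalArgumentException`. A minimal Spark-independent sketch of that failure mode (variable names are illustrative, not Spark's code):

```java
import java.util.Random;

// Sketch of the overflow behind SPARK-12031: a count that fits in a long but
// not in an int wraps to a negative value when narrowed, and Random.nextInt
// then rejects it ("n must be positive" on JDK 7).
class OverflowDemo {
    // Narrowing conversion: wraps for values above Integer.MAX_VALUE.
    static int narrow(long itemsSeen) {
        return (int) itemsSeen;
    }

    public static void main(String[] args) {
        long itemsSeen = 3_000_000_000L;   // a partition with ~3 billion items
        int bound = narrow(itemsSeen);     // negative after the wrap
        try {
            new Random().nextInt(bound);   // a bound <= 0 is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected negative bound: " + bound);
        }
    }
}
```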
[jira] [Updated] (SPARK-12195) Adding BigDecimal, Date and Timestamp into Encoder
[ https://issues.apache.org/jira/browse/SPARK-12195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12195: -- Assignee: Xiao Li > Adding BigDecimal, Date and Timestamp into Encoder > -- > > Key: SPARK-12195 > URL: https://issues.apache.org/jira/browse/SPARK-12195 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 1.6.0 > > > Add three types like DecimalType, DateType and TimestampType into Encoder for > DataSet APIs. > DecimalType -> java.math.BigDecimal > DateType -> java.sql.Date > TimestampType -> java.sql.Timestamp -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12201) add type coercion rule for greatest/least
[ https://issues.apache.org/jira/browse/SPARK-12201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12201: -- Assignee: Wenchen Fan > add type coercion rule for greatest/least > - > > Key: SPARK-12201 > URL: https://issues.apache.org/jira/browse/SPARK-12201 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:27 PM: --- Below is the explain plan. To make it clear, query that contains not (A and B) : {code} and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))") {code} produces wrong results, and query that is already expanded as (not A) or (not B) produces correct output. By the way I saw in explain plan cast(0 as double)) so I tried to change 0 => 0.0 but no difference. Physical plan looks similar: {code} 'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff))) {code} Explain plan results: {code} In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) {code} {noformat} == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
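The two filter forms being compared are equivalent by De Morgan's law, so if Spark returns different rows for them, that is a correctness bug rather than a query-writing issue. The pure boolean identity can be checked exhaustively (plain Java, no Spark involved):

```java
// Truth-table check of De Morgan's law: not (A and B) == (not A) or (not B).
class DeMorgan {
    static boolean notAnd(boolean a, boolean b) { return !(a && b); }
    static boolean orNot(boolean a, boolean b)  { return !a || !b; }

    public static void main(String[] args) {
        for (boolean a : new boolean[] {true, false}) {
            for (boolean b : new boolean[] {true, false}) {
                if (notAnd(a, b) != orNot(a, b)) {
                    throw new AssertionError("mismatch at a=" + a + ", b=" + b);
                }
            }
        }
        System.out.println("identity holds for all four input combinations");
    }
}
```

Note that SQL's three-valued logic adds NULL as a third truth value, but De Morgan's law also holds in that logic, so NULL handling alone should not explain a divergence between the two forms.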
[jira] [Created] (SPARK-12241) Improve failure reporting in Yarn client obtainTokenForHBase()
Steve Loughran created SPARK-12241: -- Summary: Improve failure reporting in Yarn client obtainTokenForHBase() Key: SPARK-12241 URL: https://issues.apache.org/jira/browse/SPARK-12241 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.5.2 Environment: Kerberos Reporter: Steve Loughran Priority: Minor SPARK-11265 improved reporting of reflection problems getting the Hive token in a secure cluster. The `obtainTokenForHBase()` method needs to pick up the same policy of what exceptions to swallow/reject, and report any swallowed exceptions with the full stack trace. without this, problems authenticating with HBase result in log errors with no meaningful information at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
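The policy being requested, mirroring SPARK-11265, is: swallow only the reflection failures that are expected when the optional service is absent, and log what was swallowed with its full stack trace instead of dropping it. A sketch of that pattern (class and method names are illustrative, not Spark's actual code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of the exception policy requested for obtainTokenForHBase(): treat
// "provider class not present" as expected and log it with the full stack
// trace, but let genuinely unexpected reflection failures surface loudly.
class TokenFetch {
    private static final Logger LOG = Logger.getLogger(TokenFetch.class.getName());

    // Returns true if the provider could be instantiated reflectively.
    static boolean obtainToken(String providerClass) {
        try {
            Class.forName(providerClass).getDeclaredConstructor().newInstance();
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            // Expected when the optional service is absent: swallow, but keep
            // the stack trace in the log so operators can diagnose auth issues.
            LOG.log(Level.INFO, "token provider unavailable: " + providerClass, e);
            return false;
        } catch (ReflectiveOperationException e) {
            // Unexpected reflection failure: report with full context.
            throw new IllegalStateException("failed to obtain token from " + providerClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(obtainToken("java.util.ArrayList"));         // class present
        System.out.println(obtainToken("com.example.MissingProvider")); // absent: logged, not fatal
    }
}
```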
[jira] [Created] (SPARK-12242) DataFrame.transform function
Reynold Xin created SPARK-12242: --- Summary: DataFrame.transform function Key: SPARK-12242 URL: https://issues.apache.org/jira/browse/SPARK-12242 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Similar to Dataset.transform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12242) DataFrame.transform function
[ https://issues.apache.org/jira/browse/SPARK-12242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048818#comment-15048818 ] Apache Spark commented on SPARK-12242: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/10226 > DataFrame.transform function > > > Key: SPARK-12242 > URL: https://issues.apache.org/jira/browse/SPARK-12242 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Similar to Dataset.transform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12242) DataFrame.transform function
[ https://issues.apache.org/jira/browse/SPARK-12242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12242: Assignee: Reynold Xin (was: Apache Spark) > DataFrame.transform function > > > Key: SPARK-12242 > URL: https://issues.apache.org/jira/browse/SPARK-12242 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Similar to Dataset.transform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12196) Store blocks in storage devices with hierarchy way
[ https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-12196: -- Description: *Problem* Nowadays, users have both SSDs and HDDs. SSDs have great performance, but small capacity. HDDs have good capacity, but are x2-x3 slower than SSDs. How can we get the best of both? *Solution* Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when the SSDs' usable space falls below some threshold, it gets blocks from HDDs. In our implementation, we actually go further: we support building a hierarchy store with any number of levels across all storage media (NVM, SSD, HDD, etc.). *Performance* 1. In the best case, our solution performs the same as all SSDs. 2. In the worst case, e.g. when all data is spilled to HDDs, there is no performance regression. 3. Compared with all HDDs, the hierarchy store improves performance by more than *_x1.86_* (it could be higher; the CPU reaches a bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because we support both RDD cache and shuffle with no extra inter-process communication. *Usage* 1. Configure spark.storage.hierarchyStore. {code} spark.storage.hierarchyStore='nvm 50GB,ssd 80GB' {code} This builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and all the rest form the last layer. 2. Configure the "nvm" and "ssd" locations in the local dir setting, i.e. spark.local.dir or yarn.nodemanager.local-dirs. {code} spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After that, restart your Spark application; it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer. was: *Problem* Nowadays, users have both SSDs and HDDs. 
SSDs have great performance, but capacity is small. HDDs have good capacity, but x2-x3 lower than SSDs. How can we get both good? *Solution* Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup storage. When Spark core allocates blocks for RDD (either shuffle or RDD cache), it gets blocks from SSDs first, and when SSD’s useable space is less than some threshold, getting blocks from HDDs. In our implementation, we actually go further. We support a way to build any level hierarchy store access all storage medias (NVM, SSD, HDD etc.). *Performance* 1. At the best case, our solution performs the same as all SSDs. 2. At the worst case, like all data are spilled to HDDs, no performance regression. 3. Compared with all HDDs, hierarchy store improves more than *_x1.86_* (it could be higher, CPU reaches bottleneck in our test environment). 4. Compared with Tachyon, our hierarchy store still *_x1.3_* faster. Because we support both RDD cache and shuffle and no extra inter process communication. *Usage* 1. Configure spark.storage.hierarchyStore. {code} spark.hierarchyStore='nvm 50GB,ssd 80GB' {code} It builds a 3 layers hierarchy store: the 1st is "nvm", the 2nd is "sdd", all the rest form the last layer. 2. Configuration the "nvm", "ssd" location in local dir, like spark.local.dir or yarn.nodemanager.local-dirs. {code} spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others {code} After then, restart your Spark application, it will allocate blocks from nvm first. When nvm's usable space is less than 50GB, it starts to allocate from ssd. When ssd's usable space is less than 80GB, it starts to allocate from the last layer. > Store blocks in storage devices with hierarchy way > -- > > Key: SPARK-12196 > URL: https://issues.apache.org/jira/browse/SPARK-12196 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: yucai > > *Problem* > Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but capacity is small. HDDs have good capacity, > but x2-x3 lower than SSDs. > How can we get both good? > *Solution* > Our idea is to build hierarchy store: use SSDs as cache and HDDs as backup > storage. > When Spark core allocates blocks for RDD (either shuffle or RDD cache), it > gets blocks from SSDs first, and when SSD’s useable space is less than some > threshold, getting blocks from HDDs. > In our implementation, we actually go further. We support a way to build any > level hierarchy store access all storage medias (NVM, SSD, HDD etc.). > *Performance* > 1. At the best case, our solution performs the same as all SSDs. > 2. At the worst case, like all data are spilled to HDDs, no performance > regression. > 3. Compared with all HDDs, hierarchy
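The allocation rule described in the proposal, take the first layer whose usable space is still at or above its configured floor, otherwise fall through to the last layer, can be sketched as follows; the names and structure are illustrative, not the patch's actual code:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of tier selection for a hierarchy store such as "nvm 50GB,ssd 80GB":
// allocate from the first tier whose usable space is at or above its
// threshold; otherwise fall through to the last (unbounded) layer.
class TierPicker {
    static final class Tier {
        final String name;
        final long usableGB;
        final long thresholdGB;
        Tier(String name, long usableGB, long thresholdGB) {
            this.name = name;
            this.usableGB = usableGB;
            this.thresholdGB = thresholdGB;
        }
    }

    static String pickTier(List<Tier> tiers, String fallback) {
        for (Tier t : tiers) {
            if (t.usableGB >= t.thresholdGB) {
                return t.name;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // nvm has dropped below its 50GB floor, so allocation moves to ssd.
        List<Tier> tiers = Arrays.asList(new Tier("nvm", 40, 50), new Tier("ssd", 120, 80));
        System.out.println(pickTier(tiers, "hdd"));
    }
}
```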
[jira] [Resolved] (SPARK-12229) How to Perform spark submit of application written in scala from Node js
[ https://issues.apache.org/jira/browse/SPARK-12229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12229. --- Resolution: Invalid Target Version/s: (was: 1.3.1) [~himanshu.techsk...@gmail.com] as I say, the user@ list is the right place for questions. Don't open a JIRA. > How to Perform spark submit of application written in scala from Node js > > > Key: SPARK-12229 > URL: https://issues.apache.org/jira/browse/SPARK-12229 > Project: Spark > Issue Type: Question > Components: Spark Core, Spark Submit >Affects Versions: 1.3.1 > Environment: Linux 14.4 >Reporter: himanshu singhal > Labels: newbie > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > I am having an spark core application written in scala and now i am > developing the front end in node js but having a question how can i make the > spark submit to run the application from the node js -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12216: -- Priority: Minor (was: Major) Component/s: Spark Shell [~skypickle] have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA. This can't be Major and needs a component. It's not clear why: permissions problem? something else has that dir open? Is it reproducible? I suspect it's a problem specific to Windows. Do you have a suggestion about how to work around it? > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. 
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
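The shutdown-hook failure above aborts on the first file that cannot be removed (on Windows, typically because another process still holds a handle on it). One possible mitigation, sketched here and not Spark's actual fix, is to collect per-file failures and report them after attempting the whole tree:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Best-effort recursive delete: instead of throwing on the first failure the
// way a strict deleteRecursively does, record every path that could not be
// removed and let the caller decide how to report them.
class SafeDelete {
    static List<File> deleteRecursively(File dir, List<File> failed) {
        File[] children = dir.listFiles();   // null for plain files or missing paths
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child, failed);
            }
        }
        if (!dir.delete()) {
            failed.add(dir);                 // e.g. an open handle on Windows
        }
        return failed;
    }

    public static void main(String[] args) {
        List<File> failed = deleteRecursively(new File("no-such-dir"), new ArrayList<File>());
        System.out.println("paths not deleted: " + failed);
    }
}
```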
[jira] [Resolved] (SPARK-6929) Alias for more complex expression causes attribute not been able to resolve
[ https://issues.apache.org/jira/browse/SPARK-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6929. -- Resolution: Duplicate > Alias for more complex expression causes attribute not been able to resolve > --- > > Key: SPARK-6929 > URL: https://issues.apache.org/jira/browse/SPARK-6929 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michał Warecki >Priority: Critical > > I've extracted the minimal query that don't work with aliases. You can remove > tstudent expression ((tstudent((COUNT(g_0.test2_value) - 1)) from that query > and result will be the same. In exception you can see that c_0 is not > resolved but c_1 cause that problem. > {code} > SELECT g_0.test1 AS c_0, (AVG(g_0.test2) - ((tstudent((COUNT(g_0.test2_value) > - 1)) * stddev(g_0.test2_value)) / sqrt(convert(COUNT(g_0.test2), long AS > c_1 FROM sometable AS g_0 GROUP BY g_0.test1 ORDER BY c_0 LIMIT 502 > {code} > cause exception: > {code} > Remote org.apache.spark.sql.AnalysisException: cannot resolve 'c_0' given > input columns c_0, c_1; line 1 pos 246 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263) > at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 1:47 PM: --- In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: 
int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Physical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) Scan
[jira] [Commented] (SPARK-2280) Java & Scala reference docs should describe function reference behavior.
[ https://issues.apache.org/jira/browse/SPARK-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048750#comment-15048750 ] Sean Owen commented on SPARK-2280: -- I think it's still a reasonable improvement to the docs. I'd restrict it to the RDD class and related core RDD classes like JavaRDDLike and PairRDDFunctions. If interested, maybe start with a PR as a "[WIP]" work in progress to see if people support adding all the tags and review the descriptions, then finish the job if it's OK. > Java & Scala reference docs should describe function reference behavior. > > > Key: SPARK-2280 > URL: https://issues.apache.org/jira/browse/SPARK-2280 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Hans Uhlig >Priority: Minor > Labels: starter > > Example > JavaPairRDD<K, Iterable<T>> groupBy(Function<T, K> f) > Return an RDD of grouped elements. Each group consists of a key and a > sequence of elements mapping to that key. > T and K are not described and there is no explanation of what the function's > inputs and outputs should be and how GroupBy uses this information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
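To illustrate the kind of explanation the reporter is asking for: in `groupBy`, `T` is the element type of the source RDD and `K` is the key type produced by the supplied function, and the result pairs each key with the elements that mapped to it. The same binding is easy to show with plain `java.util.stream`, used here only so the example runs without Spark:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// How the type parameters bind in a groupBy-style operation:
// T = String (the elements), K = Character (the key the function produces).
class GroupByDoc {
    static Map<Character, List<String>> groupByFirstLetter(List<String> words) {
        // The key function maps each element T to its key K (first letter).
        return words.stream().collect(Collectors.groupingBy(w -> w.charAt(0)));
    }

    public static void main(String[] args) {
        System.out.println(groupByFirstLetter(Arrays.asList("apple", "avocado", "banana")));
    }
}
```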
[jira] [Updated] (SPARK-12208) Abstract the examples into a common place
[ https://issues.apache.org/jira/browse/SPARK-12208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12208: -- Target Version/s: (was: 1.6.0) Priority: Minor (was: Major) > Abstract the examples into a common place > - > > Key: SPARK-12208 > URL: https://issues.apache.org/jira/browse/SPARK-12208 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Priority: Minor > > When we write examples in the code, we put the generation of the data along > with the example itself. We typically have either: > {code} > val data = > sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") > ... > {code} > or some more esoteric stuff such as: > {code} > val data = Array( > (0, 0.1), > (1, 0.8), > (2, 0.2) > ) > val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", > "feature") > {code} > {code} > val data = Array( > Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), > Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0), > Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0) > ) > val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") > {code} > I suggest we follow the example of sklearn and standardize all the generation > of example data inside a few methods, for example in > {{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading > the code is sometimes not enough to figure out what the data is supposed to > be. For example when using {{libsvm_data}}, it is unclear what the dataframe > columns are. This is something we should comment somewhere. > Also, it would help explaining in one place all the scala idiosyncrasies such > as using {{Tuple1.apply}} and such.
[jira] [Updated] (SPARK-12031) Integer overflow when do sampling.
[ https://issues.apache.org/jira/browse/SPARK-12031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12031: -- Assignee: uncleGen Priority: Major (was: Critical) > Integer overflow when do sampling. > -- > > Key: SPARK-12031 > URL: https://issues.apache.org/jira/browse/SPARK-12031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: uncleGen >Assignee: uncleGen > Fix For: 1.6.1, 2.0.0 > > > In my case, some partitions contain too many items. When doing range > partitioning, an exception is thrown: > {code} > java.lang.IllegalArgumentException: n must be positive > at java.util.Random.nextInt(Random.java:300) > at > org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:58) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:259) > at org.apache.spark.RangePartitioner$$anonfun$8.apply(Partitioner.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code}
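The stack trace shows `Random.nextInt` receiving a non-positive bound. One way that happens is 32-bit overflow in a size calculation before the bound reaches the sampler. The sketch below is a self-contained illustration of the failure mode and the usual fix (widening to `long` before multiplying); the helper names are hypothetical, not Spark's actual code.

```java
import java.util.Random;

public class OverflowDemo {
    // A size product computed in 32-bit arithmetic can wrap negative for large
    // partitions; Random.nextInt(bound) then throws "n must be positive",
    // matching the SPARK-12031 stack trace.
    static int badBound(int sampleSizePerPartition, int factor) {
        return sampleSizePerPartition * factor; // may overflow int
    }

    static long safeBound(int sampleSizePerPartition, int factor) {
        return (long) sampleSizePerPartition * factor; // widen before multiplying
    }

    public static void main(String[] args) {
        int bad = badBound(100_000, 30_000);            // 3,000,000,000 wraps negative
        System.out.println(bad < 0);                    // true
        System.out.println(safeBound(100_000, 30_000)); // 3000000000
        try {
            new Random().nextInt(bad);
        } catch (IllegalArgumentException e) {
            // message is "n must be positive" on older JDKs,
            // "bound must be positive" on newer ones
            System.out.println("threw as expected");
        }
    }
}
```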
[jira] [Resolved] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3461. Resolution: Implemented Assignee: Reynold Xin (was: Sandy Ryza) Fix Version/s: 1.6.0 > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Reynold Xin >Priority: Critical > Fix For: 1.6.0 > > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey.
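The lookahead idea in the description can be sketched without Spark: given key-sorted pairs (the shape `repartitionAndSortWithinPartitions` guarantees within a partition), a group ends exactly where the key changes. The plain-Java illustration below is hypothetical and eagerly materializes the groups for brevity; the real operator would stay lazy so each group streams off the sorted iterator.

```java
import java.util.*;

public class GroupBoundaryDemo {
    // Given an iterator of key-sorted entries, emit one (key, values) group at
    // a time by looking ahead for the point where the key changes -- the
    // group-boundary detection the issue describes for an external groupByKey.
    static List<Map.Entry<String, List<Integer>>> groups(Iterator<Map.Entry<String, Integer>> sorted) {
        List<Map.Entry<String, List<Integer>>> out = new ArrayList<>();
        Map.Entry<String, Integer> pending = sorted.hasNext() ? sorted.next() : null;
        while (pending != null) {
            String key = pending.getKey();
            List<Integer> values = new ArrayList<>();
            values.add(pending.getValue());
            pending = null;
            while (sorted.hasNext()) {
                Map.Entry<String, Integer> next = sorted.next();
                if (next.getKey().equals(key)) {
                    values.add(next.getValue());
                } else {
                    pending = next; // group boundary: hold the first entry of the next group
                    break;
                }
            }
            out.add(new AbstractMap.SimpleEntry<>(key, values));
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<Map.Entry<String, Integer>> it = Arrays.<Map.Entry<String, Integer>>asList(
            new AbstractMap.SimpleEntry<>("a", 1),
            new AbstractMap.SimpleEntry<>("a", 2),
            new AbstractMap.SimpleEntry<>("b", 3)).iterator();
        System.out.println(groups(it)); // [a=[1, 2], b=[3]]
    }
}
```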
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:22 PM: --- Below is the explain plan. To make it clear query that contains {code} "and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))")" {code} produces wrong results, one that is already expanded as (not A) or (not B) produces correct output. Physical plan looks similar: {code} 'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff))) {code} Explain plan results: {code} In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) {code} {noformat} == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] 
ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] 
ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
[jira] [Comment Edited] (SPARK-12218) Boolean logic in sql does not work "not (A and B)" is not the same as "(not A) or (not B)"
[ https://issues.apache.org/jira/browse/SPARK-12218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048619#comment-15048619 ] Irakli Machabeli edited comment on SPARK-12218 at 12/9/15 2:25 PM: --- Below is the explain plan. To make it clear, the query that contains not (A and B): {code} "and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))")" {code} produces wrong results, and the query that is already expanded as (not A) or (not B) produces correct output. By the way, I saw cast(0 as double) in the explain plan, so I tried changing 0 => 0.0, but it made no difference. The physical plan looks similar: {code} 'Filter (('LoanID = 62231) && (NOT ('PaymentsReceived = 0) || NOT 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) Filter ((LoanID#8588 = 62231) && (NOT (PaymentsReceived#8601 = 0.0) || NOT ExplicitRoll#8611 IN (PreviouslyPaidOff,PreviouslyChargedOff))) {code} Explain plan results: {code} In [13]: sqlContext.read.parquet('prp_enh1').where(" LoanID=62231 and not( PaymentsReceived=0 and ExplicitRoll in ('PreviouslyPaidOff', 'PreviouslyChargedOff'))").explain(True) {code} {noformat} == Parsed Logical Plan == 'Filter (('LoanID = 62231) && NOT (('PaymentsReceived = 0) && 'ExplicitRoll IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Analyzed Logical Plan == BorrowerRate: double, MnthRate: double, ObservationMonth: date, CycleCounter: int, LoanID: int, Loankey: string, OriginationDate: date, OriginationQuarter: string, LoanAmount: double, Term: int, LenderRate: double, ProsperRating: string, ScheduledMonthlyPaymentAmount: double, ChargeoffMonth: date, ChargeoffAmount: double, CompletedMonth: date, MonthOfLastPayment: date, PaymentsReceived: double, CollectionFees: double, PrincipalPaid: double, InterestPaid: double, LateFees: double, ServicingFees: double, RecoveryPayments: double, RecoveryPrin: double, DaysPastDue: int, PriorMonthDPD: int, ExplicitRoll: string, SummaryRoll: string, CumulPrin: double, EOMPrin: double, ScheduledPrinRemaining: double, ScheduledCumulPrin: double, ScheduledPeriodicPrin: double, BOMPrin: double, ListingNumber: int, DebtSaleMonth: int, GrossCashFromDebtSale: double, DebtSaleFee: double, NetCashToInvestorsFromDebtSale: double, OZVintage: string Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = cast(0 as double)) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff))) 
Relation[BorrowerRate#8543,MnthRate#8544,ObservationMonth#8545,CycleCounter#8546,LoanID#8547,Loankey#8548,OriginationDate#8549,OriginationQuarter#8550,LoanAmount#8551,Term#8552,LenderRate#8553,ProsperRating#8554,ScheduledMonthlyPaymentAmount#8555,ChargeoffMonth#8556,ChargeoffAmount#8557,CompletedMonth#8558,MonthOfLastPayment#8559,PaymentsReceived#8560,CollectionFees#8561,PrincipalPaid#8562,InterestPaid#8563,LateFees#8564,ServicingFees#8565,RecoveryPayments#8566,RecoveryPrin#8567,DaysPastDue#8568,PriorMonthDPD#8569,ExplicitRoll#8570,SummaryRoll#8571,CumulPrin#8572,EOMPrin#8573,ScheduledPrinRemaining#8574,ScheduledCumulPrin#8575,ScheduledPeriodicPrin#8576,BOMPrin#8577,ListingNumber#8578,DebtSaleMonth#8579,GrossCashFromDebtSale#8580,DebtSaleFee#8581,NetCashToInvestorsFromDebtSale#8582,OZVintage#8583] ParquetRelation[file:/d:/MktLending/prp_enh1] == Optimized Logical Plan == Filter ((LoanID#8547 = 62231) && NOT ((PaymentsReceived#8560 = 0.0) && ExplicitRoll#8570 IN (PreviouslyPaidOff,PreviouslyChargedOff)))
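The correctness claim in this report reduces to De Morgan's law: `NOT (A AND B)` must always equal `(NOT A) OR (NOT B)`. A quick exhaustive check of the equivalence (plain Java, no Spark; the predicate names are stand-ins for the report's `PaymentsReceived=0` and `ExplicitRoll IN (...)` conditions):

```java
public class DeMorganDemo {
    // The compact form used in the failing query: NOT (A AND B)
    static boolean compact(boolean a, boolean b)  { return !(a && b); }

    // The manually expanded form that returned correct results: (NOT A) OR (NOT B)
    static boolean expanded(boolean a, boolean b) { return !a || !b; }

    public static void main(String[] args) {
        // De Morgan's law: the two forms agree on all four input combinations,
        // so any engine evaluating them differently has a correctness bug.
        for (boolean a : new boolean[]{false, true}) {
            for (boolean b : new boolean[]{false, true}) {
                System.out.println(compact(a, b) == expanded(a, b)); // true, four times
            }
        }
    }
}
```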
[jira] [Comment Edited] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048636#comment-15048636 ] Adam Roberts edited comment on SPARK-9858 at 12/9/15 2:35 PM: -- Thanks for the prompt reply, rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer within the asKeyValueIterator method. I experimented with the Exchange class, the same problems are observed using the SparkSqlSerializer, suggesting the UnsafeRowSerializer is probably fine. I agree with your second comment, I think the code within org.apache.spark.unsafe.Platform is OK or we'd be hitting problems elsewhere. It'll be useful to determine how the values in the assertions can be determined programmatically, I think the partitioning algorithm itself is working as expected but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged, I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns). Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes. I'm wondering if there's an extra factor we should be taking into account was (Author: aroberts): Thanks for the prompt reply, rowBuffer is a variable in org.apache.spark.sql.execution.UnsafeRowSerializer within the asKeyValueIterator method. I experimented with the Exchange class, the same problems are observed using the SparkSqlSerializer, suggesting the UnsafeRowSerializer is probably fine. I agree with your second comment, I think the code within org.apache.spark.unsafe.Platform is OK or we'd be hitting problems elsewhere. 
It'll be useful to determine how the values in the assertions can be determined programmatically, I think the partitioning algorithm itself is working as expected but for some reason stages require more bytes on the platforms I'm using. spark.sql.shuffle.partitions is unchanged, I'm working off the latest master code. Is there something special about the aggregate, join, and complex query 2 tests? Can we print exactly what the bytes are for each stage? I know rdd.count is always correct and the DataFrames are the same (printed each row, written to json and parquet - no concerns). Potential clue: if we set SQLConf.SHUFFLE_PARTITIONS.key to 4, the aggregate test passes. I'm wondering if there's an extra factor we should take into account when determining the indices regardless of platform. > Introduce an ExchangeCoordinator to estimate the number of post-shuffle > partitions. > --- > > Key: SPARK-9858 > URL: https://issues.apache.org/jira/browse/SPARK-9858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.6.0 > >
[jira] [Commented] (SPARK-6918) Secure HBase with Kerberos does not work over YARN
[ https://issues.apache.org/jira/browse/SPARK-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048799#comment-15048799 ] Pierre Beauvois commented on SPARK-6918: Hi, I also have the issue on my platform. Here are my specs: * Spark 1.5.2 * HBase 1.1.2 * Hadoop 2.7.1 * Zookeeper 3.4.5 * Authentication done through Kerberos I added the option "spark.driver.extraClassPath" in the spark-defaults.conf which contains the HBASE_CONF_DIR as below: spark.driver.extraClassPath = /opt/application/Hbase/current/conf/ On the driver, I start spark-shell (I'm running it in yarn-client mode) {code} [my_user@uabigspark01 ~]$ spark-shell -v --name HBaseTest --jars /opt/application/Hbase/current/lib/hbase-common-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-server-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-client-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-protocol-1.1.2.jar,/opt/application/Hbase/current/lib/protobuf-java-2.5.0.jar,/opt/application/Hbase/current/lib/htrace-core-3.1.0-incubating.jar,/opt/application/Hbase/current/lib/hbase-annotations-1.1.2.jar,/opt/application/Hbase/current/lib/guava-12.0.1.jar {code} Then I run the following lines: {code} scala> import org.apache.spark._ import org.apache.spark._ scala> import org.apache.spark.rdd.NewHadoopRDD import org.apache.spark.rdd.NewHadoopRDD scala> import org.apache.hadoop.fs.Path import org.apache.hadoop.fs.Path scala> import org.apache.hadoop.hbase.util.Bytes import org.apache.hadoop.hbase.util.Bytes scala> import org.apache.hadoop.hbase.HColumnDescriptor import org.apache.hadoop.hbase.HColumnDescriptor scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} scala> import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat import 
org.apache.hadoop.hbase.mapreduce.TableInputFormat scala> import org.apache.hadoop.hbase.io.ImmutableBytesWritable import org.apache.hadoop.hbase.io.ImmutableBytesWritable scala> val conf = HBaseConfiguration.create() conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml scala> conf.addResource(new Path("/opt/application/Hbase/current/conf/hbase-site.xml")) scala> conf.set("hbase.zookeeper.quorum", "FQDN1:2181,FQDN2:2181,FQDN3:2181") scala> conf.set(TableInputFormat.INPUT_TABLE, "user:noheader") scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) 2015-12-09 15:17:58,890 INFO [main] storage.MemoryStore: ensureFreeSpace(266248) called with curMem=0, maxMem=556038881 2015-12-09 15:17:58,892 INFO [main] storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 260.0 KB, free 530.0 MB) 2015-12-09 15:17:59,196 INFO [main] storage.MemoryStore: ensureFreeSpace(32808) called with curMem=266248, maxMem=556038881 2015-12-09 15:17:59,197 INFO [main] storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.0 KB, free 530.0 MB) 2015-12-09 15:17:59,199 INFO [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.200.208:60217 (size: 32.0 KB, free: 530.2 MB) 2015-12-09 15:17:59,203 INFO [main] spark.SparkContext: Created broadcast 0 from newAPIHadoopRDD at :34 hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at :34 scala> hBaseRDD.count() 2015-12-09 15:18:09,441 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4a52ac6 
connecting to ZooKeeper ensemble=FQDN1:2181,FQDN2:2181,FQDN3:2181 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=DRIVER.FQDN 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.7.0_85 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/jre 2015-12-09 15:18:09,456 INFO [main] zookeeper.ZooKeeper: Client
[jira] [Updated] (SPARK-12188) [SQL] Code refactoring and comment correction in Dataset APIs
[ https://issues.apache.org/jira/browse/SPARK-12188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12188: -- Assignee: Xiao Li > [SQL] Code refactoring and comment correction in Dataset APIs > - > > Key: SPARK-12188 > URL: https://issues.apache.org/jira/browse/SPARK-12188 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 1.6.0 > > > - Created a new private variable `boundTEncoder` that can be shared by > multiple functions, `RDD`, `select` and `collect`. > - Replaced all the `queryExecution.analyzed` by the function call > `logicalPlan` > - A few API comments are using wrong class names (e.g., `DataFrame`) or > parameter names (e.g., `n`) > - A few API descriptions are wrong. (e.g., `mapPartitions`)
[jira] [Updated] (SPARK-12222) deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception
[ https://issues.apache.org/jira/browse/SPARK-12222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12222: -- Assignee: Fei Wang > deserialize RoaringBitmap using Kryo serializer throw Buffer underflow > exception > > > Key: SPARK-12222 > URL: https://issues.apache.org/jira/browse/SPARK-12222 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Fei Wang >Assignee: Fei Wang > Fix For: 1.6.0, 2.0.0 > > > Here are some problems when deserializing RoaringBitmap. See the examples below: > run this piece of code > ``` > import com.esotericsoftware.kryo.io.{Input => KryoInput, Output => KryoOutput} > import java.io.DataInput > class KryoInputDataInputBridge(input: KryoInput) extends DataInput { > override def readLong(): Long = input.readLong() > override def readChar(): Char = input.readChar() > override def readFloat(): Float = input.readFloat() > override def readByte(): Byte = input.readByte() > override def readShort(): Short = input.readShort() > override def readUTF(): String = input.readString() // readString in > kryo does utf8 > override def readInt(): Int = input.readInt() > override def readUnsignedShort(): Int = input.readShortUnsigned() > override def skipBytes(n: Int): Int = input.skip(n.toLong).toInt > override def readFully(b: Array[Byte]): Unit = input.read(b) > override def readFully(b: Array[Byte], off: Int, len: Int): Unit = > input.read(b, off, len) > override def readLine(): String = throw new > UnsupportedOperationException("readLine") > override def readBoolean(): Boolean = input.readBoolean() > override def readUnsignedByte(): Int = input.readByteUnsigned() > override def readDouble(): Double = input.readDouble() > } > class KryoOutputDataOutputBridge(output: KryoOutput) extends DataOutput { > override def writeFloat(v: Float): Unit = output.writeFloat(v) > // There is no "readChars" counterpart, except maybe "readLine", which > is not supported > override def writeChars(s: 
String): Unit = throw new > UnsupportedOperationException("writeChars") > override def writeDouble(v: Double): Unit = output.writeDouble(v) > override def writeUTF(s: String): Unit = output.writeString(s) // > writeString in kryo does UTF8 > override def writeShort(v: Int): Unit = output.writeShort(v) > override def writeInt(v: Int): Unit = output.writeInt(v) > override def writeBoolean(v: Boolean): Unit = output.writeBoolean(v) > override def write(b: Int): Unit = output.write(b) > override def write(b: Array[Byte]): Unit = output.write(b) > override def write(b: Array[Byte], off: Int, len: Int): Unit = > output.write(b, off, len) > override def writeBytes(s: String): Unit = output.writeString(s) > override def writeChar(v: Int): Unit = output.writeChar(v.toChar) > override def writeLong(v: Long): Unit = output.writeLong(v) > override def writeByte(v: Int): Unit = output.writeByte(v) > } > val outStream = new FileOutputStream("D:\\wfserde") > val output = new KryoOutput(outStream) > val bitmap = new RoaringBitmap > bitmap.add(1) > bitmap.add(3) > bitmap.add(5) > bitmap.serialize(new KryoOutputDataOutputBridge(output)) > output.flush() > output.close() > val inStream = new FileInputStream("D:\\wfserde") > val input = new KryoInput(inStream) > val ret = new RoaringBitmap > ret.deserialize(new KryoInputDataInputBridge(input)) > println(ret) > ``` > this will throw `Buffer underflow` error: > ``` > com.esotericsoftware.kryo.KryoException: Buffer underflow. > at com.esotericsoftware.kryo.io.Input.require(Input.java:156) > at com.esotericsoftware.kryo.io.Input.skip(Input.java:131) > at com.esotericsoftware.kryo.io.Input.skip(Input.java:264) > at > org.apache.spark.sql.SQLQuerySuite$$anonfun$6$KryoInputDataInputBridge$1.skipBytes > ``` > after some investigation, I found this is caused by a bug in kryo's > `Input.skip(long count)`(https://github.com/EsotericSoftware/kryo/issues/119) > and we call this method in `KryoInputDataInputBridge`. 
> So I think we can fix this issue in two ways: > 1) upgrade the Kryo version to 2.23.0 or 2.24.0, which has fixed this bug in > Kryo (I am not sure the upgrade is safe in Spark, can you check it? @davies ) > 2) we can bypass Kryo's `Input.skip(long count)` by directly calling > another `skip` method in Kryo's > Input.java(https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), > i.e. write the bug-fixed version of `Input.skip(long count)` in
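The second option, looping over bounded skips instead of one call to the buggy long-count `skip`, can be sketched as below. This is a hypothetical illustration against a plain `java.io.InputStream` so the snippet is self-contained; in the actual fix the same loop would wrap Kryo's `Input.skip(int)`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedSkip {
    // Skip `count` bytes by repeatedly issuing int-sized skips, instead of a
    // single long-count call. Returns the number of bytes actually skipped.
    static long skipFully(InputStream in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            int chunk = (int) Math.min(remaining, Integer.MAX_VALUE);
            long skipped = in.skip(chunk);
            if (skipped <= 0) break; // end of stream, stop rather than spin
            remaining -= skipped;
        }
        return count - remaining;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(in, 7)); // 7
        System.out.println(in.read() != -1);  // true: 3 bytes remain
    }
}
```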
[jira] [Reopened] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.
[ https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-11621: --- > ORC filter pushdown not working properly after new unhandled filter interface. > -- > > Key: SPARK-11621 > URL: https://issues.apache.org/jira/browse/SPARK-11621 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hyukjin Kwon > > After the new interface to get rid of filters predicate-push-downed which are > already processed in datasource-level > (https://github.com/apache/spark/pull/9399), it does not push down filters > for ORC. > This is because at {{DataSourceStrategy}}, all the filters are treated as > unhandled filters. > Also, since ORC does not support filtering fully record by record but instead > returns rough results, the filters for ORC should not go to unhandled > filters.
[jira] [Updated] (SPARK-12213) Query with only one distinct should not having on expand
[ https://issues.apache.org/jira/browse/SPARK-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12213: -- Component/s: SQL > Query with only one distinct should not having on expand > > > Key: SPARK-12213 > URL: https://issues.apache.org/jira/browse/SPARK-12213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Expand will double the number of records and slow down projection and > aggregation; it's better to generate a plan without Expand for a query with > only one distinct (for example, ss_max in TPCDS)
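Why no Expand is needed for a single distinct aggregate: one distinct column can be deduplicated per group in a single pass, with no row duplication. The plain-Java sketch below is a hypothetical illustration of that plan shape (a per-key set standing in for the partial-aggregate state), not Spark's planner code.

```java
import java.util.*;

public class SingleDistinctDemo {
    // COUNT(DISTINCT x) GROUP BY key in one pass: keep one set of seen values
    // per key, then report set sizes. No Expand-style record duplication.
    static Map<String, Integer> countDistinct(List<Map.Entry<String, Integer>> rows) {
        Map<String, Set<Integer>> seen = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            seen.computeIfAbsent(row.getKey(), k -> new HashSet<>()).add(row.getValue());
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        seen.forEach((key, values) -> counts.put(key, values.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = Arrays.<Map.Entry<String, Integer>>asList(
            new AbstractMap.SimpleEntry<>("g1", 1),
            new AbstractMap.SimpleEntry<>("g1", 1),
            new AbstractMap.SimpleEntry<>("g1", 2),
            new AbstractMap.SimpleEntry<>("g2", 5));
        System.out.println(countDistinct(rows)); // {g1=2, g2=1}
    }
}
```

With two or more distinct aggregates over different columns this trick no longer works directly, which is why Expand exists at all; the issue only argues for skipping it in the single-distinct case.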
[jira] [Resolved] (SPARK-11621) ORC filter pushdown not working properly after new unhandled filter interface.
[ https://issues.apache.org/jira/browse/SPARK-11621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11621. --- Resolution: Duplicate Fix Version/s: (was: 1.6.0) > ORC filter pushdown not working properly after new unhandled filter interface. > -- > > Key: SPARK-11621 > URL: https://issues.apache.org/jira/browse/SPARK-11621 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hyukjin Kwon > > After the new interface to get rid of filters predicate-push-downed which are > already processed in datasource-level > (https://github.com/apache/spark/pull/9399), it does not push down filters > for ORC. > This is because at {{DataSourceStrategy}}, all the filters are treated as > unhandled filters. > Also, since ORC does not support filtering fully record by record but instead > returns rough results, the filters for ORC should not go to unhandled > filters.
[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS
[ https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048821#comment-15048821 ] Paul Zaczkieiwcz commented on SPARK-10109: -- Still seeing this in 1.5.1. My use case is the same. I've got several independent spark jobs downloading data and appending to the same partitioned parquet directory in HDFS. None of the partitions overlap but the _temporary folder has become a bottleneck, only allowing one job to append at a time. > NPE when saving Parquet To HDFS > --- > > Key: SPARK-10109 > URL: https://issues.apache.org/jira/browse/SPARK-10109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Sparc-ec2, standalone cluster on amazon >Reporter: Virgil Palanciuc > > Very simple code, trying to save a dataframe > I get this in the driver > {quote} > 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID > 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) > and (not for that task): > 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID > 5607, 172.yy.yy.yy): java.lang.NullPointerException > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at > parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > I get this in the executor log: > {quote} > 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on > /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet > File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not > have any open files. 
> at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) >
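For context, the commenter's append pattern would look roughly like the sketch below. This is not code from the ticket: the input path and the exact partition columns ("dpid", "pid", taken from the HDFS paths in the stack trace) are illustrative, using the Spark 1.4+ DataFrame writer API.

```scala
// Hypothetical job: one of several independent jobs appending
// non-overlapping partitions to a shared Parquet directory.
val df = sqlContext.read.json("hdfs:///incoming/batch-0001") // illustrative source
df.write
  .mode("append")               // append alongside other jobs' partitions
  .partitionBy("dpid", "pid")   // partition columns seen in the stack-trace paths
  .parquet("hdfs:///gglogs/2015-07-27")
```

The bottleneck the comment describes follows from all concurrent writers staging their output under the same _temporary directory inside the target path before committing.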
[jira] [Commented] (SPARK-2388) Streaming from multiple different Kafka topics is problematic
[ https://issues.apache.org/jira/browse/SPARK-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048885#comment-15048885 ] Dan Dutrow commented on SPARK-2388: --- The Kafka Direct API lets you insert a callback function that gives you access to the topic name and other metadata besides the key. > Streaming from multiple different Kafka topics is problematic > - > > Key: SPARK-2388 > URL: https://issues.apache.org/jira/browse/SPARK-2388 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Sergey > Fix For: 1.0.1 > > > The default way of creating a stream out of a Kafka source would be > val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", > Map("retarget" -> 2, "datapair" -> 2)) > However, if the two topics - in this case "retarget" and "datapair" - are very > different, there is no way to set up different filters, mapping functions, > etc., as they are effectively merged. > However, the instance of KafkaInputDStream created by this call internally > calls ConsumerConnector.createMessageStream(), which returns a *map* of > KafkaStreams keyed by topic. It would be great if this map were exposed > somehow, so that the aforementioned call > val streamS = KafkaUtils.createStreamS(...) > returned a map of streams. > Regards, > Sergey Malov > Collective Media -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
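The callback the comment refers to is the messageHandler parameter of the direct API's createDirectStream (spark-streaming-kafka, Kafka 0.8 integration). A rough sketch, assuming ssc, kafkaParams, and fromOffsets are already defined; the topic names come from the ticket:

```scala
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// messageHandler maps each record to (topic, value), so records from
// "retarget" and "datapair" can be filtered and mapped per topic downstream.
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

// Per-topic handling that the receiver-based createStream cannot offer:
val retargetOnly = stream.filter { case (topic, _) => topic == "retarget" }
```

This addresses the original request without exposing Kafka's internal map of streams: one stream is created, and the topic travels with each record.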
[jira] [Assigned] (SPARK-12241) Improve failure reporting in Yarn client obtainTokenForHBase()
[ https://issues.apache.org/jira/browse/SPARK-12241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12241: Assignee: Apache Spark > Improve failure reporting in Yarn client obtainTokenForHBase() > -- > > Key: SPARK-12241 > URL: https://issues.apache.org/jira/browse/SPARK-12241 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.5.2 > Environment: Kerberos >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > SPARK-11265 improved reporting of reflection problems getting the Hive token > in a secure cluster. > The `obtainTokenForHBase()` method needs to pick up the same policy of which > exceptions to swallow/reject, and report any swallowed exceptions with the > full stack trace. > Without this, problems authenticating with HBase result in log errors with no > meaningful information at all.
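A rough sketch of the swallow-and-report policy the ticket asks obtainTokenForHBase() to adopt. This is not Spark's actual source: the trait shape and the elided reflective body are illustrative, and the logInfo/logDebug calls assume Spark's Logging trait is mixed in.

```scala
import org.apache.spark.Logging

// Illustrative only: swallow just the "HBase not on the classpath" cases,
// log the swallowed exception with its full stack trace at debug level,
// and let genuine authentication failures propagate.
trait HBaseTokenSupport extends Logging {
  def obtainTokenForHBase(): Unit = {
    try {
      // reflective call to HBase's TokenUtil would go here (elided)
    } catch {
      case e @ (_: ClassNotFoundException | _: NoClassDefFoundError) =>
        logInfo(s"HBase token not obtained: ${e.getMessage}")
        logDebug("HBase token failure", e) // full stack trace, not discarded
      case e: Throwable =>
        throw e // report real auth problems instead of hiding them
    }
  }
}
```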
[jira] [Resolved] (SPARK-9059) Update Python Direct Kafka Word count examples to show the use of HasOffsetRanges
[ https://issues.apache.org/jira/browse/SPARK-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9059. -- Resolution: Not A Problem > Update Python Direct Kafka Word count examples to show the use of > HasOffsetRanges > - > > Key: SPARK-9059 > URL: https://issues.apache.org/jira/browse/SPARK-9059 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das >Priority: Minor > Labels: starter > > Update Python examples of Direct Kafka word count to access the offset ranges > using HasOffsetRanges and print them. For example, in Scala: > > {code} > var offsetRanges: Array[OffsetRange] = _ > ... > directKafkaDStream.foreachRDD { rdd => > offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > } > ... > transformedDStream.foreachRDD { rdd => > // some operation > println("Processed ranges: " + offsetRanges.mkString(", ")) > } > {code} > See https://spark.apache.org/docs/latest/streaming-kafka-integration.html for > more info, and the master source code for the most up-to-date information on > Python.
[jira] [Closed] (SPARK-12203) Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers using receivers
[ https://issues.apache.org/jira/browse/SPARK-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-12203. --- Resolution: Won't Fix > Add KafkaDirectInputDStream that directly pulls messages from Kafka Brokers > using receivers > --- > > Key: SPARK-12203 > URL: https://issues.apache.org/jira/browse/SPARK-12203 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Liang-Chi Hsieh > > Currently, we have DirectKafkaInputDStream, which pulls messages directly > from Kafka brokers without any receivers, and KafkaInputDStream, which pulls > messages from a Kafka broker using a receiver with ZooKeeper. > As we observed, because DirectKafkaInputDStream retrieves messages from Kafka > only after each batch finishes, it introduces latency compared with > KafkaInputDStream, which continues to pull messages during each batch window. > So we propose adding KafkaDirectInputDStream, which pulls messages directly > from Kafka brokers like DirectKafkaInputDStream, but uses receivers like > KafkaInputDStream and pulls messages during each batch window.