[jira] [Updated] (SPARK-1681) Handle hive support correctly in ./make-distribution
[ https://issues.apache.org/jira/browse/SPARK-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1681: --- Description: When Hive support is enabled we should copy the datanucleus jars to the packaged distribution. The simplest way would be to create a lib_managed folder in the final distribution so that the compute-classpath script searches in exactly the same way whether or not it's a release. A slightly nicer solution is to put the jars inside of `/lib` and have some fancier check for the jar location in the compute-classpath script. We should also document how to run Spark SQL on YARN when hive support is enabled. In particular how to add the necessary jars to spark-submit. was: When Hive support is enabled we should copy the datanucleus jars to the packaged distribution. The simplest way would be to create a lib_managed folder in the final distribution so that the compute-classpath script searches in exactly the same way whether or not it's a release. A slightly nicer solution is to put the jars inside of `/lib` and have some fancier check for the jar location in the compute-classpath script. > Handle hive support correctly in ./make-distribution > > > Key: SPARK-1681 > URL: https://issues.apache.org/jira/browse/SPARK-1681 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > > When Hive support is enabled we should copy the datanucleus jars to the > packaged distribution. The simplest way would be to create a lib_managed > folder in the final distribution so that the compute-classpath script > searches in exactly the same way whether or not it's a release. > A slightly nicer solution is to put the jars inside of `/lib` and have some > fancier check for the jar location in the compute-classpath script. > We should also document how to run Spark SQL on YARN when hive support is > enabled. In particular how to add the necessary jars to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1681) Handle hive support correctly in ./make-distribution
Patrick Wendell created SPARK-1681: -- Summary: Handle hive support correctly in ./make-distribution Key: SPARK-1681 URL: https://issues.apache.org/jira/browse/SPARK-1681 Project: Spark Issue Type: Bug Components: Build, SQL Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker Fix For: 1.0.0 When Hive support is enabled we should copy the datanucleus jars to the packaged distribution. The simplest way would be to create a lib_managed folder in the final distribution so that the compute-classpath script searches in exactly the same way whether or not it's a release. A slightly nicer solution is to put the jars inside of `/lib` and have some fancier check for the jar location in the compute-classpath script. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
Patrick Wendell created SPARK-1680: -- Summary: Clean up use of setExecutorEnvs in SparkConf Key: SPARK-1680 URL: https://issues.apache.org/jira/browse/SPARK-1680 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This was added in 0.9.0 but the config change removed propagation of these values to executors. We should make one of two decisions: 1. Don't allow setting arbitrary executor envs in standalone mode. 2. Document this option, respect the env variables when launching a job, and consolidate this with SPARK_YARN_USER_ENV. -- This message was sent by Atlassian JIRA (v6.2#6252)
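For context, a minimal sketch of the API in question: {{SparkConf.setExecutorEnv}} already stores values under {{spark.executorEnv.*}} keys, and option 2 above would mean the standalone launch path actually reads them back (the consolidation with SPARK_YARN_USER_ENV is left out here):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("EnvDemo")
  // stored internally under the key spark.executorEnv.MY_LIB_PATH
  .setExecutorEnv("MY_LIB_PATH", "/opt/native/lib")

// Option 2 implies the launcher iterates these and exports them into each
// executor's environment before starting the JVM:
val executorEnvs: Seq[(String, String)] = conf.getExecutorEnv
{code}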
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985219#comment-13985219 ] Patrick Wendell commented on SPARK-922: --- This is no longer a blocker now that we've downgraded the python dependency, but would still be nice to have. > Update Spark AMI to Python 2.7 > -- > > Key: SPARK-922 > URL: https://issues.apache.org/jira/browse/SPARK-922 > Project: Spark > Issue Type: Task > Components: EC2, PySpark >Affects Versions: 0.9.0, 1.0.0, 0.9.1 >Reporter: Josh Rosen > Fix For: 1.1.0 > > > Many Python libraries only support Python 2.7+, so we should make Python 2.7 > the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-922: -- Priority: Major (was: Blocker) > Update Spark AMI to Python 2.7 > -- > > Key: SPARK-922 > URL: https://issues.apache.org/jira/browse/SPARK-922 > Project: Spark > Issue Type: Task > Components: EC2, PySpark >Affects Versions: 0.9.0, 1.0.0, 0.9.1 >Reporter: Josh Rosen > Fix For: 1.1.0 > > > Many Python libraries only support Python 2.7+, so we should make Python 2.7 > the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-922: -- Fix Version/s: (was: 1.0.0) 1.1.0 > Update Spark AMI to Python 2.7 > -- > > Key: SPARK-922 > URL: https://issues.apache.org/jira/browse/SPARK-922 > Project: Spark > Issue Type: Task > Components: EC2, PySpark >Affects Versions: 0.9.0, 1.0.0, 0.9.1 >Reporter: Josh Rosen >Priority: Blocker > Fix For: 1.1.0 > > > Many Python libraries only support Python 2.7+, so we should make Python 2.7 > the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1466) Pyspark doesn't check if gateway process launches correctly
[ https://issues.apache.org/jira/browse/SPARK-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1466: --- Fix Version/s: (was: 1.0.0) 1.0.1 > Pyspark doesn't check if gateway process launches correctly > --- > > Key: SPARK-1466 > URL: https://issues.apache.org/jira/browse/SPARK-1466 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Blocker > Fix For: 1.0.1 > > > If the gateway process fails to start correctly (e.g., because JAVA_HOME > isn't set correctly, there's no Spark jar, etc.), right now pyspark fails > because of a very difficult-to-understand error, where we try to parse stdout > to get the port where Spark started and there's nothing there. We should > properly catch the error, print it to the user, and exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1626) Update Spark YARN docs to use spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1626. Resolution: Duplicate > Update Spark YARN docs to use spark-submit > -- > > Key: SPARK-1626 > URL: https://issues.apache.org/jira/browse/SPARK-1626 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1492) running-on-yarn doc should use spark-submit script for examples
[ https://issues.apache.org/jira/browse/SPARK-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1492: --- Priority: Blocker (was: Major) > running-on-yarn doc should use spark-submit script for examples > --- > > Key: SPARK-1492 > URL: https://issues.apache.org/jira/browse/SPARK-1492 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Sandy Ryza >Priority: Blocker > > the spark-class script puts out lots of warnings telling users to use > spark-submit script with new options. We should update the > running-on-yarn.md docs to have examples using the spark-submit script rather > then spark-class. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1626) Update Spark YARN docs to use spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1626: --- Assignee: Sandy Ryza (was: Patrick Wendell) > Update Spark YARN docs to use spark-submit > -- > > Key: SPARK-1626 > URL: https://issues.apache.org/jira/browse/SPARK-1626 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Sandy Ryza >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1004) PySpark on YARN
[ https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1004. Resolution: Fixed Issue resolved by pull request 30 [https://github.com/apache/spark/pull/30] > PySpark on YARN > --- > > Key: SPARK-1004 > URL: https://issues.apache.org/jira/browse/SPARK-1004 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Josh Rosen >Assignee: Sandy Ryza >Priority: Blocker > Fix For: 1.0.0 > > > This is for tracking progress on supporting YARN in PySpark. > We might be able to use {{yarn-client}} mode > (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1679) In-Memory compression needs to be configurable.
Michael Armbrust created SPARK-1679: --- Summary: In-Memory compression needs to be configurable. Key: SPARK-1679 URL: https://issues.apache.org/jira/browse/SPARK-1679 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 Since we are still finding bugs in the compression code, I think we should make it configurable in SparkConf and turn it off by default for the 1.0 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
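A minimal sketch of what such a switch could look like; the config key here is hypothetical, not the actual implementation:
{code}
import org.apache.spark.SparkConf

// Hypothetical key: the point is only that in-memory columnar compression
// becomes a SparkConf-driven flag that defaults to off for 1.0.
val conf = new SparkConf()
val useCompression =
  conf.getBoolean("spark.sql.inMemoryColumnarStorage.compressed", defaultValue = false)
{code}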
[jira] [Created] (SPARK-1678) Compression loses repeated values.
Michael Armbrust created SPARK-1678: --- Summary: Compression loses repeated values. Key: SPARK-1678 URL: https://issues.apache.org/jira/browse/SPARK-1678 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 Here's a test case: {code} test("all the same strings") { sparkContext.parallelize(1 to 1000).map(_ => StringData("test")).registerAsTable("test1000") assert(sql("SELECT * FROM test1000").count() === 1000) cacheTable("test1000") assert(sql("SELECT * FROM test1000").count() === 1000) } {code} The first assert passes; the second one fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1677) Allow users to avoid Hadoop output checks if desired
[ https://issues.apache.org/jira/browse/SPARK-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1677: --- Issue Type: Improvement (was: Bug) > Allow users to avoid Hadoop output checks if desired > > > Key: SPARK-1677 > URL: https://issues.apache.org/jira/browse/SPARK-1677 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > For compatibility with older versions of Spark it would be nice to have an > option `spark.hadoop.validateOutputSpecs` and a description "If set to true, > validates the output specification used in saveAsHadoopFile and other > variants. This can be disabled to silence exceptions due to pre-existing > output directories." > This would just wrap the checking done in this PR: > https://issues.apache.org/jira/browse/SPARK-1100 > https://github.com/apache/spark/pull/11 > By first checking the spark conf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1677) Allow users to avoid Hadoop output checks if desired
Patrick Wendell created SPARK-1677: -- Summary: Allow users to avoid Hadoop output checks if desired Key: SPARK-1677 URL: https://issues.apache.org/jira/browse/SPARK-1677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Patrick Wendell For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` and a description "If set to true, validates the output specification used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing output directories." This would just wrap the checking done in this PR: https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 By first checking the spark conf. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1677) Allow users to avoid Hadoop output checks if desired
[ https://issues.apache.org/jira/browse/SPARK-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1677: --- Description: For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true) and a description "If set to true, validates the output specification used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing output directories." This would just wrap the checking done in this PR: https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 By first checking the spark conf. was: For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` and a description "If set to true, validates the output specification used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing output directories." This would just wrap the checking done in this PR: https://issues.apache.org/jira/browse/SPARK-1100 https://github.com/apache/spark/pull/11 By first checking the spark conf. > Allow users to avoid Hadoop output checks if desired > > > Key: SPARK-1677 > URL: https://issues.apache.org/jira/browse/SPARK-1677 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > For compatibility with older versions of Spark it would be nice to have an > option `spark.hadoop.validateOutputSpecs` (default true) and a description > "If set to true, validates the output specification used in saveAsHadoopFile > and other variants. This can be disabled to silence exceptions due to > pre-existing output directories." > This would just wrap the checking done in this PR: > https://issues.apache.org/jira/browse/SPARK-1100 > https://github.com/apache/spark/pull/11 > By first checking the spark conf. -- This message was sent by Atlassian JIRA (v6.2#6252)
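A sketch of how the wrapping could look; the helper below is illustrative, assuming the SPARK-1100 check can be passed in as a by-name block:
{code}
import org.apache.spark.SparkConf

// Illustrative sketch: gate the output-spec validation added in SPARK-1100
// (https://github.com/apache/spark/pull/11) behind the proposed flag.
def maybeValidateOutputSpecs(conf: SparkConf)(validate: => Unit): Unit = {
  if (conf.getBoolean("spark.hadoop.validateOutputSpecs", defaultValue = true)) {
    validate // e.g. the check that throws when the output directory already exists
  }
}
{code}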
[jira] [Created] (SPARK-1676) HDFS FileSystems continually pile up in the FS cache
Aaron Davidson created SPARK-1676: - Summary: HDFS FileSystems continually pile up in the FS cache Key: SPARK-1676 URL: https://issues.apache.org/jira/browse/SPARK-1676 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 0.9.1 Reporter: Aaron Davidson Priority: Critical Due to HDFS-3545, FileSystem.get() always produces (and caches) a new FileSystem when provided with a new UserGroupInformation (UGI), even if the UGI represents the same user as another UGI. This causes a buildup of FileSystem objects at an alarming rate, often one per task for something like sc.textFile(). The bug is especially hard-hitting for NativeS3FileSystem, which also maintains an open connection to S3, clogging up system file handles. The bug was introduced in https://github.com/apache/spark/pull/29, where doAs was made the default behavior. A fix is not forthcoming for the general case, as UGIs do not cache well, but this problem can lead to Spark clusters entering a failed state and requiring that executors be restarted. -- This message was sent by Atlassian JIRA (v6.2#6252)
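A minimal sketch of the pile-up using real Hadoop APIs (URI and user name are illustrative): each iteration creates a distinct UGI for the same user, and because the FileSystem cache keys on UGI identity rather than user equality, every iteration caches a fresh FileSystem, which is roughly what happens once per task:
{code}
import java.net.URI
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
for (_ <- 1 to 1000) {
  // a brand-new UGI instance for the *same* user on every iteration
  val ugi = UserGroupInformation.createRemoteUser("sparkuser")
  ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
    // HDFS-3545: each new UGI yields (and caches) a new FileSystem object
    override def run(): FileSystem = FileSystem.get(new URI("hdfs://nn:8020"), conf)
  })
}
{code}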
[jira] [Created] (SPARK-1675) Make clear whether computePrincipalComponents centers data
Sandy Ryza created SPARK-1675: - Summary: Make clear whether computePrincipalComponents centers data Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1661) the result of querying table created with RegexSerDe is all null
[ https://issues.apache.org/jira/browse/SPARK-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1661. - Resolution: Won't Fix > the result of querying table created with RegexSerDe is all null > > > Key: SPARK-1661 > URL: https://issues.apache.org/jira/browse/SPARK-1661 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 0.9.0 > Environment: linux 2.6.32-358.el6.x86_64,Hive 12.0,shark 0.9.0,Hadoop > 2.2.0 >Reporter: likunjian > Labels: HQL, hadoop, hive, regex, shark > Attachments: log.txt > > Original Estimate: 168h > Remaining Estimate: 168h > > the result of querying table created with RegexSerDe is all null > when i query the table created with > org.apache.hadoop.hive.contrib.serde2.RegexSerDe by shark,the columns in the > result is all null > select * from access_log where logdate='2014-04-28' limit 10; > OK > ip hosttimemethod request protocolstatus size > referer cookieuid requesttime session httpxrequestedwith agent > upstreamresponsetimelogdate > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > Time taken: 4.362 seconds > my regex is > ^([^ ]*) [^ ]* ([^ ]*) \\[([^\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) > (-|[0-9]*) \"(\.\+\?|-)\" ([^ ]*) ([^ ]*) ([^ ]*) \"(\.\+\?|-)\" > \"(\.\+\?|-)\" \"(\.\+\?|-)\"$ > nginx log example: > 42.49.44.61 - www..comm [20/Apr/2014:23:58:03 +0800] "GET /x/296837 > HTTP/1.1" 200 3871 "http://www.x.com/x/296837"; - 0.015 > 63hbb4om2cvtjs0f7d969n1uf4 "com.x.browser" "Mozilla/5.0 (Linux; U; x > 4.1.2; zh-cn; ZTE N919 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) > Version/4.0 Mobile Safari/534.30" "0.015" > 111.121.176.149 - www..comm [20/Apr/2014:23:58:03 +0800] "GET > /x/264904 HTTP/1.1" 200 3827 > "http://m.baidu.com/s?from=2001a&bd_page_type=1&word=%E8%8E%B2%E8%97%95%E6%80%8E%E6%A0%B7%E5%8D%A4%E6%89%8D%E5%A5%BD%E5%90%83"; > - 0.015 ft7tr4b06b23ub9lnugdf4gcq3 "-" "Mozilla/5.0 (Linux; U; x 4.1.2; > zh-CN; 8190Q Build/JZO54K) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 > UCBrowser/9.5.2.394 U3/0.8.0 Mobile Safari/533.1" "0.015" > 222.209.97.169 - www..comm [20/Apr/2014:23:58:04 +0800] "GET / HTTP/1.1" > 200 3188 "http://m.idea123.cn/food.html"; - 0.014 - "-" "Lenovo S890/S100 > Linux/3.0.13 x/4.0.3 Release/12.12.2011 Browser/AppleWebKit534.30 > Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30" "0.014" > 59.36.84.241 - www..comm [20/Apr/2014:23:58:05 +0800] "GET > /app/x/topic/view.php?id=138555 HTTP/1.1" 200 3151 "-" - 0.009 - "-" > "Mozilla/5.0 (Linux; U; x 2.3.7; zh-cn; TD500 Build/GWK74) > AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" > "0.009" > 113.242.39.81 - www..comm [20/Apr/2014:23:58:07 +0800] "GET /x/419691 > HTTP/1.1" 200 4174 "http://www..comm/x/all/308?p=3"; - 0.013 > 1n579ukg1gho7i7mr3q8ic8j97 "-" 
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X > 10_5_7; en-us) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 > Safari/530.17; 360browser(securitypay,securityinstalled); > 360(x,uppayplugin); 360 Aphone Browser (5.3.1)" "0.013" > Very strange, I execute a query in Hive is normal. I really do not > understand. . . :-( > OK > ip hosttimemethod request protocolstatus size > referer cookieuid requesttime session httpxrequestedw
[jira] [Commented] (SPARK-1661) the result of querying table created with RegexSerDe is all null
[ https://issues.apache.org/jira/browse/SPARK-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985055#comment-13985055 ] Michael Armbrust commented on SPARK-1661: - Thanks for your report. This JIRA is for reporting bugs with Spark and its components. Shark is a separate project, and issues with older versions of Shark should probably be filed on the Shark issue tracker. However, I did add a test to make sure the RegexSerDe was working with Spark SQL (a nearly-from-scratch rewrite of Shark that will be included in the 1.0 release of Spark as an Alpha component). If you find you are still having problems with Spark SQL, please reopen this issue. New Spark tests: https://github.com/apache/spark/pull/595 > the result of querying table created with RegexSerDe is all null > > > Key: SPARK-1661 > URL: https://issues.apache.org/jira/browse/SPARK-1661 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 0.9.0 > Environment: linux 2.6.32-358.el6.x86_64,Hive 12.0,shark 0.9.0,Hadoop > 2.2.0 >Reporter: likunjian > Labels: HQL, hadoop, hive, regex, shark > Attachments: log.txt > > Original Estimate: 168h > Remaining Estimate: 168h > > the result of querying table created with RegexSerDe is all null > when i query the table created with > org.apache.hadoop.hive.contrib.serde2.RegexSerDe by shark,the columns in the > result is all null > select * from access_log where logdate='2014-04-28' limit 10; > OK > ip hosttimemethod request protocolstatus size > referer cookieuid requesttime session httpxrequestedwith agent > upstreamresponsetimelogdate > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL > NULLNULLNULLNULLNULL2014-04-28 > Time taken: 4.362 seconds > my regex is > ^([^ ]*) [^ ]* ([^ ]*) \\[([^\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) > (-|[0-9]*) \"(\.\+\?|-)\" ([^ ]*) ([^ ]*) ([^ ]*) \"(\.\+\?|-)\" > \"(\.\+\?|-)\" \"(\.\+\?|-)\"$ > nginx log example: > 42.49.44.61 - www..comm [20/Apr/2014:23:58:03 +0800] "GET /x/296837 > HTTP/1.1" 200 3871 "http://www.x.com/x/296837"; - 0.015 > 63hbb4om2cvtjs0f7d969n1uf4 "com.x.browser" "Mozilla/5.0 (Linux; U; x > 4.1.2; zh-cn; ZTE N919 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) > Version/4.0 Mobile Safari/534.30" "0.015" > 111.121.176.149 - www..comm [20/Apr/2014:23:58:03 +0800] "GET > /x/264904 HTTP/1.1" 200 3827 > "http://m.baidu.com/s?from=2001a&bd_page_type=1&word=%E8%8E%B2%E8%97%95%E6%80%8E%E6%A0%B7%E5%8D%A4%E6%89%8D%E5%A5%BD%E5%90%83"; > - 0.015 ft7tr4b06b23ub9lnugdf4gcq3 "-" "Mozilla/5.0 (Linux; U; x 4.1.2; > zh-CN; 8190Q Build/JZO54K) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 > UCBrowser/9.5.2.394 U3/0.8.0 Mobile Safari/533.1" "0.015" > 222.209.97.169 - www..comm [20/Apr/2014:23:58:04 +0800] "GET / HTTP/1.1" > 200 3188 "http://m.idea123.cn/food.html"; - 0.014 - "-" "Lenovo S890/S100 > 
Linux/3.0.13 x/4.0.3 Release/12.12.2011 Browser/AppleWebKit534.30 > Profile/MIDP-2.0 Configuration/CLDC-1.1 Mobile Safari/534.30" "0.014" > 59.36.84.241 - www..comm [20/Apr/2014:23:58:05 +0800] "GET > /app/x/topic/view.php?id=138555 HTTP/1.1" 200 3151 "-" - 0.009 - "-" > "Mozilla/5.0 (Linux; U; x 2.3.7; zh-cn; TD500 Build/GWK74) > AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" > "0.009" > 113.242.39.81 - www..comm [20/Apr/2014:23:58:07 +0800] "GET
[jira] [Resolved] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe
[ https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1674. -- Resolution: Fixed Fix Version/s: 1.0.0 > Interrupted system call error in pyspark's RDD.pipe > --- > > Key: SPARK-1674 > URL: https://issues.apache.org/jira/browse/SPARK-1674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.0.0 > > > RDD.pipe's doctest throws interrupted system call exception on Mac. It can be > fixed by wrapping pipe.stdout.readline in an iterator. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-544) Provide a Configuration class in addition to system properties
[ https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-544. - Resolution: Fixed Fix Version/s: 0.9.0 > Provide a Configuration class in addition to system properties > -- > > Key: SPARK-544 > URL: https://issues.apache.org/jira/browse/SPARK-544 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia >Assignee: Matei Zaharia > Fix For: 0.9.0 > > > This is a much better option for people who want to connect to multiple Spark > clusters in the same program, and for unit tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-544) Provide a Configuration class in addition to system properties
[ https://issues.apache.org/jira/browse/SPARK-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-544: --- Assignee: Matei Zaharia (was: Evan Chan) > Provide a Configuration class in addition to system properties > -- > > Key: SPARK-544 > URL: https://issues.apache.org/jira/browse/SPARK-544 > Project: Spark > Issue Type: New Feature >Reporter: Matei Zaharia >Assignee: Matei Zaharia > Fix For: 0.9.0 > > > This is a much better option for people who want to connect to multiple Spark > clusters in the same program, and for unit tests. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1268) Adding XOR and AND-NOT operations to spark.util.collection.BitSet
[ https://issues.apache.org/jira/browse/SPARK-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1268. -- Resolution: Fixed Fix Version/s: 1.0.0 > Adding XOR and AND-NOT operations to spark.util.collection.BitSet > - > > Key: SPARK-1268 > URL: https://issues.apache.org/jira/browse/SPARK-1268 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Petko Nikolov >Priority: Minor > Labels: starter > Fix For: 1.0.0 > > > BitSet collection is missing some important bit-wise operations. Symmetric > difference (xor) in particular is useful for computing some distance metrics > (e.g. Hamming). Difference (and-not) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-615) Add mapPartitionsWithIndex() to the Java API
[ https://issues.apache.org/jira/browse/SPARK-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-615. - Resolution: Fixed Fix Version/s: 1.0.0 > Add mapPartitionsWithIndex() to the Java API > > > Key: SPARK-615 > URL: https://issues.apache.org/jira/browse/SPARK-615 > Project: Spark > Issue Type: New Feature > Components: Java API >Affects Versions: 0.6.0 >Reporter: Josh Rosen >Assignee: Holden Karau >Priority: Minor > Labels: Starter > Fix For: 1.0.0 > > > We should add {{mapPartitionsWithIndex()}} to the Java API. > What should the interface for this look like? We could require the user to > pass in a {{FlatMapFunction[(Int, Iterator[T])]}}, but this requires them to > unpack the tuple from Java. It would be nice if the UDF had a signature like > {{f(int partition, Iterator[T] iterator)}}, but this will require defining a > new set of {{Function}} classes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985007#comment-13985007 ] Vlad Frolov commented on SPARK-1394: [~idanzalz] Unfortunately, that only helped avoid one exception, so I commented out the signal binding in PySpark and these crashes went away. I hope it will be fixed somehow in the next Spark release. > calling system.platform on worker raises IOError > > > Key: SPARK-1394 > URL: https://issues.apache.org/jira/browse/SPARK-1394 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0 > Environment: Tested on Ubuntu and Linux, local and remote master, > python 2.7.* >Reporter: Idan Zalzberg > Labels: pyspark > > A simple program that calls system.platform() on the worker fails most of the > time (it works some times but very rarely). > This is critical since many libraries call that method (e.g. boto). > Here is the trace of the attempt to call that method: > $ /usr/local/spark/bin/pyspark > Python 2.7.3 (default, Feb 27 2014, 20:00:17) > [GCC 4.6.3] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback > address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1) > 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started > 14/04/02 18:18:38 INFO Remoting: Starting remoting > 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@10.33.102.46:36640] > 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster > 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140402181839-919f > 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 > MB. > 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id > = ConnectionManagerId(10.33.102.46,43357) > 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager > 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering > block manager 10.33.102.46:43357 with 294.6 MB RAM > 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at > http://10.33.102.46:51803 > 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker > 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b > 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server > 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at > http://10.33.102.46:4040 > 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 0.9.0 > /_/ > Using Python version 2.7.3 (default, Feb 27 2014 20:00:17) > Spark context available as sc. 
> >>> import platform > >>> sc.parallelize([1]).map(lambda x : platform.system()).collect() > 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at :1 > 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at :1) with 1 > output partitions (allowLocal=false) > 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at > :1) > 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List() > 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List() > 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at > collect at :1), which has no missing parents > 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 > (PythonRDD[1] at collect at :1) > 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks > 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on > executor localhost: localhost (PROCESS_LOCAL) > 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in > 12 ms > 14/04/02 18:19:17 INFO Executor: Running task ID 0 > PySpark worker failed with exception: > Traceback (most recent call last): > File "/usr/local/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/local/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.seriali
[jira] [Commented] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe
[ https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985000#comment-13985000 ] Xiangrui Meng commented on SPARK-1674: -- PR: https://github.com/apache/spark/pull/594 > Interrupted system call error in pyspark's RDD.pipe > --- > > Key: SPARK-1674 > URL: https://issues.apache.org/jira/browse/SPARK-1674 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > RDD.pipe's doctest throws interrupted system call exception on Mac. It can be > fixed by wrapping pipe.stdout.readline in an iterator. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1373) Compression for In-Memory Columnar storage
[ https://issues.apache.org/jira/browse/SPARK-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984997#comment-13984997 ] Michael Armbrust commented on SPARK-1373: - Note that the code is in large part adapted from Shark, including: https://github.com/amplab/shark/blob/master/src/test/scala/shark/memstore2/column/CompressionAlgorithmSuite.scala > Compression for In-Memory Columnar storage > -- > > Key: SPARK-1373 > URL: https://issues.apache.org/jira/browse/SPARK-1373 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe
Xiangrui Meng created SPARK-1674: Summary: Interrupted system call error in pyspark's RDD.pipe Key: SPARK-1674 URL: https://issues.apache.org/jira/browse/SPARK-1674 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng RDD.pipe's doctest throws interrupted system call exception on Mac. It can be fixed by wrapping pipe.stdout.readline in an iterator. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1673) GLMNET implementation in Spark
Sung Chung created SPARK-1673: - Summary: GLMNET implementation in Spark Key: SPARK-1673 URL: https://issues.apache.org/jira/browse/SPARK-1673 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Sung Chung This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, Rob Tibshirani. http://www.jstatsoft.org/v33/i01/paper It's a straightforward implementation of coordinate-descent-based L1/L2-regularized linear models, including linear, logistic, and multinomial regression. -- This message was sent by Atlassian JIRA (v6.2#6252)
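For reference, the heart of GLMNET's coordinate descent is a per-coordinate soft-thresholding update. A minimal lasso-only sketch, assuming standardized features (all names here are illustrative, not from the actual patch):
{code}
// Soft-thresholding operator S(z, gamma) used by GLMNET's coordinate descent.
def softThreshold(z: Double, gamma: Double): Double =
  math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

// One lasso update for coordinate j, assuming each column of x is
// standardized (mean 0, unit variance): beta_j <- S(rho_j / n, lambda).
def updateCoordinate(j: Int, x: Array[Array[Double]], y: Array[Double],
                     beta: Array[Double], lambda: Double): Double = {
  val n = y.length
  var rho = 0.0
  var i = 0
  while (i < n) {
    // partial residual: prediction without coordinate j's contribution
    var pred = 0.0
    var k = 0
    while (k < beta.length) {
      if (k != j) pred += x(i)(k) * beta(k)
      k += 1
    }
    rho += x(i)(j) * (y(i) - pred)
    i += 1
  }
  softThreshold(rho / n, lambda)
}
{code}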
[jira] [Created] (SPARK-1672) Support separate partitioners (and numbers of partitions) for users and products
Tor Myklebust created SPARK-1672: Summary: Support separate partitioners (and numbers of partitions) for users and products Key: SPARK-1672 URL: https://issues.apache.org/jira/browse/SPARK-1672 Project: Spark Issue Type: Improvement Reporter: Tor Myklebust Priority: Minor The user ought to be able to specify a partitioning of his data if he knows a good one. It's convenient to have separate partitioners for users and products so that no strange mapping step needs to happen. It may also be reasonable to partition the users and products into different numbers of partitions (for instance, to balance memory requirements) if the dataset is tall, thin, and very sparse. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1672) Support separate partitioners (and numbers of partitions) for users and products
[ https://issues.apache.org/jira/browse/SPARK-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tor Myklebust updated SPARK-1672: - Component/s: MLlib > Support separate partitioners (and numbers of partitions) for users and > products > > > Key: SPARK-1672 > URL: https://issues.apache.org/jira/browse/SPARK-1672 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tor Myklebust >Priority: Minor > > The user ought to be able to specify a partitioning of his data if he knows a > good one. It's convenient to have separate partitioners for users and > products so that no strange mapping step needs to happen. > It may also be reasonable to partition the users and products into different > numbers of partitions (for instance, to balance memory requirements) if the > dataset is tall, thin, and very sparse. -- This message was sent by Atlassian JIRA (v6.2#6252)
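A sketch of what the proposal would let callers do: key the same ratings by user and by product under different partitioners and partition counts (file name and counts are illustrative):
{code}
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark
import org.apache.spark.mllib.recommendation.Rating

def partitionRatings(sc: SparkContext) = {
  val ratings = sc.textFile("ratings.csv").map { line =>
    val Array(u, p, r) = line.split(',')
    Rating(u.toInt, p.toInt, r.toDouble)
  }
  // tall, thin matrix: many users, few products => different partition counts
  val byUser    = ratings.map(r => (r.user, r)).partitionBy(new HashPartitioner(128))
  val byProduct = ratings.map(r => (r.product, r)).partitionBy(new HashPartitioner(16))
  (byUser, byProduct)
}
{code}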
[jira] [Created] (SPARK-1671) Cached tables should follow write-through policy
Cheng Lian created SPARK-1671: - Summary: Cached tables should follow write-through policy Key: SPARK-1671 URL: https://issues.apache.org/jira/browse/SPARK-1671 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Writing (insert / load) to a cached table causes cache inconsistency, and the user has to unpersist and cache the whole table again. The write-through policy may be implemented with {{RDD.union}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
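A rough sketch of the union-based write-through idea at the RDD level (illustrative only; the real change would live in the SQL layer's table cache):
{code}
import org.apache.spark.rdd.RDD

// Append each written batch to the cached data instead of forcing a full
// unpersist + re-cache of the table.
class WriteThroughCache[T](initial: RDD[T]) {
  private var cached: RDD[T] = initial.cache()
  def read: RDD[T] = cached
  def write(batch: RDD[T]): Unit = {
    cached = cached.union(batch).cache() // the extended lineage stays cached
  }
}
{code}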
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984950#comment-13984950 ] Hari Shreedharan commented on SPARK-1645: - The first one is not exactly accurate, though it explains the idea. The second is what I suggest. As a first step, we do what is currently done: the receiver stores the data locally and acknowledges, so reliability does not improve. Later we can improve all receivers so that the data is persisted all the way to the driver (by adding a new API like storeReliably or something). We would have to do a two-step Poll-ACK process: the initial poll creates a new request, which is added to the ones pending commit in the sink. Once the receiver has written the data (for now in the current way, later reliably), it sends an ACK for the request id, which causes the request to be committed so Flume can remove the events. If the receiver does not send the ACK, a scheduled thread in the sink (with a timeout specified in the Flume config) can roll back and make the data available again (Flume already has the capability to make uncommitted txns available again if that agent fails). > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone can add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
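A very rough sketch of the sink-side Poll-ACK bookkeeping described above (all names hypothetical; a real sink would drive Flume's transaction API at the commented points):
{code}
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

class PollAckLedger(timeoutSecs: Long) {
  private val pending = new ConcurrentHashMap[String, Long]() // requestId -> deadline
  private val reaper = Executors.newSingleThreadScheduledExecutor()

  def onPoll(requestId: String): Unit =
    pending.put(requestId, System.currentTimeMillis() + timeoutSecs * 1000)

  def onAck(requestId: String): Unit = {
    pending.remove(requestId)
    // commit the Flume transaction here so the events can be removed
  }

  // Periodically roll back requests whose ACK never arrived, making the
  // events available again (mirrors Flume's handling of failed agents).
  reaper.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val now = System.currentTimeMillis()
      val it = pending.entrySet().iterator()
      while (it.hasNext) {
        val e = it.next()
        if (e.getValue < now) {
          it.remove() // roll back the transaction for e.getKey here
        }
      }
    }
  }, timeoutSecs, timeoutSecs, TimeUnit.SECONDS)
}
{code}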
[jira] [Commented] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts
[ https://issues.apache.org/jira/browse/SPARK-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984948#comment-13984948 ] Pat McDonough commented on SPARK-1670: -- FYI [~ahirreddy] [~matei], here's the pyspark issue I was talking to you guys about > PySpark Fails to Create SparkContext Due To Debugging Options in > conf/java-opts > --- > > Key: SPARK-1670 > URL: https://issues.apache.org/jira/browse/SPARK-1670 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 > Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark > Python 2.7.5 (default, Aug 25 2013, 00:04:04) > ... > IPython 1.1.0 > ... > Spark version 1.0.0-SNAPSHOT > Using Python version 2.7.5 (default, Aug 25 2013 00:04:04) >Reporter: Pat McDonough > > When JVM debugging options are in conf/java-opts, it causes pyspark to fail > when creating the SparkContext. The java-opts file looks like the following: > {code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 > {code} > Here's the error: > {code}--- > ValueErrorTraceback (most recent call last) > /Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in > execfile(fname, *where) > 202 else: > 203 filename = fname > --> 204 __builtin__.execfile(filename, *where) > /Users/pat/Projects/spark/python/pyspark/shell.py in () > 41 SparkContext.setSystemProperty("spark.executor.uri", > os.environ["SPARK_EXECUTOR_URI"]) > 42 > ---> 43 sc = SparkContext(os.environ.get("MASTER", "local[*]"), > "PySparkShell", pyFiles=add_files) > 44 > 45 print("""Welcome to > /Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, > master, appName, sparkHome, pyFiles, environment, batchSize, serializer, > conf, gateway) > 92 tempNamedTuple = namedtuple("Callsite", "function file > linenum") > 93 self._callsite = tempNamedTuple(function=None, file=None, > linenum=None) > ---> 94 SparkContext._ensure_initialized(self, gateway=gateway) > 95 > 96 self.environment = environment or {} > /Users/pat/Projects/spark/python/pyspark/context.pyc in > _ensure_initialized(cls, instance, gateway) > 172 with SparkContext._lock: > 173 if not SparkContext._gateway: > --> 174 SparkContext._gateway = gateway or launch_gateway() > 175 SparkContext._jvm = SparkContext._gateway.jvm > 176 SparkContext._writeToFile = > SparkContext._jvm.PythonRDD.writeToFile > /Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway() > 44 proc = Popen(command, stdout=PIPE, stdin=PIPE) > 45 # Determine which ephemeral port the server started on: > ---> 46 port = int(proc.stdout.readline()) > 47 # Create a thread to echo output from the GatewayServer, which is > required > 48 # for Java log output to show up: > ValueError: invalid literal for int() with base 10: 'Listening for transport > dt_socket at address: 5005\n' > {code} > Note that when you use JVM debugging, the very first line of output (e.g. > when running spark-shell) looks like this: > {code}Listening for transport dt_socket at address: 5005{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1670) PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts
Pat McDonough created SPARK-1670: Summary: PySpark Fails to Create SparkContext Due To Debugging Options in conf/java-opts Key: SPARK-1670 URL: https://issues.apache.org/jira/browse/SPARK-1670 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: pats-air:spark pat$ IPYTHON=1 bin/pyspark Python 2.7.5 (default, Aug 25 2013, 00:04:04) ... IPython 1.1.0 ... Spark version 1.0.0-SNAPSHOT Using Python version 2.7.5 (default, Aug 25 2013 00:04:04) Reporter: Pat McDonough When JVM debugging options are in conf/java-opts, it causes pyspark to fail when creating the SparkContext. The java-opts file looks like the following: {code}-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 {code} Here's the error: {code}--- ValueErrorTraceback (most recent call last) /Library/Python/2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where) 202 else: 203 filename = fname --> 204 __builtin__.execfile(filename, *where) /Users/pat/Projects/spark/python/pyspark/shell.py in () 41 SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"]) 42 ---> 43 sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files) 44 45 print("""Welcome to /Users/pat/Projects/spark/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway) 92 tempNamedTuple = namedtuple("Callsite", "function file linenum") 93 self._callsite = tempNamedTuple(function=None, file=None, linenum=None) ---> 94 SparkContext._ensure_initialized(self, gateway=gateway) 95 96 self.environment = environment or {} /Users/pat/Projects/spark/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway) 172 with SparkContext._lock: 173 if not SparkContext._gateway: --> 174 SparkContext._gateway = gateway or launch_gateway() 175 SparkContext._jvm = SparkContext._gateway.jvm 176 SparkContext._writeToFile = SparkContext._jvm.PythonRDD.writeToFile /Users/pat/Projects/spark/python/pyspark/java_gateway.pyc in launch_gateway() 44 proc = Popen(command, stdout=PIPE, stdin=PIPE) 45 # Determine which ephemeral port the server started on: ---> 46 port = int(proc.stdout.readline()) 47 # Create a thread to echo output from the GatewayServer, which is required 48 # for Java log output to show up: ValueError: invalid literal for int() with base 10: 'Listening for transport dt_socket at address: 5005\n' {code} Note that when you use JVM debugging, the very first line of output (e.g. when running spark-shell) looks like this: {code}Listening for transport dt_socket at address: 5005{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1669) SQLContext.cacheTable() should be idempotent
Cheng Lian created SPARK-1669: - Summary: SQLContext.cacheTable() should be idempotent Key: SPARK-1669 URL: https://issues.apache.org/jira/browse/SPARK-1669 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Calling {{cacheTable()}} on some table {{t}} multiple times causes table {{t}} to be cached multiple times. This differs from the semantics of {{RDD.cache()}}, which is idempotent. We can check whether a table is already cached by checking: # whether the structure of the underlying logical plan of the table matches the pattern {{Subquery(\_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _)))}} # whether {{inMem.cachedColumnBuffers.getStorageLevel.useMemory}} is true -- This message was sent by Atlassian JIRA (v6.2#6252)
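A sketch of the check, lifted directly from the two conditions above (meant to live inside SQLContext next to the Catalyst types named in the description; not necessarily the final code):
{code}
def isCached(tableName: String): Boolean =
  table(tableName).queryExecution.analyzed match {
    case Subquery(_, SparkLogicalPlan(inMem @ InMemoryColumnarTableScan(_, _))) =>
      inMem.cachedColumnBuffers.getStorageLevel.useMemory
    case _ => false
  }

// cacheTable() then becomes idempotent by short-circuiting:
def cacheTable(tableName: String): Unit =
  if (!isCached(tableName)) {
    // existing caching logic
  }
{code}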
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984927#comment-13984927 ] Tathagata Das commented on SPARK-1645: -- Ah, I think I get it now. So instead of the default push-based setup as it is now (where a sink runs with the receiver), you simply want to make it pull-based. So if the current situation is this !http://i.imgur.com/m8oiOwl.png?1! you propose this !http://i.imgur.com/N6Ee1cb.png?1! Right? Assuming that is right, it does make things very convenient for Spark Streaming's receivers. However, what does it mean for reliable receiving? When the receiver pulls the data from the source, will it acknowledge the source only once Spark acknowledges that it has reliably saved the data? > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone can add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1515) Specialized ColumnTypes for Array, Map and Struct
[ https://issues.apache.org/jira/browse/SPARK-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1515: Labels: compression (was: ) > Specialized ColumnTypes for Array, Map and Struct > - > > Key: SPARK-1515 > URL: https://issues.apache.org/jira/browse/SPARK-1515 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > Labels: compression > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1513) Specialized ColumnType for Timestamp
[ https://issues.apache.org/jira/browse/SPARK-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1513: Labels: compression (was: ) > Specialized ColumnType for Timestamp > > > Key: SPARK-1513 > URL: https://issues.apache.org/jira/browse/SPARK-1513 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > Labels: compression > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1512) improve spark sql to support table with more than 22 fields
[ https://issues.apache.org/jira/browse/SPARK-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1512. - Resolution: Fixed > improve spark sql to support table with more than 22 fields > --- > > Key: SPARK-1512 > URL: https://issues.apache.org/jira/browse/SPARK-1512 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: wangfei > Fix For: 1.0.0 > > > Spark SQL uses case classes to define tables, so the 22-field limit in case > classes means Spark SQL cannot support wide (more than 22 fields) tables. Wide > tables are common in many cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1512) improve spark sql to support table with more than 22 fields
[ https://issues.apache.org/jira/browse/SPARK-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984922#comment-13984922 ] Michael Armbrust commented on SPARK-1512: - Now that we have updated the docs to talk about creating custom Product classes, I'm going to mark this as resolved. > improve spark sql to support table with more than 22 fields > --- > > Key: SPARK-1512 > URL: https://issues.apache.org/jira/browse/SPARK-1512 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: wangfei > Fix For: 1.0.0 > > > Spark SQL uses case classes to define tables, so the 22-field limit in case > classes means Spark SQL cannot support wide (more than 22 fields) tables. Wide > tables are common in many cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1610) Cast from BooleanType to NumericType should use exact type value.
[ https://issues.apache.org/jira/browse/SPARK-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1610. - Resolution: Fixed Fix Version/s: 1.0.0 > Cast from BooleanType to NumericType should use exact type value. > - > > Key: SPARK-1610 > URL: https://issues.apache.org/jira/browse/SPARK-1610 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > Fix For: 1.0.0 > > > Cast from BooleanType to NumericType are all using Int value. > But it causes ClassCastException when the casted value is used by the > following evaluation like the code below: > {quote} > scala> import org.apache.spark.sql.catalyst._ > import org.apache.spark.sql.catalyst._ > scala> import types._ > import types._ > scala> import expressions._ > import expressions._ > scala> Add(Cast(Literal(true), ShortType), Literal(1.toShort)).eval() > java.lang.ClassCastException: java.lang.Integer cannot be cast to > java.lang.Short > at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102) > at scala.math.Numeric$ShortIsIntegral$.plus(Numeric.scala:72) > at > org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58) > at > org.apache.spark.sql.catalyst.expressions.Add$$anonfun$eval$2.apply(arithmetic.scala:58) > at > org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114) > at > org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58) > at .(:17) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734) > at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983) > at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604) > at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568) > at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760) > at > scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805) > at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717) > at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581) > at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588) > at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837) > at > scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:83) > at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:96) > at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:105) > at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala) > {quote} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1608) Cast.nullable should be true when cast from StringType to NumericType/TimestampType
[ https://issues.apache.org/jira/browse/SPARK-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1608. - Resolution: Fixed Fix Version/s: 1.0.0 > Cast.nullable should be true when cast from StringType to > NumericType/TimestampType > --- > > Key: SPARK-1608 > URL: https://issues.apache.org/jira/browse/SPARK-1608 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > Fix For: 1.0.0 > > > Cast.nullable should be true when casting from StringType to NumericType or > TimestampType, because if a StringType expression holds an illegal number > string or an illegal timestamp string, the cast value becomes null. -- This message was sent by Atlassian JIRA (v6.2#6252)
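A short illustration of why the cast must report nullable = true, sketched against the Catalyst expression API that appears elsewhere in this digest (exact eval signatures may differ by version):

{code}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.catalyst.types.IntegerType

// "abc" cannot be parsed as an Int, so the cast evaluates to null even
// though the input literal is non-null; hence Cast.nullable must be true.
val result = Cast(Literal("abc"), IntegerType).eval()  // expected: null
{code}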
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984909#comment-13984909 ] Michael Armbrust commented on SPARK-1649: - Oh, I see. I forgot that we would also need this inside of ArrayType. Also, for MapType it seems like it only matters for the value, not the key, as I'm not sure we would allow null keys. This is something we need to consider. However, I think I'm going to change the title to something less prescriptive. Could we just say for now that null values are not supported in arrays in Parquet files? > Figure out Nullability semantics for Array elements and Map values > -- > > Key: SPARK-1649 > URL: https://issues.apache.org/jira/browse/SPARK-1649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Andre Schumacher >Priority: Critical > > For the underlying storage layer it would simplify things such as schema > conversions, predicate filter determination and such to record in the data > type itself whether a column can be nullable. So the DataType type could look > like this: > abstract class DataType(nullable: Boolean = true) > Concrete subclasses could then override the nullable val. Mostly this could > be left as the default, but when types can be contained in nested types one > could optimize for, e.g., arrays with elements that are nullable and those > that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
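One direction consistent with the comment above is to record nullability on the container types themselves rather than on every DataType. A hypothetical sketch (field names are illustrative, not a settled API):

{code}
// Per-element and per-value nullability carried by the container type;
// map keys are assumed non-null, matching the discussion above.
sealed abstract class DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean = true)
  extends DataType
case class MapType(keyType: DataType, valueType: DataType,
    valueContainsNull: Boolean = true) extends DataType
{code}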
[jira] [Updated] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1649: Summary: Figure out Nullability semantics for Array elements and Map values (was: DataType should contain nullable bit) > Figure out Nullability semantics for Array elements and Map values > -- > > Key: SPARK-1649 > URL: https://issues.apache.org/jira/browse/SPARK-1649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Andre Schumacher >Priority: Critical > > For the underlying storage layer it would simplify things such as schema > conversions, predicate filter determination and such to record in the data > type itself whether a column can be nullable. So the DataType type could look > like this: > abstract class DataType(nullable: Boolean = true) > Concrete subclasses could then override the nullable val. Mostly this could > be left as the default, but when types can be contained in nested types one > could optimize for, e.g., arrays with elements that are nullable and those > that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1667) Should re-fetch when intermediate data for shuffle is lost
[ https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984900#comment-13984900 ] Kousuke Saruta commented on SPARK-1667: --- Now I'm trying to address this issue. > Should re-fetch when intermediate data for shuffle is lost > -- > > Key: SPARK-1667 > URL: https://issues.apache.org/jira/browse/SPARK-1667 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta > > I hit a case where a re-fetch should have occurred but did not. > When intermediate shuffle data (the physical file on an executor's local file > system) is lost, a FileNotFoundException is thrown and no re-fetch occurs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984816#comment-13984816 ] Hari Shreedharan commented on SPARK-1645: - No, the Flume source and sink reside within the same JVM (http://flume.apache.org/FlumeUserGuide.html#architecture). So the receiver polls the Flume sink running on a different node (the node that runs the Flume agent pushing the data). If the node running the receiver goes down, then another worker starts up and reads from the same Flume agent. If the Flume agent goes down, the receiver polls and fails to get data until the agent is back up. > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984800#comment-13984800 ] Tathagata Das commented on SPARK-1645: -- But this does not solve the scenario where the whole worker running the receiver dies. If the worker dies, then the receiver and sink are all gone, and the Flume source has nowhere to send the data to, isn't it? As far as I understand, the only way to deal with a worker failure is to configure a pool of workers as sinks. If one of the sinks doesn't work because its worker failed, the second sink on the second worker can still receive data. Am I missing something? > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984780#comment-13984780 ] Hari Shreedharan commented on SPARK-1645: - No, the sink would run inside the Flume agent that Spark is receiving data from. (The sink is a Flume component that pushes data out - this is managed by Flume.) Basically, this sink pulls data from the Flume agent's buffer when the Spark receiver polls it. If the receiver dies and restarts, as long as the receiver knows which agent to poll, it will be able to get the data. This solves the case where Flume is pushing data to a receiver which may have died and restarted elsewhere, since Spark now polls Flume. > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984768#comment-13984768 ] Tathagata Das commented on SPARK-1645: -- Let me understand this. Is this sink going to run as a separate process outside the Spark executor? If it is running as a thread in the same executor process as the receiver, then that is no better than what we have now, as it will fail when the executor fails. So I am guessing it will be a process outside the executor. Doesn't that introduce the headache of managing that process separately? And what happens when the whole worker node dies? > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984734#comment-13984734 ] Hari Shreedharan commented on SPARK-1645: - Yes, so I have a rough design for that in mind. The idea is to add a sink which plugs into Flume and gets polled by the Spark receiver. That way, even if the node on which the worker is running fails, the receiver on another node can poll the sink and pull data. From the Flume point of view, the sink does not "conform" to the definition of standard sinks (all Flume sinks are push only), but it can be written such that we don't lose data. Later, if/when Flume adds support for pollable sinks, this sink can be ported. > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
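A hedged sketch of the pull-based contract being described: the sink lives inside the Flume agent, hands batches to whichever Spark receiver polls it, and commits the underlying Flume transaction only on ack (all names below are illustrative and predate any actual spark-streaming-flume-sink API):

{code}
// Contract between the in-agent sink and the polling Spark receiver.
trait PollableSparkSink {
  // Return up to maxEvents buffered events plus a batch id; the Flume
  // transaction stays open until the receiver acks or nacks the batch.
  def poll(maxEvents: Int): (Long, Seq[Array[Byte]])
  def ack(batchId: Long): Unit   // commit: the data reached Spark
  def nack(batchId: Long): Unit  // roll back so a re-poll can retry
}
{code}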
[jira] [Comment Edited] (SPARK-1569) Spark on Yarn, authentication broken by pr299
[ https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984713#comment-13984713 ] Thomas Graves edited comment on SPARK-1569 at 4/29/14 8:00 PM: --- {quote} @tgravescs ah I see, you're right. I think I assumed incorrectly that the executor launcher would bundle up the options and send them over, but I don't actually see that happening anywhere. So this part of the code is actually not used: https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328 What happens is the executor is just getting its configuration from the driver when the executor launches. And that works in most cases except for security, which it needs to know about before connecting. Is that right? {quote} That is correct. It needs it before connecting. The code you reference handles adding it for the application master but not the executors. It looks like we need similar code in ExecutorRunnableUtil.prepareCommand. Yes, previously we only had SPARK_JAVA_OPTS, which got put on the command line as -D flags, so it was always set correctly when the executors launched. was (Author: tgraves): @tgravescs ah I see, you're right. I think I assumed incorrectly that the executor launcher would bundle up the options and send them over, but I don't actually see that happening anywhere. So this part of the code is actually not used: https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328 What happens is the executor is just getting its configuration from the driver when the executor launches. And that works in most cases except for security, which it needs to know about before connecting. Is that right? That is correct. It needs it before connecting. The code you reference handles add it for the application master but not the executors. it looks like we need similar code in ExecutorRunnableUtil.prepareCommand. Yes before we only had SPARK_JAVA_OPTS and that got put as -D on the command line so it was always set correctly when the executors launched. > Spark on Yarn, authentication broken by pr299 > - > > Key: SPARK-1569 > URL: https://issues.apache.org/jira/browse/SPARK-1569 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Priority: Blocker > > https://github.com/apache/spark/pull/299 changed the way configuration was > done and passed to the executors. This breaks use of authentication as the > executor needs to know that authentication is enabled before connecting to > the driver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1569) Spark on Yarn, authentication broken by pr299
[ https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984713#comment-13984713 ] Thomas Graves commented on SPARK-1569: -- @tgravescs ah I see, you're right. I think I assumed incorrectly that the executor launcher would bundle up the options and send them over, but I don't actually see that happening anywhere. So this part of the code is actually not used: https://github.com/apache/spark/blob/df6d81425bf3b8830988288069f6863de873aee2/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L328 What happens is the executor is just getting its configuration from the driver when the executor launches. And that works in most cases except for security, which it needs to know about before connecting. Is that right? That is correct. It needs it before connecting. The code you reference handles adding it for the application master but not the executors. It looks like we need similar code in ExecutorRunnableUtil.prepareCommand. Yes, previously we only had SPARK_JAVA_OPTS, which got put on the command line as -D flags, so it was always set correctly when the executors launched. > Spark on Yarn, authentication broken by pr299 > - > > Key: SPARK-1569 > URL: https://issues.apache.org/jira/browse/SPARK-1569 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Priority: Blocker > > https://github.com/apache/spark/pull/299 changed the way configuration was > done and passed to the executors. This breaks use of authentication as the > executor needs to know that authentication is enabled before connecting to > the driver. -- This message was sent by Atlassian JIRA (v6.2#6252)
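A hedged sketch of the fix direction named above: mirror the application-master handling in ExecutorRunnableUtil.prepareCommand so that security settings reach the executor JVM as -D flags before it connects back to the driver (the spark.authenticate key prefix and the helper itself are assumptions of this sketch):

{code}
import org.apache.spark.SparkConf

// Render auth-related settings as JVM system properties for the executor
// launch command; everything else can still arrive later from the driver.
def securityJavaOpts(conf: SparkConf): Seq[String] =
  conf.getAll.collect {
    case (key, value) if key.startsWith("spark.authenticate") =>
      s"-D$key=$value"
  }.toSeq
{code}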
[jira] [Resolved] (SPARK-1588) SPARK_JAVA_OPTS and SPARK_YARN_USER_ENV are not getting propagated
[ https://issues.apache.org/jira/browse/SPARK-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1588. Resolution: Fixed Fix Version/s: 1.0.0 > SPARK_JAVA_OPTS and SPARK_YARN_USER_ENV are not getting propagated > -- > > Key: SPARK-1588 > URL: https://issues.apache.org/jira/browse/SPARK-1588 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.0.0 >Reporter: Mridul Muralidharan >Assignee: Sandy Ryza >Priority: Blocker > Fix For: 1.0.0 > > > We could previously use SPARK_JAVA_OPTS to pass JAVA_OPTS for use in the > master. This no longer works in the current master branch. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS
Xiangrui Meng created SPARK-1668: Summary: Add implicit preference as an option to examples/MovieLensALS Key: SPARK-1668 URL: https://issues.apache.org/jira/browse/SPARK-1668 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Minor Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/. For evaluation, we should map ratings to the range [0, 1] and compare them with predictions. It would be better if we also added unobserved ratings (assuming negatives) to the evaluation. -- This message was sent by Atlassian JIRA (v6.2#6252)
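A rough sketch of the proposed evaluation, assuming MLlib's ALS.trainImplicit and a linear mapping of 1-5 star ratings onto [0, 1] (the mapping, the fixed rank/iterations, and the RMSE metric are illustrative choices, not settled by this JIRA):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def evaluateImplicit(training: RDD[Rating], test: RDD[Rating]): Double = {
  val model = ALS.trainImplicit(training, 10, 10)  // rank 10, 10 iterations
  // Map explicit 1-5 star ratings onto [0, 1] targets as described above.
  val targets = test.map(r => ((r.user, r.product), (r.rating - 1.0) / 4.0))
  val predictions = model
    .predict(targets.map(_._1))
    .map(p => ((p.user, p.product), p.rating))
  // RMSE between the mapped ratings and the model's predictions.
  math.sqrt(targets.join(predictions).values
    .map { case (t, p) => (t - p) * (t - p) }
    .mean())
}
{code}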
[jira] [Commented] (SPARK-1645) Improve Spark Streaming compatibility with Flume
[ https://issues.apache.org/jira/browse/SPARK-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984654#comment-13984654 ] Tathagata Das commented on SPARK-1645: -- Yes, we will keep you posted. Though one thing that is reasonably independent is to add the ability for Flume receivers to be launched on multiple workers, such that one can act as standby when the primary receiver fails. > Improve Spark Streaming compatibility with Flume > > > Key: SPARK-1645 > URL: https://issues.apache.org/jira/browse/SPARK-1645 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan > > Currently the following issues affect Spark Streaming and Flume compatibility: > * If a Spark worker goes down, it needs to be restarted on the same node, > else Flume cannot send data to it. We can fix this by adding a Flume receiver > that polls Flume, and a Flume sink that supports this. > * Receiver sends acks to Flume before the driver knows about the data. The > new receiver should also handle this case. > * Data loss when driver goes down - This is true for any streaming ingest, > not just Flume. I will file a separate jira for this and we should work on it > there. This is a longer term project and requires considerable development > work. > I intend to start working on these soon. Any input is appreciated. (It'd be > great if someone could add me as a contributor on jira, so I can assign the > jira to myself). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1100) saveAsTextFile shouldn't clobber by default
[ https://issues.apache.org/jira/browse/SPARK-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1100: --- Assignee: Patrick Wendell (was: Patrick Cogan) > saveAsTextFile shouldn't clobber by default > --- > > Key: SPARK-1100 > URL: https://issues.apache.org/jira/browse/SPARK-1100 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 0.9.0 >Reporter: Diana Carroll >Assignee: Patrick Wendell > Fix For: 1.0.0 > > > If I call rdd.saveAsTextFile with an existing directory, it will cheerfully > and silently overwrite the files in there. This is bad enough if it means > I've accidentally blown away the results of a job that might have taken > minutes or hours to run. But it's worse if the second job happens to have > fewer partitions than the first...in that case, my output directory now > contains some "part" files from the earlier job, and some "part" files from > the later job. The only way to know the difference is timestamp. > I wonder if Spark's saveAsTextFile shouldn't work more like Hadoop MapReduce > which insists that the output directory not exist before the job starts. > Similarly HDFS won't override files by default. Perhaps there could be an > optional argument for saveAsTextFile that indicates if it should delete the > existing directory before starting. (I can't see any time I'd want to allow > writing to an existing directory with data already in it. Would the mix of > output from different tasks ever be desirable?) -- This message was sent by Atlassian JIRA (v6.2#6252)
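The Hadoop-style behavior described above amounts to an existence check before the job writes anything. A minimal sketch using the Hadoop FileSystem API (wiring this into saveAsTextFile, and an optional overwrite flag, are the open design questions and are not shown):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Refuse to start the job if the output directory already exists, the
// way Hadoop MapReduce's FileOutputFormat does.
def checkOutputDir(uri: String): Unit = {
  val path = new Path(uri)
  val fs: FileSystem = path.getFileSystem(new Configuration())
  require(!fs.exists(path), s"Output directory $uri already exists")
}
{code}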
[jira] [Created] (SPARK-1667) Should re-fetch when intermediate data for shuffle is lost
Kousuke Saruta created SPARK-1667: - Summary: Should re-fetch when intermediate data for shuffle is lost Key: SPARK-1667 URL: https://issues.apache.org/jira/browse/SPARK-1667 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.0.0 Reporter: Kousuke Saruta I hit a case where a re-fetch should have occurred but did not. When intermediate shuffle data (the physical file on an executor's local file system) is lost, a FileNotFoundException is thrown and no re-fetch occurs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1569) Spark on Yarn, authentication broken by pr299
[ https://issues.apache.org/jira/browse/SPARK-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-1569: - Issue Type: Sub-task (was: Bug) Parent: SPARK-1652 > Spark on Yarn, authentication broken by pr299 > - > > Key: SPARK-1569 > URL: https://issues.apache.org/jira/browse/SPARK-1569 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Priority: Blocker > > https://github.com/apache/spark/pull/299 changed the way configuration was > done and passed to the executors. This breaks use of authentication as the > executor needs to know that authentication is enabled before connecting to > the driver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves closed SPARK-1625. Resolution: Fixed > Ensure all legacy YARN options are supported with spark-submit > -- > > Key: SPARK-1625 > URL: https://issues.apache.org/jira/browse/SPARK-1625 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984519#comment-13984519 ] Thomas Graves commented on SPARK-1625: -- I'll create a separate jira for that. > Ensure all legacy YARN options are supported with spark-submit > -- > > Key: SPARK-1625 > URL: https://issues.apache.org/jira/browse/SPARK-1625 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-1664: - Issue Type: Sub-task (was: Bug) Parent: SPARK-1652 > spark-submit --name doesn't work in yarn-client mode > > > Key: SPARK-1664 > URL: https://issues.apache.org/jira/browse/SPARK-1664 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Priority: Blocker > > When using spark-submit in yarn-client mode, the --name option doesn't > properly set the application name in the ResourceManager UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1666) document examples
Diana Carroll created SPARK-1666: Summary: document examples Key: SPARK-1666 URL: https://issues.apache.org/jira/browse/SPARK-1666 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 0.9.1 Reporter: Diana Carroll It would be great if there were some guidance about what the example code shipped with Spark (under $SPARK_HOME/examples and $SPARK_HOME/python/examples) does and how to run it. Perhaps a comment block at the beginning explaining what the code accomplishes and what parameters it takes. Also, if there are sample datasets on which the example is designed to run, please point to those. (As an example, look at kmeans.py, which takes a file argument, but has no hint about what sort of data is in the file or what format the data should be in.) -- This message was sent by Atlassian JIRA (v6.2#6252)
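As an illustration of the kind of header requested, here is a hypothetical comment block for a Scala example (the usage line and parameter descriptions are invented for this sketch, not taken from the actual example):

{code}
/**
 * K-means clustering over a whitespace-delimited text file of numeric
 * vectors, one point per line, e.g. "1.0 2.0 3.0".
 *
 * Usage: ./bin/run-example org.apache.spark.examples.SparkKMeans \
 *          <file> <k> <convergence-distance>
 */
{code}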
[jira] [Created] (SPARK-1665) add a config to replace SPARK_YARN_USER_ENV
Thomas Graves created SPARK-1665: Summary: add a config to replace SPARK_YARN_USER_ENV Key: SPARK-1665 URL: https://issues.apache.org/jira/browse/SPARK-1665 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves We should add a config to replace the env variable SPARK_YARN_USER_ENV. If it makes sense, we should make it generic to all of Spark. If it doesn't, then we should at least have a YARN config so we aren't using environment variables anymore. -- This message was sent by Atlassian JIRA (v6.2#6252)
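A hypothetical shape for such a config (the spark.yarn.user.env. prefix is invented for this sketch; the real property name would be settled in the PR):

{code}
import org.apache.spark.SparkConf

// Read container environment variables from Spark config instead of the
// SPARK_YARN_USER_ENV environment variable.
def userEnvFromConf(conf: SparkConf): Map[String, String] =
  conf.getAll.collect {
    case (key, value) if key.startsWith("spark.yarn.user.env.") =>
      key.stripPrefix("spark.yarn.user.env.") -> value
  }.toMap
{code}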
[jira] [Updated] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-1664: - Priority: Blocker (was: Major) > spark-submit --name doesn't work in yarn-client mode > > > Key: SPARK-1664 > URL: https://issues.apache.org/jira/browse/SPARK-1664 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Priority: Blocker > > When using spark-submit in yarn-client mode, the --name option doesn't > properly set the application name in the ResourceManager UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1664) spark-submit --name doesn't work in yarn-client mode
Thomas Graves created SPARK-1664: Summary: spark-submit --name doesn't work in yarn-client mode Key: SPARK-1664 URL: https://issues.apache.org/jira/browse/SPARK-1664 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves When using spark-submit in yarn-client mode, the --name option doesn't properly set the application name in the ResourceManager UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1663) Spark Streaming docs code has several small errors
[ https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984466#comment-13984466 ] Sean Owen commented on SPARK-1663: -- PR: https://github.com/apache/spark/pull/589 > Spark Streaming docs code has several small errors > -- > > Key: SPARK-1663 > URL: https://issues.apache.org/jira/browse/SPARK-1663 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 0.9.1 >Reporter: Sean Owen >Priority: Minor > Labels: streaming > > The changes are easiest to elaborate in the PR, which I will open shortly. > Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1663) Spark Streaming docs code has several small errors
Sean Owen created SPARK-1663: Summary: Spark Streaming docs code has several small errors Key: SPARK-1663 URL: https://issues.apache.org/jira/browse/SPARK-1663 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Priority: Minor The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1557) Set permissions on event log files/directories
[ https://issues.apache.org/jira/browse/SPARK-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1557. -- Resolution: Fixed Fix Version/s: 1.0.0 > Set permissions on event log files/directories > -- > > Key: SPARK-1557 > URL: https://issues.apache.org/jira/browse/SPARK-1557 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 1.0.0 > > > We should set the permissions on the event log directories and files so that > it restricts access to only those users who own them, but could also allow a > super user to read them so that they could be displayed by the history server > in a multi-tenant secure environment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1639) Some tidying of Spark on YARN code
[ https://issues.apache.org/jira/browse/SPARK-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984318#comment-13984318 ] Thomas Graves commented on SPARK-1639: -- https://github.com/apache/spark/pull/561 > Some tidying of Spark on YARN code > -- > > Key: SPARK-1639 > URL: https://issues.apache.org/jira/browse/SPARK-1639 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > I found a few places where we can consolidate duplicate methods, fix typos, > add comments, and make what's going on more clear. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1625) Ensure all legacy YARN options are supported with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reopened SPARK-1625: -- These aren't the only things broken. One big issue is that authentication isn't being passed properly anymore. Unless that was fixed under another jira? > Ensure all legacy YARN options are supported with spark-submit > -- > > Key: SPARK-1625 > URL: https://issues.apache.org/jira/browse/SPARK-1625 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1662) PySpark fails if python class is used as a data container
Chandan Kumar created SPARK-1662: Summary: PySpark fails if python class is used as a data container Key: SPARK-1662 URL: https://issues.apache.org/jira/browse/SPARK-1662 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Environment: Ubuntu 14, Python 2.7.6 Reporter: Chandan Kumar Priority: Minor PySpark fails if RDD operations are performed on data encapsulated in Python objects (rare use case where plain python objects are used as data containers instead of regular dict or tuples). I have written a small piece of code to reproduce the bug: https://gist.github.com/nrchandan/11394440 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1644) hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an exception
[ https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984114#comment-13984114 ] Guoqiang Li commented on SPARK-1644: {code}
# When Hive support is needed, Datanucleus jars must be included on the classpath.
# Datanucleus jars do not work if only included in the uber jar as plugin.xml metadata is lost.
# Both sbt and maven will populate "lib_managed/jars/" with the datanucleus jars when Spark is
# built with Hive, so first check if the datanucleus jars exist, and then ensure the current Spark
# assembly is built for Hive, before actually populating the CLASSPATH with the jars.
# Note that this check order is faster (by up to half a second) in the case where Hive is not used.
num_datanucleus_jars=$(ls "$FWDIR"/lib_managed/jars/ 2>/dev/null | grep "datanucleus-.*\\.jar" | wc -l)
if [ $num_datanucleus_jars -gt 0 ]; then
  AN_ASSEMBLY_JAR=${ASSEMBLY_JAR:-$DEPS_ASSEMBLY_JAR}
  num_hive_files=$(jar tvf "$AN_ASSEMBLY_JAR" org/apache/hadoop/hive/ql/exec 2>/dev/null | wc -l)
  if [ $num_hive_files -gt 0 ]; then
    echo "Spark assembly has been built with Hive, including Datanucleus jars on classpath" 1>&2
    DATANUCLEUSJARS=$(echo "$FWDIR/lib_managed/jars"/datanucleus-*.jar | tr " " :)
    CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS
  fi
fi
{code}
This only adds files under lib_managed/jars/ to the CLASSPATH; when the current directory is the dist directory, it is unable to work. > hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an > exception > - > > Key: SPARK-1644 > URL: https://issues.apache.org/jira/browse/SPARK-1644 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Guoqiang Li >Assignee: Guoqiang Li > Attachments: spark.log > > > cat conf/hive-site.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.postgresql.Driver</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionUserName</name>
>     <value>hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionPassword</name>
>     <value>passwd</value>
>   </property>
>   <property>
>     <name>hive.metastore.local</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>hdfs://host:8020/user/hive/warehouse</value>
>   </property>
> </configuration>
> {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1629) Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS`
[ https://issues.apache.org/jira/browse/SPARK-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li reassigned SPARK-1629: -- Assignee: Guoqiang Li > Spark should inline use of commons-lang `SystemUtils.IS_OS_WINDOWS` > > > Key: SPARK-1629 > URL: https://issues.apache.org/jira/browse/SPARK-1629 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Minor > > Right now we use this but don't depend on it explicitly (which is wrong). We > should probably just inline this function and remove the need to add a > dependency. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1644) hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an exception
[ https://issues.apache.org/jira/browse/SPARK-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li reassigned SPARK-1644: -- Assignee: Guoqiang Li > hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") throw an > exception > - > > Key: SPARK-1644 > URL: https://issues.apache.org/jira/browse/SPARK-1644 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Guoqiang Li >Assignee: Guoqiang Li > Attachments: spark.log > > > cat conf/hive-site.xml
> {code:xml}
> <configuration>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:postgresql://bj-java-hugedata1:7432/hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionDriverName</name>
>     <value>org.postgresql.Driver</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionUserName</name>
>     <value>hive</value>
>   </property>
>   <property>
>     <name>javax.jdo.option.ConnectionPassword</name>
>     <value>passwd</value>
>   </property>
>   <property>
>     <name>hive.metastore.local</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.metastore.warehouse.dir</name>
>     <value>hdfs://host:8020/user/hive/warehouse</value>
>   </property>
> </configuration>
> {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1636) Move main methods to examples
[ https://issues.apache.org/jira/browse/SPARK-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1636. -- Resolution: Done Fix Version/s: 1.0.0 > Move main methods to examples > - > > Key: SPARK-1636 > URL: https://issues.apache.org/jira/browse/SPARK-1636 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.0.0 > > > Move the main methods to examples and make them compatible with spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)