[jira] [Created] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers
Andrew Davidson created SPARK-37295: --- Summary: illegal reflective access operation has occurred; Please consider reporting this to the maintainers Key: SPARK-37295 URL: https://issues.apache.org/jira/browse/SPARK-37295 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.2 Environment: MacBook Pro running macOS 11.6, spark-3.1.2-bin-hadoop3.2. It is not clear to me how Spark finds Java. I believe I also have Java 8 installed somewhere. ``` $ which java ~/anaconda3/envs/extraCellularRNA/bin/java $ java -version openjdk version "11.0.6" 2020-01-14 OpenJDK Runtime Environment (build 11.0.6+8-b765.1) OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode) ``` Reporter: Andrew Davidson ``` spark = SparkSession\ .builder\ .appName("TestEstimatedScalingFactors")\ .getOrCreate() ``` generates the following warning: ``` WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable ``` I am using pyspark spark-3.1.2-bin-hadoop3.2 on a MacBook Pro running macOS 11.6. My small unit tests seem to work okay; however, they fail when I try to run on 3.2.0. Any idea how I can track down this issue? Kind regards Andy -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
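A note on how Spark finds Java: the launcher scripts run $JAVA_HOME/bin/java when JAVA_HOME is set and otherwise fall back to the first java on the PATH. Below is a minimal sketch, not from the original report, that pins the JVM before the session starts and then asks the driver which Java version it actually got; the Java 8 install path is a hypothetical example and must be adjusted per machine.

```
import os

# Hypothetical Java 8 install path; adjust for your machine. This must be set
# before the SparkSession (and hence the driver JVM) is created.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("TestEstimatedScalingFactors")\
    .getOrCreate()

# Ask the driver JVM (via the py4j gateway) which Java it is actually running on.
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))
spark.stop()
```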
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428517#comment-16428517 ] Andrew Davidson commented on SPARK-23878: - yes please mark resolved > unable to import col() or lit() > --- > > Key: SPARK-23878 > URL: https://issues.apache.org/jira/browse/SPARK-23878 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: eclipse 4.7.3 > pyDev 6.3.2 > pyspark==2.3.0 >Reporter: Andrew Davidson >Priority: Major > Attachments: eclipsePyDevPySparkConfig.png > > > I have some code I am moving from a jupyter notebook to separate python > modules. My notebook uses col() and list() and works fine > when I try to work with module files in my IDE I get the following errors. I > am also not able to run my unit tests. > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: lit load.py > /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: col load.py > /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} > I suspect that when you run pyspark it is generating the col and lit > functions? > I found a discription of the problem @ > [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] > I do not understand how to make this work in my IDE. I am not running > pyspark just an editor > is there some sort of workaround or replacement for these missing functions? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428509#comment-16428509 ] Andrew Davidson commented on SPARK-23878: - added screen shot > unable to import col() or lit() > --- > > Key: SPARK-23878 > URL: https://issues.apache.org/jira/browse/SPARK-23878 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: eclipse 4.7.3 > pyDev 6.3.2 > pyspark==2.3.0 >Reporter: Andrew Davidson >Priority: Major > Attachments: eclipsePyDevPySparkConfig.png > > > I have some code I am moving from a jupyter notebook to separate python > modules. My notebook uses col() and list() and works fine > when I try to work with module files in my IDE I get the following errors. I > am also not able to run my unit tests. > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: lit load.py > /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: col load.py > /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} > I suspect that when you run pyspark it is generating the col and lit > functions? > I found a discription of the problem @ > [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] > I do not understand how to make this work in my IDE. I am not running > pyspark just an editor > is there some sort of workaround or replacement for these missing functions? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428494#comment-16428494 ] Andrew Davidson commented on SPARK-23878: - Hi Hyukjin, many thanks! I am not a Python expert. I did not know about dynamic namespaces. I did a little googling and configured the dynamic namespace in Eclipse: Preferences -> PyDev -> Interpreters -> Python Interpreters, then added pyspark to 'forced builtins'. I attached a screen shot. Any idea how this can be added to the documentation so others do not have to waste time on this detail? p.s. you sent me a link to "Support virtualenv in PySpark" https://issues.apache.org/jira/browse/SPARK-13587 Once I made the PyDev configuration change I am able to use pyspark in a virtualenv. I use this environment in the Eclipse PyDev IDE without any problems. I am able to run Jupyter notebooks without any problem. > unable to import col() or lit() > --- > > Key: SPARK-23878 > URL: https://issues.apache.org/jira/browse/SPARK-23878 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: eclipse 4.7.3 > pyDev 6.3.2 > pyspark==2.3.0 >Reporter: Andrew Davidson >Priority: Major > Attachments: eclipsePyDevPySparkConfig.png > > > I have some code I am moving from a jupyter notebook to separate python > modules. My notebook uses col() and list() and works fine > when I try to work with module files in my IDE I get the following errors. I > am also not able to run my unit tests. > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: lit load.py > /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: col load.py > /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} > I suspect that when you run pyspark it is generating the col and lit > functions? > I found a discription of the problem @ > [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] > I do not understand how to make this work in my IDE. I am not running > pyspark just an editor > is there some sort of workaround or replacement for these missing functions? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-23878: Attachment: eclipsePyDevPySparkConfig.png > unable to import col() or lit() > --- > > Key: SPARK-23878 > URL: https://issues.apache.org/jira/browse/SPARK-23878 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: eclipse 4.7.3 > pyDev 6.3.2 > pyspark==2.3.0 >Reporter: Andrew Davidson >Priority: Major > Attachments: eclipsePyDevPySparkConfig.png > > > I have some code I am moving from a jupyter notebook to separate python > modules. My notebook uses col() and list() and works fine > when I try to work with module files in my IDE I get the following errors. I > am also not able to run my unit tests. > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: lit load.py > /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: col load.py > /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} > I suspect that when you run pyspark it is generating the col and lit > functions? > I found a discription of the problem @ > [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] > I do not understand how to make this work in my IDE. I am not running > pyspark just an editor > is there some sort of workaround or replacement for these missing functions? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23878) unable to import col() or lit()
[ https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427883#comment-16427883 ] Andrew Davidson commented on SPARK-23878: - Hi Hyukjin, you are correct. Most IDEs are primarily language-aware editors and builders. For example, consider Eclipse or IntelliJ for developing a JavaScript website or a Java servlet. The editor functionality knows about the syntax of the language you are working with, along with the libraries and packages you are using. Often the IDE does some sort of continuous build or code analysis to help you find bugs without having to deploy. Often the IDE makes it easy to build, package, and actually deploy on some sort of test server, and to debug and/or run unit tests. So if pyspark is generating functions at run time, that is going to cause problems for the IDE; the functions are not defined in the edit session. [http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext] describes how to write unit tests for pyspark that you can run from your command line and/or from within Eclipse. I think a side effect is that they might cause the functions lit() and col() to be generated? I could not find a work around for col() and lit(). ret = df.select( col(columnName).cast("string").alias("key"), lit(value).alias("source") ) Kind regards Andy > unable to import col() or lit() > --- > > Key: SPARK-23878 > URL: https://issues.apache.org/jira/browse/SPARK-23878 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: eclipse 4.7.3 > pyDev 6.3.2 > pyspark==2.3.0 >Reporter: Andrew Davidson >Priority: Major > > I have some code I am moving from a jupyter notebook to separate python > modules. My notebook uses col() and list() and works fine > when I try to work with module files in my IDE I get the following errors. I > am also not able to run my unit tests. > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: lit load.py > /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} > {color:#FF}Description Resource Path Location Type{color} > {color:#FF}Unresolved import: col load.py > /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} > I suspect that when you run pyspark it is generating the col and lit > functions? > I found a discription of the problem @ > [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] > I do not understand how to make this work in my IDE. I am not running > pyspark just an editor > is there some sort of workaround or replacement for these missing functions? > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
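A sketch of the usual IDE-side workaround, assumed rather than taken from this thread: in PySpark 2.x, col() and lit() are generated when pyspark.sql.functions is first imported, so a static analyzer cannot see them as top-level names, but module-qualified access usually satisfies both the IDE and the interpreter. The columnName/value stand-ins below are hypothetical.

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # module-qualified access avoids the unresolved-import warning

spark = SparkSession.builder.appName("colLitWorkaround").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Equivalent to the select() in the comment above, using F.col/F.lit
# instead of the bare generated names.
columnName, value = "id", "someSource"
ret = df.select(
    F.col(columnName).cast("string").alias("key"),
    F.lit(value).alias("source"),
)
ret.show()
```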
[jira] [Created] (SPARK-23878) unable to import col() or lit()
Andrew Davidson created SPARK-23878: --- Summary: unable to import col() or lit() Key: SPARK-23878 URL: https://issues.apache.org/jira/browse/SPARK-23878 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Environment: eclipse 4.7.3 pyDev 6.3.2 pyspark==2.3.0 Reporter: Andrew Davidson I have some code I am moving from a jupyter notebook to separate python modules. My notebook uses col() and lit() and works fine. When I try to work with module files in my IDE I get the following errors. I am also not able to run my unit tests. {color:#FF}Description Resource Path Location Type{color} {color:#FF}Unresolved import: lit load.py /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color} {color:#FF}Description Resource Path Location Type{color} {color:#FF}Unresolved import: col load.py /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color} I suspect that when you run pyspark it is generating the col and lit functions? I found a description of the problem @ [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark] I do not understand how to make this work in my IDE. I am not running pyspark, just an editor. Is there some sort of workaround or replacement for these missing functions? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431409#comment-15431409 ] Andrew Davidson commented on SPARK-17172: - Hi Sean I forgot about that older jira issue. I never resolved it. I am using Jupyter. I believe each notebook gets its own spark context. I googled around and found some old issues that seem to suggest that both a hive and a sql context were being created. I have not figured out how to either use a different database for the hive context or prevent the original spark context from being created. > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
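One possible angle on the "different database for the hive context" question, sketched here as an assumption rather than a tested fix: Spark copies any spark.hadoop.* configuration keys into the Hadoop/Hive configuration, so the embedded Derby metastore can be pointed at a per-application directory, letting concurrent notebooks avoid fighting over a shared metastore_db lock. The sketch uses the Spark 2.x SparkSession API; on 1.6 the equivalent settings would have to go through the HiveContext's configuration.

```
import tempfile
from pyspark.sql import SparkSession

# Per-application scratch directory for the embedded Derby metastore.
metastore_dir = tempfile.mkdtemp(prefix="metastore-")

spark = SparkSession.builder\
    .appName("perNotebookMetastore")\
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:derby:;databaseName={}/metastore_db;create=true".format(metastore_dir))\
    .enableHiveSupport()\
    .getOrCreate()
```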
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431371#comment-15431371 ] Andrew Davidson commented on SPARK-17172: - Hi Sean the data center was created using spark-ec2 from spark-1.6.1-bin-hadoop2.6 ec2-user@ip-172-31-22-140 root]$ cat /root/spark/RELEASE Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DzincPort=3032 [ec2-user@ip-172-31-22-140 root]$ > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431018#comment-15431018 ] Andrew Davidson commented on SPARK-17172: - Hi Sean It should be very easy to use the attached notebook to reproduce the hive bug. I got the code example from a blog. The original code worked in spark 1.5.x. I also attached an html version of the notebook so you can see the entire stack trace without having to start Jupyter. thanks Andy > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430004#comment-15430004 ] Andrew Davidson commented on SPARK-17172: - Hi Sean I do not think it is the same error. In the related bug, I could not create a udf using sqlContext. The workaround was to change the permissions on hdfs:///tmp. The error msg actually mentioned a problem with /tmp. (I thought the msg referred to file:///tmp.) Not sure how the permissions got messed up? Maybe someone deleted it by accident and spark does not recreate it if it is missing? So now I am able to create a udf using sqlContext; hiveContext does not work. Given I fixed the hdfs:/// permission problem I think it's probably something else. Hopefully the attached notebook makes it easy to reproduce. thanks Andy > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17172: Attachment: hiveUDFBug.ipynb hiveUDFBug.html > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429465#comment-15429465 ] Andrew Davidson commented on SPARK-17172: - attached a notebook that demonstrates the bug. Also attached an html version of the notebook > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > Attachments: hiveUDFBug.html, hiveUDFBug.ipynb > > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
[ https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429463#comment-15429463 ] Andrew Davidson commented on SPARK-17172: - related bug report : https://issues.apache.org/jira/browse/SPARK-17143 > pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while > calling None.org.apache.spark.sql.hive.HiveContext. > -- > > Key: SPARK-17172 > URL: https://issues.apache.org/jira/browse/SPARK-17172 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.2 > Environment: spark version: 1.6.2 > python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) > [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] >Reporter: Andrew Davidson > > from pyspark.sql import HiveContext > sqlContext = HiveContext(sc) > # Define udf > from pyspark.sql.functions import udf > def scoreToCategory(score): > if score >= 80: return 'A' > elif score >= 60: return 'B' > elif score >= 35: return 'C' > else: return 'D' > > udfScoreToCategory=udf(scoreToCategory, StringType()) > throws exception > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17172) pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
Andrew Davidson created SPARK-17172: --- Summary: pyspark hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. Key: SPARK-17172 URL: https://issues.apache.org/jira/browse/SPARK-17172 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.2 Environment: spark version: 1.6.2 python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] Reporter: Andrew Davidson from pyspark.sql import HiveContext sqlContext = HiveContext(sc) # Define udf from pyspark.sql.functions import udf from pyspark.sql.types import StringType def scoreToCategory(score): if score >= 80: return 'A' elif score >= 60: return 'B' elif score >= 35: return 'C' else: return 'D' udfScoreToCategory=udf(scoreToCategory, StringType()) throws exception Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
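For reference, a hedged sketch of the same repro against the Spark 2.x+ API, not part of the original 1.6.2 report: SparkSession.enableHiveSupport() replaced the standalone HiveContext, and the UDF registers the same way. Note the StringType import, which the snippet above also needs.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder\
    .appName("scoreToCategory")\
    .enableHiveSupport()\
    .getOrCreate()  # requires a Spark build with Hive support

def scoreToCategory(score):
    if score >= 80: return 'A'
    elif score >= 60: return 'B'
    elif score >= 35: return 'C'
    else: return 'D'

udfScoreToCategory = udf(scoreToCategory, StringType())

# Small sanity check: apply the UDF to a toy DataFrame.
df = spark.createDataFrame([(95,), (70,), (40,), (10,)], ["score"])
df.select(udfScoreToCategory("score").alias("category")).show()
```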
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427394#comment-15427394 ] Andrew Davidson commented on SPARK-17143: - See email from the user's group. I was able to find a workaround. Not sure how hdfs:///tmp/ got created or how the permissions got messed up. ## NICE CATCH!!! Many thanks. I spent all day on this bug. The error msg reported /tmp. I did not think to look on hdfs. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup 418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-31-22-140 notebooks]$ I have no idea how hdfs:///tmp got created. I deleted it. This caused a bunch of exceptions. These exceptions had useful messages. I was able to fix the problem as follows: $ hadoop fs -rmr hdfs:///tmp Now I run the notebook. It creates hdfs:///tmp/hive but the permissions are wrong: $ hadoop fs -chmod 777 hdfs:///tmp/hive From: Felix Cheung Date: Thursday, August 18, 2016 at 3:37 PM To: Andrew Davidson , "user @spark" Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Do you have a file called tmp at / on HDFS? > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user "ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore.
hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes =
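The hadoop fs commands in the comment above boil down to a few checks; here is a small diagnostic sketch that makes them repeatable, assuming only that the hadoop CLI is on the PATH (and Python 3.7+ for capture_output):

```
import subprocess

def hdfs_ls(path):
    """Run `hadoop fs -ls <path>` and return its stdout."""
    result = subprocess.run(["hadoop", "fs", "-ls", path],
                            capture_output=True, text=True)
    return result.stdout

# A file entry (-rw-...) for /tmp instead of a directory (drwx...) is what
# produced the FileAlreadyExistsException described in this thread.
print(hdfs_ls("hdfs:///"))
print(hdfs_ls("hdfs:///tmp"))
print(hdfs_ls("hdfs:///tmp/hive"))  # per the workaround, this must be mode 777
```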
[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427278#comment-15427278 ] Andrew Davidson commented on SPARK-17143: - given the exception metioned an issue with /tmp I decide to track how /tmp changed when run my cell # no spark jobs are running [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start notebook server $ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out & [ec2-user@ip-172-31-22-140 notebooks]$ !ls ls /tmp/ hsperfdata_ec2-user hsperfdata_root pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # start the udfBug notebook [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user [ec2-user@ip-172-31-22-140 notebooks]$ # execute cell that define UDF [ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/ hsperfdata_ec2-user hsperfdata_root libnetty-transport-native-epoll818283657820702.so pip_build_ec2-user spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9 [ec2-user@ip-172-31-22-140 notebooks]$ [ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/ /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0 /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.html This html version of the notebook shows the output when run in my data center > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.html, udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user "ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive.
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > -> 10
[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-17143: Attachment: udfBug.ipynb The attached notebook demonstrates the reported bug. Note it includes the output when run on my MacBook Pro. The bug report contains the stack trace when the same code is run in my data center. > pyspark unable to create UDF: java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > --- > > Key: SPARK-17143 > URL: https://issues.apache.org/jira/browse/SPARK-17143 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] >Reporter: Andrew Davidson > Attachments: udfBug.ipynb > > > For unknown reason I can not create UDF when I run the attached notebook on > my cluster. I get the following error > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: > org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a > directory: /tmp tmp > The notebook runs fine on my Mac > In general I am able to run non UDF spark code with out any trouble > I start the notebook server as the user "ec2-user" and uses master URL > spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 > I found the following message in the notebook server log file. I have log > level set to warn > 16/08/18 21:38:45 WARN ObjectStore: Version information not found in > metastore. hive.metastore.schema.verification is not enabled so recording the > schema version 1.2.0 > 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning > NoSuchObjectException > The cluster was originally created using > spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > #from pyspark.sql import SQLContext, HiveContext > #sqlContext = SQLContext(sc) > > #from pyspark.sql import DataFrame > #from pyspark.sql import functions > > from pyspark.sql.types import StringType > from pyspark.sql.functions import udf > > print("spark version: {}".format(sc.version)) > > import sys > print("python version: {}".format(sys.version)) > spark version: 1.6.1 > python version: 3.4.3 (default, Apr 1 2015, 18:10:40) > [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] > # functions.lower() raises > # py4j.Py4JException: Method lower([class java.lang.String]) does not exist > # work around define a UDF > toLowerUDFRetType = StringType() > #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > toLowerUDF = udf(lambda s : s.lower(), StringType()) > You must build Spark with Hive.
Export 'SPARK_HIVE=true' and run build/sbt > assembly > Py4JJavaErrorTraceback (most recent call last) > in () > 4 toLowerUDFRetType = StringType() > 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) > /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /root/spark/python/pyspark/sql/functions.py in __init__(self, func, > returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) >1062 answ
[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
Andrew Davidson created SPARK-17143: --- Summary: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp Key: SPARK-17143 URL: https://issues.apache.org/jira/browse/SPARK-17143 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] Reporter: Andrew Davidson For an unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp The notebook runs fine on my Mac. In general I am able to run non-UDF spark code without any trouble. I start the notebook server as the user "ec2-user" and use the master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066 I found the following messages in the notebook server log file. I have the log level set to warn. 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 #from pyspark.sql import SQLContext, HiveContext #sqlContext = SQLContext(sc) #from pyspark.sql import DataFrame #from pyspark.sql import functions from pyspark.sql.types import StringType from pyspark.sql.functions import udf print("spark version: {}".format(sc.version)) import sys print("python version: {}".format(sys.version)) spark version: 1.6.1 python version: 3.4.3 (default, Apr 1 2015, 18:10:40) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] # functions.lower() raises # py4j.Py4JException: Method lower([class java.lang.String]) does not exist # work around: define a UDF toLowerUDFRetType = StringType() #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) toLowerUDF = udf(lambda s : s.lower(), StringType()) You must build Spark with Hive.
Export 'SPARK_HIVE=true' and run build/sbt assembly Py4JJavaErrorTraceback (most recent call last) in () 4 toLowerUDFRetType = StringType() 5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType) > 6 toLowerUDF = udf(lambda s : s.lower(), StringType()) /root/spark/python/pyspark/sql/functions.py in udf(f, returnType) 1595 [Row(slen=5), Row(slen=3)] 1596 """ -> 1597 return UserDefinedFunction(f, returnType) 1598 1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] /root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name) 1556 self.returnType = returnType 1557 self._broadcast = None -> 1558 self._judf = self._create_judf(name) 1559 1560 def _create_judf(self, name): /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name) 1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self) 1568 ctx = SQLContext.getOrCreate(sc) -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) 1570 if name is None: 1571 name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__ /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) 681 try: 682 if not hasattr(self, '_scala_HiveContext'): --> 683 self._scala_HiveContext = self._get_hive_ctx() 684 return self._scala_HiveContext 685 except Py4JError as e: /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) 690 691 def _get_hive_ctx(self): --> 692 return self._jvm.HiveContext(self._jsc.sc()) 693 694 def refreshTable(self, tableName): /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 1062 answer = self._gateway_client.send_command(command) 1063 return_value = get_return_value( -> 1064 answer, self._gateway_client, None, self._fqn) 1065 1066 for temp_arg in temp_args: /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359277#comment-15359277 ] Andrew Davidson commented on SPARK-15829: - Hi Sean You mention the ec2 script is not supported anymore? What was the last release it was supported in? It's still part of the 1.6.x documentation. Is there a replacement or alternative? thanks Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the stand alone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let my upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to teh 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it things the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engine the url by noticing the private ip address is part of > the worker id . For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359215#comment-15359215 ] Andrew Davidson commented on SPARK-15829: - Hi Sean I am not sure how to check the value of 'SPARK_MASTER_HOST'. I looked at the documentation pages http://spark.apache.org/docs/latest/configuration.html and http://spark.apache.org/docs/latest/ec2-scripts.html. They do not mention SPARK_MASTER_HOST. When I submit my jobs I use MASTER_URL=spark://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:6066 I use the stand alone cluster manager. I think the problem may be that the web UI assumes the driver is always running on the master machine. I assume the cluster manager decides which worker the driver will run on. Is there a way for the web UI to discover where the driver is running? On my master [ec2-user@ip-172-31-22-140 conf]$ pwd /root/spark/conf [ec2-user@ip-172-31-22-140 conf]$ cat slaves ec2-54-193-94-207.us-west-1.compute.amazonaws.com ec2-54-67-13-246.us-west-1.compute.amazonaws.com ec2-54-67-48-49.us-west-1.compute.amazonaws.com ec2-54-193-104-169.us-west-1.compute.amazonaws.com [ec2-user@ip-172-31-22-140 conf]$ [ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST * [ec2-user@ip-172-31-22-140 conf]$ pwd /root/spark/conf [ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST * [ec2-user@ip-172-31-22-140 conf]$ [ec2-user@ip-172-31-22-140 sbin]$ pwd /root/spark/sbin [ec2-user@ip-172-31-22-140 sbin]$ grep SPARK_MASTER_HOST * [ec2-user@ip-172-31-22-140 bin]$ !grep grep SPARK_MASTER_HOST * [ec2-user@ip-172-31-22-140 bin]$ Thanks for looking into this Andy > spark master webpage links to application UI broke when running in cluster > mode > --- > > Key: SPARK-15829 > URL: https://issues.apache.org/jira/browse/SPARK-15829 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.6.1 > Environment: AWS ec2 cluster >Reporter: Andrew Davidson >Priority: Critical > > Hi > I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 > I use the stand alone cluster manager. I have a streaming app running in > cluster mode. I notice the master webpage links to the application UI page > are incorrect > It does not look like jira will let my upload images. I'll try and describe > the web pages and the bug > My master is running on > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/ > It has a section marked "applications". If I click on one of the running > application ids I am taken to a page showing "Executor Summary". This page > has a link to teh 'application detail UI' the url is > http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/ > Notice it things the application UI is running on the cluster master. > It is actually running on the same machine as the driver on port 4041. I was > able to reverse engine the url by noticing the private ip address is part of > the worker id . For example worker-20160322041632-172.31.23.201-34909 > next I went on the aws ec2 console to find the public DNS name for this > machine > http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/ > Kind regards > Andy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325001#comment-15325001 ] Andrew Davidson commented on SPARK-15829: - Hi Xin, I ran netstat on my master. I do not think the ports are in use. To submit in cluster mode I use port 6066. If you are using port 7077 you are in client mode. In client mode the application UI will run on the spark master. In cluster mode the application UI runs on whichever slave the driver is running on. If you notice in my original description, the url is incorrect: the ip is wrong, the port is correct. Kind regards Andy
bash-4.2# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address    State    PID/Program name
tcp    0      0    0.0.0.0:8652            0.0.0.0:*          LISTEN   3832/gmetad
tcp    0      0    0.0.0.0:8787            0.0.0.0:*          LISTEN   2584/rserver
tcp    0      0    0.0.0.0:36757           0.0.0.0:*          LISTEN   2905/java
tcp    0      0    0.0.0.0:50070           0.0.0.0:*          LISTEN   2905/java
tcp    0      0    0.0.0.0:22              0.0.0.0:*          LISTEN   2144/sshd
tcp    0      0    127.0.0.1:631           0.0.0.0:*          LISTEN   2095/cupsd
tcp    0      0    127.0.0.1:7000          0.0.0.0:*          LISTEN   6512/python3.4
tcp    0      0    127.0.0.1:25            0.0.0.0:*          LISTEN   2183/sendmail
tcp    0      0    0.0.0.0:43813           0.0.0.0:*          LISTEN   3093/java
tcp    0      0    172.31.22.140:9000      0.0.0.0:*          LISTEN   2905/java
tcp    0      0    0.0.0.0:8649            0.0.0.0:*          LISTEN   3810/gmond
tcp    0      0    0.0.0.0:50090           0.0.0.0:*          LISTEN   3093/java
tcp    0      0    0.0.0.0:8651            0.0.0.0:*          LISTEN   3832/gmetad
tcp    0      0    :::8080                 :::*               LISTEN   23719/java
tcp    0      0    :::8081                 :::*               LISTEN   5588/java
tcp    0      0    :::172.31.22.140:6066   :::*               LISTEN   23719/java
tcp    0      0    :::172.31.22.140:6067   :::*               LISTEN   5588/java
tcp    0      0    :::22                   :::*               LISTEN   2144/sshd
tcp    0      0    ::1:631                 :::*               LISTEN   2095/cupsd
tcp    0      0    :::19998                :::*               LISTEN   3709/java
tcp    0      0    :::1                    :::*               LISTEN   3709/java
tcp    0      0    :::172.31.22.140:7077   :::*               LISTEN   23719/java
tcp    0      0    :::172.31.22.140:7078   :::*               LISTEN   5588/java
udp    0      0    0.0.0.0:8649            0.0.0.0:*                   3810/gmond
udp    0      0    0.0.0.0:631             0.0.0.0:*                   2095/cupsd
udp    0      0    0.0.0.0:38546           0.0.0.0:*                   2905/java
udp    0      0    0.0.0.0:68              0.0.0.0:*                   1142/dhclient
udp    0      0    172.31.22.140:123       0.0.0.0:*                   2168/ntpd
udp    0      0    127.0.0.1:123           0.0.0.0:*                   2168/ntpd
udp    0      0    0.0.0.0:123             0.0.0.0:*                   2168/ntpd
bash-4.2#
> spark master webpage links to application UI broke when running in cluster mode
> ---
>
> Key: SPARK-15829
> URL: https://issues.apache.org/jira/browse/SPARK-15829
> Project: Spark
> Issue Type: Bug
> Components: EC2
> Affects Versions: 1.6.1
> Environment: AWS ec2 cluster
> Reporter: Andrew Davidson
> Priority: Critical
>
> Hi
> I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script. I use the standalone cluster manager. I have a streaming app running in cluster mode. I noticed the master webpage links to the application UI page are incorrect.
> It does not look like jira will let me upload imag
[jira] [Created] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode
Andrew Davidson created SPARK-15829: --- Summary: spark master webpage links to application UI broke when running in cluster mode Key: SPARK-15829 URL: https://issues.apache.org/jira/browse/SPARK-15829 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.6.1 Environment: AWS ec2 cluster Reporter: Andrew Davidson Priority: Critical
Hi
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2 script. I use the standalone cluster manager. I have a streaming app running in cluster mode. I noticed the master webpage links to the application UI page are incorrect.
It does not look like jira will let me upload images. I'll try and describe the web pages and the bug.
My master is running on http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/
It has a section marked "applications". If I click on one of the running application ids I am taken to a page showing "Executor Summary". This page has a link to the 'application detail UI'; the url is http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/
Notice it thinks the application UI is running on the cluster master.
It is actually running on the same machine as the driver on port 4041. I was able to reverse-engineer the url by noticing the private ip address is part of the worker id. For example worker-20160322041632-172.31.23.201-34909.
Next I went to the aws ec2 console to find the public DNS name for this machine:
http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/
Kind regards
Andy
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
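The reverse-engineering described in the report can be scripted. Below is a minimal, hypothetical Java sketch of the manual steps: it pulls the private ip out of a standalone-mode worker id (the layout is inferred from the single example above) and notes that the ip still has to be mapped to a public DNS name by hand, as the reporter did in the EC2 console.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DriverUiUrl {
    // Worker ids look like worker-20160322041632-172.31.23.201-34909:
    // "worker-" + timestamp + "-" + privateIp + "-" + port.
    private static final Pattern WORKER_ID =
            Pattern.compile("worker-\\d+-(\\d+\\.\\d+\\.\\d+\\.\\d+)-\\d+");

    /** Extract the private ip embedded in a standalone-mode worker id. */
    static String privateIp(String workerId) {
        Matcher m = WORKER_ID.matcher(workerId);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected worker id: " + workerId);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String ip = privateIp("worker-20160322041632-172.31.23.201-34909");
        // The private ip still has to be mapped to a public DNS name by hand
        // (e.g. in the AWS EC2 console) before the UI url can be built.
        System.out.println("driver runs on private ip " + ip
                + "; application UI at http://<public-dns-of-" + ip + ">:4041/");
    }
}
```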
[jira] [Commented] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database
[ https://issues.apache.org/jira/browse/SPARK-15506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302359#comment-15302359 ] Andrew Davidson commented on SPARK-15506: - Hi Jeff, here is how I start the notebook server (I believe spark uses jupyter): $SPARK_ROOT/bin/pyspark Can you tell me where I can find out more about the configuration details? I do not think the issue is multiple users. I discovered the bug while running two notebooks on my local machine, i.e. I was running both notebooks. It seems like each notebook server needs its own database? Kind regards Andy p.s. even in our data center I start the notebook server the same way. I am the only data scientist.
> only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database
> ---
>
> Key: SPARK-15506
> URL: https://issues.apache.org/jira/browse/SPARK-15506
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Environment: Mac OS X El Capitan
> Python 3.4.2
> Reporter: Andrew Davidson
>
> I am using a sqlContext to create dataframes. I noticed that if I open up and run 'notebook a', and 'a' defines a udf, then I will not be able to open a second notebook that also defines a udf unless I shut down notebook a first.
> In the second notebook I get a big long stack trace. The problem seems to be:
> Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
> at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
> at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
> at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
> ... 86 more
> Here is the complete stack trace
> Kind regards
> Andy
> You must build Spark with Hive. 
Export 'SPARK_HIVE=true' and run build/sbt > assembly > --- > Py4JJavaError Traceback (most recent call last) > in () > 16 #fooUDF = udf(lambda arg : "aedwip") > 17 > ---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5)) > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py > in udf(f, returnType) >1595 [Row(slen=5), Row(slen=3)] >1596 """ > -> 1597 return UserDefinedFunction(f, returnType) >1598 >1599 blacklist = ['map', 'since', 'ignore_unicode_prefix'] > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py > in __init__(self, func, returnType, name) >1556 self.returnType = returnType >1557 self._broadcast = None > -> 1558 self._judf = self._create_judf(name) >1559 >1560 def _create_judf(self, name): > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py > in _create_judf(self, name) >1567 pickled_command, broadcast_vars, env, includes = > _prepare_for_python_RDD(sc, command, self) >1568 ctx = SQLContext.getOrCreate(sc) > -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) >1570 if name is None: >1571 name = f.__name__ if hasattr(f, '__name__') else > f.__class__.__name__ > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py > in _ssql_ctx(self) > 681 try: > 682 if not hasattr(self, '_scala_HiveContext'): > --> 683 self._scala_HiveContext = self._get_hive_ctx() > 684 return self._scala_HiveContext > 685 except Py4JError as e: > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py > in _get_hive_ctx(self) > 690 > 691 def _get_hive_ctx(self): > --> 692 return self._jvm.HiveContext(self._jsc.sc()) > 693 > 694 def refreshTable(self, tableName): > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1062 answer = self._gateway_client.send_command(command) >1063 return_value = get_return_value( > -> 1064 answer, self._gateway_client, None, self._fqn) >1065 >1066 for temp_arg in temp_args: > /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark
[jira] [Created] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database
Andrew Davidson created SPARK-15506: --- Summary: only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database Key: SPARK-15506 URL: https://issues.apache.org/jira/browse/SPARK-15506 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: Mac OS X El Capitan, Python 3.4.2 Reporter: Andrew Davidson
I am using a sqlContext to create dataframes. I noticed that if I open up and run 'notebook a', and 'a' defines a udf, then I will not be able to open a second notebook that also defines a udf unless I shut down notebook a first. In the second notebook I get a big long stack trace. The problem seems to be:
Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
... 86 more
Here is the complete stack trace. Kind regards Andy
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
---
Py4JJavaError Traceback (most recent call last)
 in () 16 #fooUDF = udf(lambda arg : "aedwip") 17 ---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5))
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py in udf(f, returnType) 1595 [Row(slen=5), Row(slen=3)] 1596 """ -> 1597 return UserDefinedFunction(f, returnType) 1598 1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py in __init__(self, func, returnType, name) 1556 self.returnType = returnType 1557 self._broadcast = None -> 1558 self._judf = self._create_judf(name) 1559 1560 def _create_judf(self, name):
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py in _create_judf(self, name) 1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self) 1568 ctx = SQLContext.getOrCreate(sc) -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json()) 1570 if name is None: 1571 name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py in _ssql_ctx(self) 681 try: 682 if not hasattr(self, '_scala_HiveContext'): --> 683 self._scala_HiveContext = self._get_hive_ctx() 684 return self._scala_HiveContext 685 except Py4JError as e:
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py in _get_hive_ctx(self) 690 691 def _get_hive_ctx(self): --> 692 return self._jvm.HiveContext(self._jsc.sc()) 693 694 def refreshTable(self, tableName):
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 1062 answer = self._gateway_client.send_command(command) 1063 return_value = get_return_value( -> 1064 answer, self._gateway_client, None, self._fqn) 1065 1066 for temp_arg in temp_args:
/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except 
py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred while calling {0}{1}{2}.\n". --> 308 format(target_id, ".", name), value) 309 else: 310 raise Py4JError( Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.sessio
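One workaround consistent with the "each notebook server needs its own database" observation is to give each driver JVM its own Derby system directory, since the Hive metastore's default Derby url uses a relative metastore_db path that Derby resolves against derby.system.home. This is a hedged sketch, not a tested fix; the directory name is made up, and from a PySpark notebook the same property would instead be passed with --conf spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby-notebook-a.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class NotebookA {
    public static void main(String[] args) {
        // Hypothetical per-notebook directory. Pointing each driver JVM at its
        // own derby.system.home gives each one its own metastore_db, which
        // avoids the "Another instance of Derby may have already booted the
        // database" lock. Must be set before the HiveContext boots Derby.
        System.setProperty("derby.system.home", "/tmp/derby-notebook-a");

        SparkConf conf = new SparkConf().setAppName("notebook-a").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext sqlContext = new HiveContext(sc.sc());
        // ... define udfs and data frames as in the notebooks above ...
        sc.stop();
    }
}
```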
[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones
[ https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228529#comment-15228529 ] Andrew Davidson commented on SPARK-14057: - Hi Vijay, I should have some time to take a look tomorrow. Kind regards Andy From: "Vijay Parmar (JIRA)" Date: Tuesday, April 5, 2016 at 3:23 PM To: Andrew Davidson Subject: [jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones
> sql time stamps do not respect time zones
> ---
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Andrew Davidson
> Priority: Minor
>
> We have time stamp data. The time stamp data is UTC; however, when we load the data into spark data frames, the system assumes the time stamps are in the local time zone. This causes problems for our data scientists. Often they pull data from our data center onto their local macs. The data centers run UTC. Their computers are typically in PST or EST.
> It is possible to hack around this problem.
> This causes a lot of errors in their analysis.
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones
[ https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216293#comment-15216293 ] Andrew Davidson commented on SPARK-14057: - Hi Vijay, I am fairly new to this also. I think I would contact Russell (of the spark-cassandra-connector) about his analysis. Potentially, making a change to the way dates are handled could break a lot of existing apps. I think the key is to figure out how to come up with a solution that is backwards compatible. I have filed several bugs in the past, but only after discussion on the users email group. I would suggest that before you do a lot of work you see what the user group thinks. Many thanks for taking this on. Andy
> sql time stamps do not respect time zones
> ---
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Andrew Davidson
> Priority: Minor
>
> We have time stamp data. The time stamp data is UTC; however, when we load the data into spark data frames, the system assumes the time stamps are in the local time zone. This causes problems for our data scientists. Often they pull data from our data center onto their local macs. The data centers run UTC. Their computers are typically in PST or EST.
> It is possible to hack around this problem.
> This causes a lot of errors in their analysis.
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones
[ https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214334#comment-15214334 ] Andrew Davidson commented on SPARK-14057: - Hi Vijay, here is some more info from the email thread mentioned above. Russell is very involved in the development of cassandra and the spark cassandra connector. He has a suggestion for how to fix this bug. kind regards Andy http://www.slideshare.net/RussellSpitzer On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer wrote:
> Unfortunately part of Spark SQL. They have based their type on java.sql.Timestamp (and Date) which adjusts to the client timezone when displaying and storing.
> See discussions
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> And Code
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> sql time stamps do not respect time zones
> ---
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Andrew Davidson
> Priority: Minor
>
> We have time stamp data. The time stamp data is UTC; however, when we load the data into spark data frames, the system assumes the time stamps are in the local time zone. This causes problems for our data scientists. Often they pull data from our data center onto their local macs. The data centers run UTC. Their computers are typically in PST or EST.
> It is possible to hack around this problem.
> This causes a lot of errors in their analysis.
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
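The "hack around this problem" mentioned in the report usually means pinning every JVM's default time zone to UTC, since java.sql.Timestamp stores an instant but renders it in the JVM default zone, which is exactly what Russell describes above. A minimal sketch, assuming nothing beyond the JDK:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class UtcDemo {
    public static void main(String[] args) {
        // java.sql.Timestamp has no zone of its own; toString() renders it in
        // the JVM default zone, which is what makes UTC data look shifted on
        // a PST/EST laptop.
        Timestamp ts = new Timestamp(1449775664000L);
        System.out.println(ts); // rendered in the local default zone

        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        System.out.println(ts); // same instant, now rendered as UTC

        // For a cluster the equivalent pinning is usually done with
        // -Duser.timezone=UTC in spark.driver.extraJavaOptions and
        // spark.executor.extraJavaOptions so driver and executors agree.
    }
}
```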
[jira] [Created] (SPARK-14057) sql time stamps do not respect time zones
Andrew Davidson created SPARK-14057: --- Summary: sql time stamps do not respect time zones Key: SPARK-14057 URL: https://issues.apache.org/jira/browse/SPARK-14057 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Andrew Davidson Priority: Minor
We have time stamp data. The time stamp data is UTC; however, when we load the data into spark data frames, the system assumes the time stamps are in the local time zone. This causes problems for our data scientists. Often they pull data from our data center onto their local macs. The data centers run UTC. Their computers are typically in PST or EST.
It is possible to hack around this problem.
This causes a lot of errors in their analysis.
A complete description of this issue can be found in the following mail msg https://www.mail-archive.com/user@spark.apache.org/msg48121.html
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130923#comment-15130923 ] Andrew Davidson commented on SPARK-13065: - The code looks really nice. Well done. Andy
> streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
> ---
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: all
> Reporter: Andrew Davidson
> Priority: Minor
> Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
> Currently the spark twitter api only allows you to configure a small subset of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> . . .
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } else {
> newTwitterStream.sample()
> }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
> Looks like an easy fix. See source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it however you like. In my streaming spark app main() I have the following code:
> FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
> } /*else
> spark native api
> String[] filters = {"tag1", "tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> see
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> causes
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
[ https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128843#comment-15128843 ] Andrew Davidson commented on SPARK-13009: - Hi Sean, I totally agree with you. The Twitter4j people asked me to file an RFE with spark. I agree it is their problem. I am just looking for some sort of workaround. My downstream systems will not be able to process the data I am capturing. I guess in the short term I will create the wrapper object and modify the spark twitter source code. kind regards Andy
> spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
> ---
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Reporter: Andrew Davidson
> Priority: Minor
>
> The Streaming-twitter package makes it easy for Java programmers to work with twitter. The implementation returns the raw twitter data in JSON format as a twitter4J StatusJSONImpl object:
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The status class is different from the raw JSON, i.e. serializing the status object will not be the same as the original json. I have downstream systems that can only process raw tweets, not twitter4J Status objects.
> Here is my bug/RFE request made to Twitter4J. They asked me to create a spark tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis; however, I need to store the raw JSON. There are other systems that process that data that are not written in Java.
> Currently we are serializing the Status Object. The serialized JSON is going to break downstream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be possible to add a member function getRawJson() to StatusJSONImpl? By default the returned value would be null unless jsonStoreEnabled=True is set in the config.
> Alternative implementations:
> It should be possible to modify spark-streaming-twitter_2.10 to provide this support. The solutions are not very clean:
> It would require Apache Spark to define its own Status POJO. The current StatusJSONImpl class is marked final.
> The Wrapper is not going to work nicely with existing code.
> spark-streaming-twitter_2.10 does not expose all of the twitter streaming API, so many developers are writing their own implementations of org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance difficult. It's not easy to know when the spark implementation for twitter has changed.
> Code listing for spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> @volatile private var twitterStream: TwitterStream = _
> @volatile private var stopped = false
> def onStart() {
> try {
> val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
> newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
> store(status)
> }
> Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think?
> Kind regards
> Andy
> From: on behalf of Igor Brigadir
> Reply-To:
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J
> Subject: Re: [Twitter4J] trouble writing unit test
> Main issue is that the Json object is in the wrong json format.
> eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 +0000 2015", ...
> It looks like the json you have was serialized from a java Status object, which makes the json objects different from what you get from the API; TwitterObjectFactory expects json from Twitter (I haven't had any problems using TwitterObjectFactory instead of the Deprecated DataObjectFactory).
> You could "fix" it by matching the keys & values you have with the correct twitter API json - it should look like the example here: https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, but this time use TwitterObjectFactory.getRawJSON(status) to get the Original Json from the Twitter API, and save tha
[jira] [Comment Edited] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128832#comment-15128832 ] Andrew Davidson edited comment on SPARK-13065 at 2/2/16 7:20 PM: - Hi Sachin, I attached my java implementation for this enhancement as a reference. I also changed the description above: I added the code I use in my streaming spark app main(). I chose a bad name for the attachment; it's not in patch format. Kind regards Andy was (Author: aedwip): sorry, bad name. It's not in patch format.
> streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
> ---
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: all
> Reporter: Andrew Davidson
> Priority: Minor
> Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
> Currently the spark twitter api only allows you to configure a small subset of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> . . .
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } else {
> newTwitterStream.sample()
> }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
> Looks like an easy fix. See source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it however you like. In my streaming spark app main() I have the following code:
> FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
> } /*else
> spark native api
> String[] filters = {"tag1", "tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> see
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> causes
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-13065: Attachment: twitterFilterQueryPatch.tar.gz Sorry, bad name. It's not in patch format.
> streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
> ---
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: all
> Reporter: Andrew Davidson
> Priority: Minor
> Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
> Currently the spark twitter api only allows you to configure a small subset of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> . . .
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } else {
> newTwitterStream.sample()
> }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
> Looks like an easy fix. See source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it however you like. In my streaming spark app main() I have the following code:
> FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
> } /*else
> spark native api
> String[] filters = {"tag1", "tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> see
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> causes
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-13065: Description:
The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
Currently the spark twitter api only allows you to configure a small subset of the possible filters:
String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
The current implementation does:
private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
) extends Receiver[Status](storageLevel) with Logging {
. . .
val query = new FilterQuery
if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} else {
newTwitterStream.sample()
}
...
Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
Looks like an easy fix. See source code links below.
kind regards
Andy
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
$ 2/2/16
attached is my java implementation for this problem. Feel free to reuse it however you like. In my streaming spark app main() I have the following code:
FilterQuery query = config.getFilterQuery().fetch();
if (query != null) {
// TODO https://issues.apache.org/jira/browse/SPARK-13065
tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, query);
} /*else
spark native api
String[] filters = {"tag1", "tag2"}
tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
see
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
causes
val query = new FilterQuery
if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} */
was:
The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
Currently the spark twitter api only allows you to configure a small subset of the possible filters:
String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
The current implementation does:
private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
) extends Receiver[Status](storageLevel) with Logging {
. . .
val query = new FilterQuery
if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} else {
newTwitterStream.sample()
}
...
Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
Looks like an easy fix. 
See source code links below.
kind regards
Andy
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
> ---
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: all
> Reporter: Andrew Davidson
> Priority: Minor
> Labels: twitter
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
> Currently the spark twitter api only allows you to configure a small subset of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filter
[jira] [Commented] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
[ https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126936#comment-15126936 ] Andrew Davidson commented on SPARK-13009: - Agreed. They asked me to file a spark RFE. I will post your comment back to them and see what they say. If the StatusJSONImpl class was not marked final it would be easy for spark to create the wrapper. Andy
> spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
> ---
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Reporter: Andrew Davidson
> Priority: Blocker
> Labels: twitter
>
> The Streaming-twitter package makes it easy for Java programmers to work with twitter. The implementation returns the raw twitter data in JSON format as a twitter4J StatusJSONImpl object:
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The status class is different from the raw JSON, i.e. serializing the status object will not be the same as the original json. I have downstream systems that can only process raw tweets, not twitter4J Status objects.
> Here is my bug/RFE request made to Twitter4J. They asked me to create a spark tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis; however, I need to store the raw JSON. There are other systems that process that data that are not written in Java.
> Currently we are serializing the Status Object. The serialized JSON is going to break downstream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be possible to add a member function getRawJson() to StatusJSONImpl? By default the returned value would be null unless jsonStoreEnabled=True is set in the config.
> Alternative implementations:
> It should be possible to modify spark-streaming-twitter_2.10 to provide this support. The solutions are not very clean:
> It would require Apache Spark to define its own Status POJO. The current StatusJSONImpl class is marked final.
> The Wrapper is not going to work nicely with existing code.
> spark-streaming-twitter_2.10 does not expose all of the twitter streaming API, so many developers are writing their own implementations of org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance difficult. It's not easy to know when the spark implementation for twitter has changed.
> Code listing for spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> @volatile private var twitterStream: TwitterStream = _
> @volatile private var stopped = false
> def onStart() {
> try {
> val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
> newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
> store(status)
> }
> Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think? 
> Kind regards
> Andy
> From: on behalf of Igor Brigadir
> Reply-To:
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J
> Subject: Re: [Twitter4J] trouble writing unit test
> Main issue is that the Json object is in the wrong json format.
> eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 +0000 2015", ...
> It looks like the json you have was serialized from a java Status object, which makes the json objects different from what you get from the API; TwitterObjectFactory expects json from Twitter (I haven't had any problems using TwitterObjectFactory instead of the Deprecated DataObjectFactory).
> You could "fix" it by matching the keys & values you have with the correct twitter API json - it should look like the example here: https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, but this time use TwitterObjectFactory.getRawJSON(status) to get the Original Json from the Twitter API, and save that for later. (You must have jsonStoreEnabled=True in your config, and call getRawJSON in the same thread as .show
[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15126931#comment-15126931 ] Andrew Davidson commented on SPARK-13065: - Thanks Sachin. I do not know Scala. As a temporary workaround I wound up rewriting the twitter stuff in Java so I could pass a twitter4j.FilterQuery object. It would be much better to improve the spark api. Andy
> streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
> ---
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Environment: all
> Reporter: Andrew Davidson
> Priority: Minor
> Labels: twitter
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
> Currently the spark twitter api only allows you to configure a small subset of the possible filters:
> String[] filters = {"tag1", "tag2"};
> JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> The current implementation does:
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
> ) extends Receiver[Status](storageLevel) with Logging {
> . . .
> val query = new FilterQuery
> if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } else {
> newTwitterStream.sample()
> }
> ...
> Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
> Looks like an easy fix. See source code links below.
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()
Andrew Davidson created SPARK-13065: --- Summary: streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream() Key: SPARK-13065 URL: https://issues.apache.org/jira/browse/SPARK-13065 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.6.0 Environment: all Reporter: Andrew Davidson Priority: Minor
The twitter stream api is very powerful and provides a lot of support for twitter.com-side filtering of status objects. Whenever possible we want to let twitter do as much work as possible for us.
Currently the spark twitter api only allows you to configure a small subset of the possible filters:
String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
The current implementation does:
private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
) extends Receiver[Status](storageLevel) with Logging {
. . .
val query = new FilterQuery
if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} else {
newTwitterStream.sample()
}
...
Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we should be able to pass a FilterQuery object.
Looks like an easy fix. See source code links below.
kind regards
Andy
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
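To make the proposal concrete, here is a hedged Java sketch of what a FilterQuery-based stream gives you that the String[] overload cannot. TwitterFilterQueryUtils is the reporter's own helper from the attached tar.gz, not a Spark API, and the query settings (tags, language, bounding box) are illustrative only:

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.auth.Authorization;

public class FilterQueryExample {
    // Build a twitter.com-side filter richer than the String[] overload allows.
    static JavaDStream<Status> filteredTweets(JavaStreamingContext jssc,
                                              Authorization twitterAuth) {
        FilterQuery query = new FilterQuery()
            .track("tag1", "tag2")          // the keyword filter String[] already gives
            .language("en")                 // plus server-side language filtering
            .locations(new double[][] {     // plus a geo bounding box (SW, then NE corner)
                {-122.75, 36.8}, {-121.75, 37.8}
            });
        // TwitterFilterQueryUtils is the helper from the issue's attachment, not a
        // Spark API; the proposal is an equivalent TwitterUtils.createStream overload.
        return TwitterFilterQueryUtils.createStream(jssc, twitterAuth, query);
    }
}
```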
[jira] [Created] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
Andrew Davidson created SPARK-13009: --- Summary: spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json Key: SPARK-13009 URL: https://issues.apache.org/jira/browse/SPARK-13009 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.6.0 Reporter: Andrew Davidson Priority: Blocker
The Streaming-twitter package makes it easy for Java programmers to work with twitter. The implementation returns the raw twitter data in JSON format as a twitter4J StatusJSONImpl object:
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);
The status class is different from the raw JSON, i.e. serializing the status object will not be the same as the original json. I have downstream systems that can only process raw tweets, not twitter4J Status objects.
Here is my bug/RFE request made to Twitter4J. They asked me to create a spark tracking issue.
On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
Hi All
Quick problem summary:
My system uses the Status objects to do some analysis; however, I need to store the raw JSON. There are other systems that process that data that are not written in Java.
Currently we are serializing the Status Object. The serialized JSON is going to break downstream systems.
I am using the Apache Spark Streaming spark-streaming-twitter_2.10 http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Request For Enhancement:
I imagine easy access to the raw JSON is a common requirement. Would it be possible to add a member function getRawJson() to StatusJSONImpl? By default the returned value would be null unless jsonStoreEnabled=True is set in the config.
Alternative implementations:
It should be possible to modify spark-streaming-twitter_2.10 to provide this support. The solutions are not very clean:
It would require Apache Spark to define its own Status POJO. The current StatusJSONImpl class is marked final.
The Wrapper is not going to work nicely with existing code.
spark-streaming-twitter_2.10 does not expose all of the twitter streaming API, so many developers are writing their own implementations of org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance difficult. It's not easy to know when the spark implementation for twitter has changed.
Code listing for spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala:
private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
) extends Receiver[Status](storageLevel) with Logging {
@volatile private var twitterStream: TwitterStream = _
@volatile private var stopped = false
def onStart() {
try {
val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
newTwitterStream.addListener(new StatusListener {
def onStatus(status: Status): Unit = {
store(status)
}
Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
What do people think?
Kind regards
Andy
From: on behalf of Igor Brigadir
Reply-To:
Date: Tuesday, January 19, 2016 at 5:55 AM
To: Twitter4J
Subject: Re: [Twitter4J] trouble writing unit test
Main issue is that the Json object is in the wrong json format.
eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 +0000 2015", ...
It looks like the json you have was serialized from a java Status object, which makes the json objects different from what you get from the API; TwitterObjectFactory expects json from Twitter (I haven't had any problems using TwitterObjectFactory instead of the Deprecated DataObjectFactory). You could "fix" it by matching the keys & values you have with the correct twitter API json - it should look like the example here: https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid But it might be easier to download the tweets again, but this time use TwitterObjectFactory.getRawJSON(status) to get the Original Json from the Twitter API, and save that for later. (You must have jsonStoreEnabled=True in your config, and call getRawJSON in the same thread as .showStatus() or lookup() or whatever you're using to load tweets.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
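Until the API changes, Igor's advice can be folded into a custom receiver. Below is a hedged sketch, not the Spark implementation: it mirrors the TwitterReceiver quoted above but stores the raw JSON string instead of the Status object, assuming twitter4j 4.x (jsonStoreEnabled on, getRawJSON called on the thread that delivered the status; OAuth credentials are omitted):

```java
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterObjectFactory;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

/** Stores the raw status JSON instead of twitter4j Status objects. */
public class RawJsonReceiver extends Receiver<String> {
    private transient TwitterStream twitterStream;

    public RawJsonReceiver(StorageLevel storageLevel) {
        super(storageLevel);
    }

    @Override
    public void onStart() {
        // jsonStoreEnabled must be on, and getRawJSON must be called on the
        // same thread that delivered the status.
        ConfigurationBuilder cb = new ConfigurationBuilder()
            .setJSONStoreEnabled(true);
        // ... .setOAuthConsumerKey(...) etc. (credentials omitted) ...
        twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
        twitterStream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                store(TwitterObjectFactory.getRawJSON(status));
            }
        });
        twitterStream.sample();
    }

    @Override
    public void onStop() {
        if (twitterStream != null) {
            twitterStream.shutdown();
        }
    }
}
```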
[jira] [Created] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?
Andrew Davidson created SPARK-12606: --- Summary: Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ? Key: SPARK-12606 URL: https://issues.apache.org/jira/browse/SPARK-12606 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.2 Environment: Java 8, Mac OS, Spark-1.5.2 Reporter: Andrew Davidson
Hi Andy, I suspect that you hit the Scala/Java compatibility issue. I can also reproduce this issue, so could you file a JIRA to track it? Yanbo
2016-01-02 3:38 GMT+08:00 Andy Davidson : I am trying to write a trivial transformer I can use in my pipeline. I am using java and spark 1.5.2. It was suggested that I use the Tokenizer.scala class as an example. This should be very easy; however, I do not understand Scala and I am having trouble debugging the following exception. Any help would be greatly appreciated. Happy New Year Andy
java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
at org.apache.spark.ml.param.Params$class.set(params.scala:436)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.param.Params$class.set(params.scala:422)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
public class StemmerTest extends AbstractSparkTest {
    @Test
    public void test() {
        Stemmer stemmer = new Stemmer()
            .setInputCol("raw")   // line 30
            .setOutputCol("filtered");
    }
}
/**
 * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
 * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
 *
 * @author andrewdavidson
 */
public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
    static Logger logger = LoggerFactory.getLogger(Stemmer.class);
    private static final long serialVersionUID = 1L;
    private static final ArrayType inputType = DataTypes.createArrayType(DataTypes.StringType, true);
    private final String uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();

    @Override
    public String uid() {
        return uid;
    }

    /*
    override protected def validateInputType(inputType: DataType): Unit = {
        require(inputType == StringType, s"Input type must be string type but got $inputType.")
    }
    */
    @Override
    public void validateInputType(DataType inputTypeArg) {
        String msg = "inputType must be " + inputType.simpleString() + " but got " + inputTypeArg.simpleString();
        assert (inputType.equals(inputTypeArg)) : msg;
    }

    @Override
    public Function1<List<String>, List<String>> createTransformFunc() {
        // http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
        Function1<List<String>, List<String>> f = new AbstractFunction1<List<String>, List<String>>() {
            public List<String> apply(List<String> words) {
                for (String word : words) {
                    logger.error("AEDWIP input word: {}", word);
                }
                return words;
            }
        };
        return f;
    }

    @Override
    public DataType outputDataType() {
        return DataTypes.createArrayType(DataTypes.StringType, true);
    }
}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
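A hedged guess at the mechanism, based on the "Param null__inputCol" message: the Scala side of ml.param.Params builds each param's parent name from uid() while the superclass constructor is still running, i.e. before the Java field initializer above has executed, so the param is created with parent "null" and shouldOwn() later fails. One untested workaround sketch is to initialize uid lazily, so even the first call from the superclass constructor sees a real value:

```java
// sketch: replace the field-initialized uid in Stemmer with a lazy one
private String uid;

@Override
public String uid() {
    // Params code in the Scala superclass calls uid() during construction,
    // before Java field initializers run; initializing lazily means even
    // that first call returns a non-null value instead of "null__inputCol".
    if (uid == null) {
        uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    }
    return uid;
}
```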
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069061#comment-15069061 ] Andrew Davidson commented on SPARK-12484: - Hi Xiao, thanks for looking into this. Andy
> DataFrame withColumn() does not work in Java
> ---
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
> Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
> org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069027#comment-15069027 ] Andrew Davidson commented on SPARK-12483: - Hi Xiao, thanks for looking at the issue. Is there a way to change a column name? If you do a select() using a data frame, the column name is really strange; see the attachment for https://issues.apache.org/jira/browse/SPARK-12484

// get column from data frame call df.withColumnName
Column newCol = udfDF.col("_c0");

renaming data frame columns is very common in R. Kind regards, Andy

> Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
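Since the question above is how to rename a column from Java, here is a small hedged sketch using DataFrame.withColumnRenamed, which is part of the 1.5.x Java-friendly API. The column name "_c0" is taken from the comment above; the new name and the wrapper class are illustrative only:

```java
import org.apache.spark.sql.DataFrame;

public class RenameColumnExample {
    // Sketch: rename the auto-generated UDF output column "_c0" in place.
    static DataFrame renameUdfOutput(DataFrame udfDF) {
        return udfDF.withColumnRenamed("_c0", "transformedByUDF");
    }
}
```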
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068698#comment-15068698 ] Andrew Davidson commented on SPARK-12484: - related issue: https://issues.apache.org/jira/browse/SPARK-12483 > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068696#comment-15068696 ] Andrew Davidson commented on SPARK-12484: - What I am really trying to do is rewrite the following python code in Java. Ideally I would implement this code as an MLlib transformation; however, that does not seem possible at this point in time using the Java API. Kind regards, Andy

def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal", StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret

> DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
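A possible Java port of the Python UDF above, offered as a sketch against the Spark 1.5 Java API rather than as a verified resolution of this ticket. The names binomial and binomialLabel mirror the Python code; the enclosing class is hypothetical. The key detail is that the column handed to the UDF is resolved against the same DataFrame that withColumn() is called on (via functions.col), which avoids the "resolved attribute(s) ... missing" analysis error that occurs when the Column comes from a different DataFrame:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class BinomialLabel {

    // Sketch: register a String -> String UDF, then apply it with withColumn().
    public static DataFrame convertMultinomialLabelToBinary(SQLContext sqlContext,
                                                            DataFrame dataFrame) {
        sqlContext.udf().register("binomial", new UDF1<String, String>() {
            @Override
            public String call(String labelStr) {
                return "noise".equals(labelStr) ? "noise" : "signal";
            }
        }, DataTypes.StringType);

        // col("label") is resolved against dataFrame itself, not against some
        // other DataFrame, so the analyzer can find the attribute.
        return dataFrame.withColumn("binomialLabel", callUDF("binomial", col("label")));
    }
}
```

Registering by name and going through callUDF is the Java-friendly route in 1.5.x; the udf() helper used from Python and Scala returns a Scala function wrapper that is awkward to construct from Java.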
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068684#comment-15068684 ] Andrew Davidson commented on SPARK-12484: - you can find some more background in the email thread 'should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()' > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12484: Attachment: UDFTest.java Add a unit test file > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java
Andrew Davidson created SPARK-12484: --- Summary: DataFrame withColumn() does not work in Java Key: SPARK-12484 URL: https://issues.apache.org/jira/browse/SPARK-12484 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: mac El Cap. 10.11.2 Java 8 Reporter: Andrew Davidson

DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises

org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3];
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java added a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Comment: was deleted (was: add a unit test file ) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: (was: SPARK_12483_unitTest.java) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java add a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java
Andrew Davidson created SPARK-12483: --- Summary: Data Frame as() does not work in Java Key: SPARK-12483 URL: https://issues.apache.org/jira/browse/SPARK-12483 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: Mac El Cap 10.11.2 Java 8 Reporter: Andrew Davidson

Following unit test demonstrates a bug in as(). The column name for aliasDF was not changed

    @Test
    public void bugDataFrameAsTest() {
        DataFrame df = createData();
        df.printSchema();
        df.show();

        DataFrame aliasDF = df.select("id").as("UUID");
        aliasDF.printSchema();
        aliasDF.show();
    }

    DataFrame createData() {
        Features f1 = new Features(1, category1);
        Features f2 = new Features(2, category2);
        ArrayList<Features> data = new ArrayList<Features>(2);
        data.add(f1);
        data.add(f2);
        //JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
        JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
        DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
        return df;
    }

This is the output I got (without the spark log msgs)

root
 |-- id: integer (nullable = false)
 |-- labelStr: string (nullable = true)

+---+------------+
| id|    labelStr|
+---+------------+
|  1|       noise|
|  2|questionable|
+---+------------+

root
 |-- id: integer (nullable = false)

+---+
| id|
+---+
|  1|
|  2|
+---+

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
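One reading of the output above, hedged since it is based on the public DataFrame API rather than on this ticket's eventual resolution: DataFrame.as(alias) aliases the whole DataFrame (useful for disambiguating self-joins), so df.select("id").as("UUID") leaves the column named id. Renaming the column itself means aliasing the Column inside select(), roughly as in this sketch (the wrapper class is illustrative):

```java
import org.apache.spark.sql.DataFrame;

public class ColumnAliasExample {
    // Sketch, assuming df is the DataFrame built by createData() above.
    static DataFrame renameIdToUuid(DataFrame df) {
        // Column.as(...) renames the selected column; DataFrame.as(...)
        // only attaches an alias to the DataFrame as a whole.
        return df.select(df.col("id").as("UUID"));
    }
}
```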
[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
[ https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038574#comment-15038574 ] Andrew Davidson commented on SPARK-12110: - Hi Davies, attached is a script I wrote to launch the cluster and the output it produced when I ran it on Nov 5th, 2015. I also included a file LaunchingSparkCluster.md with directions for how to configure the cluster to use Java 8, python3, ... On a related note, I would like to update my cluster to 1.5.2. I have not been able to find any directions for how to do this. Kind regards, Andy > spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build > Spark with Hive > > > Key: SPARK-12110 > URL: https://issues.apache.org/jira/browse/SPARK-12110 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.5.1 > Environment: cluster created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 >Reporter: Andrew Davidson > Attachments: launchCluster.sh, launchCluster.sh.out, > launchingSparkCluster.md > > > I am using spark-1.5.1-bin-hadoop2.6. I used > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured > spark-env to use python3. I can not run the tokenizer sample code. Is there a > work around? > Kind regards > Andy > {code} > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 658 raise Exception("You must build Spark with Hive. " > 659 "Export 'SPARK_HIVE=true' and run " > --> 660 "build/sbt assembly", e) > 661 > 662 def _get_hive_ctx(self): > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError('An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) > http://spark.apache.org/docs/latest/ml-features.html#tokenizer > from pyspark.ml.feature import Tokenizer, RegexTokenizer > sentenceDataFrame = sqlContext.createDataFrame([ > (0, "Hi I heard about Spark"), > (1, "I wish Java could use case classes"), > (2, "Logistic,regression,models,are,neat") > ], ["label", "sentence"]) > tokenizer = Tokenizer(inputCol="sentence", outputCol="words") > wordsDataFrame = tokenizer.transform(sentenceDataFrame) > for words_label in wordsDataFrame.select("words", "label").take(3): > print(words_label) > --- > Py4JJavaError Traceback (most recent call last) > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 654 if not hasattr(self, '_scala_HiveContext'): > --> 655 self._scala_HiveContext = self._get_hive_ctx() > 656 return self._scala_HiveContext > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 662 def _get_hive_ctx(self): > --> 663 return self._jvm.HiveContext(self._jsc.sc()) > 664 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 700 return_value = get_return_value(answer, self._gateway_client, > None, > --> 701 self._fqn) > 702 > /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker
[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
[ https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12110: Attachment: launchingSparkCluster.md > spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build > Spark with Hive > > > Key: SPARK-12110 > URL: https://issues.apache.org/jira/browse/SPARK-12110 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.5.1 > Environment: cluster created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 >Reporter: Andrew Davidson > Attachments: launchCluster.sh, launchCluster.sh.out, > launchingSparkCluster.md > > > I am using spark-1.5.1-bin-hadoop2.6. I used > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured > spark-env to use python3. I can not run the tokenizer sample code. Is there a > work around? > Kind regards > Andy > {code} > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 658 raise Exception("You must build Spark with Hive. " > 659 "Export 'SPARK_HIVE=true' and run " > --> 660 "build/sbt assembly", e) > 661 > 662 def _get_hive_ctx(self): > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError('An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) > http://spark.apache.org/docs/latest/ml-features.html#tokenizer > from pyspark.ml.feature import Tokenizer, RegexTokenizer > sentenceDataFrame = sqlContext.createDataFrame([ > (0, "Hi I heard about Spark"), > (1, "I wish Java could use case classes"), > (2, "Logistic,regression,models,are,neat") > ], ["label", "sentence"]) > tokenizer = Tokenizer(inputCol="sentence", outputCol="words") > wordsDataFrame = tokenizer.transform(sentenceDataFrame) > for words_label in wordsDataFrame.select("words", "label").take(3): > print(words_label) > --- > Py4JJavaError Traceback (most recent call last) > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 654 if not hasattr(self, '_scala_HiveContext'): > --> 655 self._scala_HiveContext = self._get_hive_ctx() > 656 return self._scala_HiveContext > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 662 def _get_hive_ctx(self): > --> 663 return self._jvm.HiveContext(self._jsc.sc()) > 664 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 700 return_value = get_return_value(answer, self._gateway_client, > None, > --> 701 self._fqn) > 702 > /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. 
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:214) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:
[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
[ https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12110: Attachment: launchCluster.sh.out launchCluster.sh launchCluster.sh is a wrapper around the spark-ec2 script; launchCluster.sh.out is the output from when I ran this script on Nov 5th, 2015 > spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build > Spark with Hive > > > Key: SPARK-12110 > URL: https://issues.apache.org/jira/browse/SPARK-12110 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.5.1 > Environment: cluster created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 >Reporter: Andrew Davidson > Attachments: launchCluster.sh, launchCluster.sh.out > > > I am using spark-1.5.1-bin-hadoop2.6. I used > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured > spark-env to use python3. I can not run the tokenizer sample code. Is there a > work around? > Kind regards > Andy > {code} > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 658 raise Exception("You must build Spark with Hive. " > 659 "Export 'SPARK_HIVE=true' and run " > --> 660 "build/sbt assembly", e) > 661 > 662 def _get_hive_ctx(self): > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError('An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) > http://spark.apache.org/docs/latest/ml-features.html#tokenizer > from pyspark.ml.feature import Tokenizer, RegexTokenizer > sentenceDataFrame = sqlContext.createDataFrame([ > (0, "Hi I heard about Spark"), > (1, "I wish Java could use case classes"), > (2, "Logistic,regression,models,are,neat") > ], ["label", "sentence"]) > tokenizer = Tokenizer(inputCol="sentence", outputCol="words") > wordsDataFrame = tokenizer.transform(sentenceDataFrame) > for words_label in wordsDataFrame.select("words", "label").take(3): > print(words_label) > --- > Py4JJavaError Traceback (most recent call last) > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 654 if not hasattr(self, '_scala_HiveContext'): > --> 655 self._scala_HiveContext = self._get_hive_ctx() > 656 return self._scala_HiveContext > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 662 def _get_hive_ctx(self): > --> 663 return self._jvm.HiveContext(self._jsc.sc()) > 664 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 700 return_value = get_return_value(answer, self._gateway_client, > None, > --> 701 self._fqn) > 702 > /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:214) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) > at py4j.commands.ConstructorCommand.execute(Constru
[jira] [Commented] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15038563#comment-15038563 ] Andrew Davidson commented on SPARK-12111: - Hi Sean, I am unable to find instructions for upgrading existing installations. Can you point me at the documentation for upgrading? I built the cluster using the spark-ec2 script. Kind Regards, Andy > need upgrade instruction > > > Key: SPARK-12111 > URL: https://issues.apache.org/jira/browse/SPARK-12111 > Project: Spark > Issue Type: Documentation > Components: EC2 >Affects Versions: 1.5.1 >Reporter: Andrew Davidson > Labels: build, documentation > > I have looked all over the spark website and googled. I have not found > instructions for how to upgrade spark in general let alone a cluster created > by using spark-ec2 script > thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
[ https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037022#comment-15037022 ] Andrew Davidson commented on SPARK-12110: - Hi Patrick, when I run the same example code on my local MacBook Pro, it runs fine. I am a newbie. Is the spark-ec2 script deprecated? I noticed on my cluster

[ec2-user@ip-172-31-29-60 notebooks]$ cat /root/spark/RELEASE
Spark 1.5.1 built for Hadoop 1.2.1
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver -DzincPort=3030
[ec2-user@ip-172-31-29-60 notebooks]$

on my local mac

$ cat ./spark-1.5.1-bin-hadoop2.6/RELEASE
Spark 1.5.1 built for Hadoop 2.6.0
Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DzincPort=3034
$

It looks like ./spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 may have installed the wrong version of spark?

> spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build > Spark with Hive > > > Key: SPARK-12110 > URL: https://issues.apache.org/jira/browse/SPARK-12110 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.5.1 > Environment: cluster created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 >Reporter: Andrew Davidson > > I am using spark-1.5.1-bin-hadoop2.6. I used > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured > spark-env to use python3. I can not run the tokenizer sample code. Is there a > work around? > Kind regards > Andy > {code} > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 658 raise Exception("You must build Spark with Hive. " > 659 "Export 'SPARK_HIVE=true' and run " > --> 660 "build/sbt assembly", e) > 661 > 662 def _get_hive_ctx(self): > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError('An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) > http://spark.apache.org/docs/latest/ml-features.html#tokenizer > from pyspark.ml.feature import Tokenizer, RegexTokenizer > sentenceDataFrame = sqlContext.createDataFrame([ > (0, "Hi I heard about Spark"), > (1, "I wish Java could use case classes"), > (2, "Logistic,regression,models,are,neat") > ], ["label", "sentence"]) > tokenizer = Tokenizer(inputCol="sentence", outputCol="words") > wordsDataFrame = tokenizer.transform(sentenceDataFrame) > for words_label in wordsDataFrame.select("words", "label").take(3): > print(words_label) > --- > Py4JJavaError Traceback (most recent call last) > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 654 if not hasattr(self, '_scala_HiveContext'): > --> 655 self._scala_HiveContext = self._get_hive_ctx() > 656 return self._scala_HiveContext > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 662 def _get_hive_ctx(self): > --> 663 return self._jvm.HiveContext(self._jsc.sc()) > 664 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 700 return_value = get_return_value(answer, self._gateway_client, > None, > --> 701 self._fqn) > 702 > /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext. > : java.lang.RuntimeException: java.io.IOException: Filesystem closed > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCons
[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
[ https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037017#comment-15037017 ] Andrew Davidson commented on SPARK-12110: - Hi Patrick, here is how I start my notebook on my cluster.

$ cat ../bin/startIPythonNotebook.sh
export SPARK_ROOT=/root/spark
export MASTER_URL=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
extraPkgs='--packages com.databricks:spark-csv_2.11:1.3.0'
numCores=3 # one for driver 2 for workers

$SPARK_ROOT/bin/pyspark \
 --master $MASTER_URL \
 --total-executor-cores $numCores \
 --driver-memory 2G \
 --executor-memory 2G \
 $extraPkgs \
 $*

> spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build > Spark with Hive > > > Key: SPARK-12110 > URL: https://issues.apache.org/jira/browse/SPARK-12110 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.5.1 > Environment: cluster created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 >Reporter: Andrew Davidson > > I am using spark-1.5.1-bin-hadoop2.6. I used > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured > spark-env to use python3. I can not run the tokenizer sample code. Is there a > work around? > Kind regards > Andy > {code} > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 658 raise Exception("You must build Spark with Hive. " > 659 "Export 'SPARK_HIVE=true' and run " > --> 660 "build/sbt assembly", e) > 661 > 662 def _get_hive_ctx(self): > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError('An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38)) > http://spark.apache.org/docs/latest/ml-features.html#tokenizer > from pyspark.ml.feature import Tokenizer, RegexTokenizer > sentenceDataFrame = sqlContext.createDataFrame([ > (0, "Hi I heard about Spark"), > (1, "I wish Java could use case classes"), > (2, "Logistic,regression,models,are,neat") > ], ["label", "sentence"]) > tokenizer = Tokenizer(inputCol="sentence", outputCol="words") > wordsDataFrame = tokenizer.transform(sentenceDataFrame) > for words_label in wordsDataFrame.select("words", "label").take(3): > print(words_label) > --- > Py4JJavaError Traceback (most recent call last) > /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self) > 654 if not hasattr(self, '_scala_HiveContext'): > --> 655 self._scala_HiveContext = self._get_hive_ctx() > 656 return self._scala_HiveContext > /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self) > 662 def _get_hive_ctx(self): > --> 663 return self._jvm.HiveContext(self._jsc.sc()) > 664 > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 700 return_value = get_return_value(answer, self._gateway_client, > None, > --> 701 self._fqn) > 702 > /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) > at > org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171) > at > org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162) > at > org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >
[jira] [Commented] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036887#comment-15036887 ] Andrew Davidson commented on SPARK-12111: - Hi Sean, I understand I will need to stop my cluster to change to a different version. I am looking for directions for how to "change to a different version". E.g. on my local mac I have several different versions of spark downloaded. I have an env var SPARK_ROOT=pathToVersion I want to use. To use something like pyspark I would run $ $SPARK_ROOT/bin/pyspark. I am looking for directions for how to do something similar in a cluster env. I think the rough steps would be 1) stop the cluster 2) download the binary. Is the binary the same on all the machines (i.e. masters and slaves)? 3) I am not sure what to do about the config/* > need upgrade instruction > > > Key: SPARK-12111 > URL: https://issues.apache.org/jira/browse/SPARK-12111 > Project: Spark > Issue Type: Documentation > Components: EC2 >Affects Versions: 1.5.1 >Reporter: Andrew Davidson > Labels: build, documentation > > I have looked all over the spark website and googled. I have not found > instructions for how to upgrade spark in general let alone a cluster created > by using spark-ec2 script > thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036888#comment-15036888 ] Andrew Davidson commented on SPARK-12111: - This is where someone who knows the details of how spark gets built and installed needs to provide some directions > need upgrade instruction > > > Key: SPARK-12111 > URL: https://issues.apache.org/jira/browse/SPARK-12111 > Project: Spark > Issue Type: Documentation > Components: EC2 >Affects Versions: 1.5.1 >Reporter: Andrew Davidson > Labels: build, documentation > > I have looked all over the spark website and googled. I have not found > instructions for how to upgrade spark in general let alone a cluster created > by using spark-ec2 script > thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson reopened SPARK-12111: - Hi Sean, it must be possible for customers to upgrade installations. Given Spark is written in Java, it is probably a matter of replacing jar files and maybe making a few changes to config files. Whoever is responsible for build/release of spark can probably write down the instructions. It's not reasonable to say destroy your old cluster and re-install it. In my experience spark does not work out of the box; you have to do a lot of work to configure it properly. I have a lot of data on HDFS; I can not simply move it. Sincerely yours, Andy > need upgrade instruction > > > Key: SPARK-12111 > URL: https://issues.apache.org/jira/browse/SPARK-12111 > Project: Spark > Issue Type: Documentation > Components: EC2 >Affects Versions: 1.5.1 >Reporter: Andrew Davidson > Labels: build, documentation > > I have looked all over the spark website and googled. I have not found > instructions for how to upgrade spark in general let alone a cluster created > by using spark-ec2 script > thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12111) need upgrade instruction
Andrew Davidson created SPARK-12111: --- Summary: need upgrade instruction Key: SPARK-12111 URL: https://issues.apache.org/jira/browse/SPARK-12111 Project: Spark Issue Type: Documentation Components: EC2 Affects Versions: 1.5.1 Reporter: Andrew Davidson I have looked all over the spark website and googled. I have not found instructions for how to upgrade spark in general let alone a cluster created by using spark-ec2 script thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive
Andrew Davidson created SPARK-12110: --- Summary: spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive Key: SPARK-12110 URL: https://issues.apache.org/jira/browse/SPARK-12110 Project: Spark Issue Type: Bug Components: ML, PySpark, SQL Affects Versions: 1.5.1 Environment: cluster created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 Reporter: Andrew Davidson

I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I can not run the tokenizer sample code. Is there a work around? Kind regards, Andy

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
    658                 raise Exception("You must build Spark with Hive. "
    659                                 "Export 'SPARK_HIVE=true' and run "
--> 660                                 "build/sbt assembly", e)
    661
    662     def _get_hive_ctx(self):

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError('An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))

http://spark.apache.org/docs/latest/ml-features.html#tokenizer

from pyspark.ml.feature import Tokenizer, RegexTokenizer
sentenceDataFrame = sqlContext.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
    print(words_label)

---
Py4JJavaError Traceback (most recent call last)
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
    654         if not hasattr(self, '_scala_HiveContext'):
--> 655             self._scala_HiveContext = self._get_hive_ctx()
    656         return self._scala_HiveContext

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
    662     def _get_hive_ctx(self):
--> 663         return self._jvm.HiveContext(self._jsc.sc())
    664

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    700         return_value = get_return_value(answer, self._gateway_client, None,
--> 701             self._fqn)
    702

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     35     try:
---> 36         return f(*a, **kw)
     37     except py4j.protocol.Py4JJavaError as e:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301             else:

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at org.apache.hadoop.hive.ql.sess
[jira] [Created] (SPARK-12100) bug in spark/python/pyspark/rdd.py portable_hash()
Andrew Davidson created SPARK-12100: --- Summary: bug in spark/python/pyspark/rdd.py portable_hash() Key: SPARK-12100 URL: https://issues.apache.org/jira/browse/SPARK-12100 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.1 Reporter: Andrew Davidson Priority: Minor

I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED? Should I file a bug? Kind regards, Andy

details

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

Example from documentation does not work out of the box

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

The following script fixes the problem

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done

This is how I am starting spark

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"

$SPARK_ROOT/bin/pyspark \
 --master $MASTER_URL \
 --total-executor-cores $numCores \
 --driver-memory 2G \
 --executor-memory 2G \
 $extraPkgs \
 $*

see email thread "possible bug spark/python/pyspark/rdd.py portable_hash()" on user@spark for more info

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994616#comment-14994616 ] Andrew Davidson commented on SPARK-11509: - okay, after a couple of days of hacking it looks like my test program might be working. Here is my recipe (I hope this helps others). My test program is now

In [1]: from pyspark import SparkContext
        textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
In [2]: print("hello world")
hello world
In [3]: textFile.take(3)
Out[3]: [' hello world', '']

Installation instructions

1. Ssh to cluster master
2. sudo su
3. Install python3.4 on all machines
```
yum install -y python34
bash-4.2# which python3
/usr/bin/python3
pssh -h /root/spark-ec2/slaves yum install -y python34
```
4. Install pip on all machines
```
yum list available |grep pip
yum install -y python34-pip
find /usr/bin -name "*pip*" -print
/usr/bin/pip-3.4
pssh -h /root/spark-ec2/slaves yum install -y python34-pip
```
5. Install ipython on master and the slaves
```
/usr/bin/pip-3.4 install ipython
pssh -h /root/spark-ec2/slaves /usr/bin/pip-3.4 install python
```
6. Install python development packages and jupyter on master
```
yum install -y python34-devel
/usr/bin/pip-3.4 install jupyter
```
7. Update spark-env.sh on all machines so that by default we use python3.4
```
cd /root/spark/conf
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=python3.4\n" >> /root/spark/conf/spark-env.sh
for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
```
8. Restart cluster
```
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
```

Running ipython notebook

1. Set up an ssh tunnel on your local machine
ssh -i $KEY_FILE -N -f -L localhost::localhost:7000 ec2-user@$SPARK_MASTER
2. Log on to cluster master and start the ipython notebook server
```
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000"
$SPARK_ROOT/bin/pyspark --master local[2]
```
3. On your local machine open http://localhost:

> ipython notebooks do not work on clusters created using > spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script > -- > > Key: SPARK-11509 > URL: https://issues.apache.org/jira/browse/SPARK-11509 > Project: Spark > Issue Type: Bug > Components: Documentation, EC2, PySpark >Affects Versions: 1.5.1 > Environment: AWS cluster > [ec2-user@ip-172-31-29-60 ~]$ uname -a > Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 > SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Andrew Davidson > > I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac. > I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am > able to run the java SparkPi example on the cluster; however, I am not able to > run ipython notebooks on the cluster.
(I connect using ssh tunnel) > According to the 1.5.1 getting started doc > http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell > The following should work > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook > --no-browser --port=7000" /root/spark/bin/pyspark > I am able to connect to the notebook server and start a notebook how ever > bug 1) the default sparkContext does not exist > from pyspark import SparkContext > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3 > --- > NameError Traceback (most recent call last) > in () > 1 from pyspark import SparkContext > > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > 3 textFile.take(3) > NameError: name 'sc' is not defined > bug 2) > If I create a SparkContext I get the following python versions miss match > error > sc = SparkContext("local", "Simple App") > textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") > textFile.take(3) > File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main > ("%d.%d" % sys.version_info[:2], version)) > Exception: Python in worker has different version 2.7 than that in driver > 2.6, PySpark cannot run with different minor versions > I am able to run ipython notebooks on my local Mac as follows. (by default > you would get an error that the driver and works are using different version > of python) > $ cat ~/bin/pySparkNotebook.sh > #!/bin/sh > set -x # turn debugging on > #set +x # turn debugging off > export PYSPARK_PYTHON=python3 > export PYSPARK_DRIVER_PYTHON=python3 > IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ > I have spent a lot of time trying to debug
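Bug 2 in the quoted report is easiest to diagnose by asking the cluster which interpreters are actually in play. The following is a hedged sketch, not from the thread; the app name is arbitrary and it assumes a SparkContext can be created at all:

```python
import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")  # illustrative app name

driver = "%d.%d" % sys.version_info[:2]

def worker_version(_):
    import sys
    return "%d.%d" % sys.version_info[:2]

# One tiny task per default partition, collected back to the driver.
workers = set(
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(worker_version)
      .collect())

print("driver: %s  workers: %s" % (driver, workers))
# PySpark refuses mismatched minor versions, so fail fast with a clear hint.
assert workers == {driver}, "set PYSPARK_PYTHON so workers match the driver"
```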
[jira] [Reopened] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson reopened SPARK-11509: -

My issue is not resolved. I am able to use ipython notebooks on my local mac but still cannot run them on my cluster. I have tried launching the same way I do on my mac:

```
export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=python2.7
IPYTHON_OPTS="notebook --no-browser --port=7000" $SPARK_ROOT/bin/pyspark
```

I also tried setting export PYSPARK_PYTHON=python2.7 in /root/spark/conf/spark-env.sh on all my machines. The following code example

```
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")  # strange, we should not have to create sc
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)
```

generates the following error message:

```
Py4JJavaError                             Traceback (most recent call last)
in <module>()
      2 sc = SparkContext("local", "Simple App")
      3 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
----> 4 textFile.take(3)

/root/spark/python/pyspark/rdd.py in take(self, num)
   1297
   1298         p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1299         res = self.context.runJob(self, takeUpToNumLeft, p)
   1300
   1301         items += res

/root/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    914         # SparkContext#runJob.
    915         mappedRDD = rdd.mapPartitions(partitionFunc)
--> 916         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
    917         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    918

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298             raise Py4JJavaError(
    299                 'An error occurred while calling {0}{1}{2}.\n'.
--> 300                 format(target_id, '.', name), value)
    301         else:
    302             raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGSched
```
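One way to rule out environment plumbing when this mismatch appears is to pin the interpreter inside the driver script itself, before the JVM gateway starts. A sketch under stated assumptions: the interpreter paths are placeholders, and setting these variables in os.environ before constructing the context is usually, though not always, sufficient in local mode, since the gateway JVM inherits the driver process environment:

```python
import os

# Placeholder paths; the only requirement is that driver and workers agree.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python2.7"

# Import and build the context only after the environment is pinned, so the
# gateway JVM (and the local-mode workers it forks) inherit the setting.
from pyspark import SparkContext

sc = SparkContext("local", "Simple App")
print(sc.parallelize([1, 2, 3]).count())  # should print 3, with no version error
```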
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990746#comment-14990746 ] Andrew Davidson commented on SPARK-11509: -

I forgot to mention: on my cluster master I was able to run bin/pyspark --master local[2] without any problems. I was able to access sc without any issues.

> ipython notebooks do not work on clusters created using
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
> Issue Type: Bug
> Components: Documentation, EC2, PySpark
> Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am able to run the java SparkPi example on the cluster, however I am not able to run ipython notebooks on the cluster. (I connect using an ssh tunnel)
> According to the 1.5.1 getting started doc http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell the following should work:
> PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook, however:
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> ---
> NameError Traceback (most recent call last)
> in <module>()
>       1 from pyspark import SparkContext
> ----> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>       3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2) if I create a SparkContext I get the following Python version mismatch error
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
> I am able to run ipython notebooks on my local Mac as follows (by default you would get an error that the driver and workers are using different versions of python):
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*
> I have spent a lot of time trying to debug the pyspark script, however I cannot figure out what the problem is.
> Please let me know if there is something I can do to help
> Andy

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
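That observation is consistent with how the two entry points differ: bin/pyspark runs a startup script that creates sc before handing you the prompt, while a bare ipython or notebook kernel does not. A minimal illustration:

```python
# Inside the bin/pyspark REPL the startup code has already created `sc`,
# so this runs with no setup at all:
sc.parallelize([1, 2, 3]).sum()   # -> 6

# In a notebook kernel that was not launched through bin/pyspark, the same
# line raises NameError: name 'sc' is not defined, which is "bug 1" in the
# quoted report above.
```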
[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990742#comment-14990742 ] Andrew Davidson commented on SPARK-11509: -

Yes, it appears the show-stopper issue I am facing is that the python versions do not match. However, on my local mac I was able to figure out how to get everything to match. That technique does not work on a spark cluster; I tried a lot of hacking, however I cannot seem to get the versions to match. I plan to install python 3 on all machines; maybe that will work better.

kind regards

Andy

> ipython notebooks do not work on clusters created using
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
> Issue Type: Bug
> Components: Documentation, EC2, PySpark
> Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am able to run the java SparkPi example on the cluster, however I am not able to run ipython notebooks on the cluster. (I connect using an ssh tunnel)
> According to the 1.5.1 getting started doc http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell the following should work:
> PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook, however:
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> ---
> NameError Traceback (most recent call last)
> in <module>()
>       1 from pyspark import SparkContext
> ----> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>       3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2) if I create a SparkContext I get the following Python version mismatch error
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
> I am able to run ipython notebooks on my local Mac as follows (by default you would get an error that the driver and workers are using different versions of python):
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*
> I have spent a lot of time trying to debug the pyspark script, however I cannot figure out what the problem is.
> Please let me know if there is something I can do to help
> Andy

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
Andrew Davidson created SPARK-11509: --- Summary: ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script Key: SPARK-11509 URL: https://issues.apache.org/jira/browse/SPARK-11509 Project: Spark Issue Type: Bug Components: Documentation, EC2, PySpark Affects Versions: 1.5.1 Environment: AWS cluster [ec2-user@ip-172-31-29-60 ~]$ uname -a Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux Reporter: Andrew Davidson

I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local mac.

I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am able to run the java SparkPi example on the cluster, however I am not able to run ipython notebooks on the cluster. (I connect using an ssh tunnel)

According to the 1.5.1 getting started doc http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell the following should work:

```
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" /root/spark/bin/pyspark
```

I am able to connect to the notebook server and start a notebook, however:

bug 1) the default sparkContext does not exist

```
from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)
---
NameError Traceback (most recent call last)
in <module>()
      1 from pyspark import SparkContext
----> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
      3 textFile.take(3)
NameError: name 'sc' is not defined
```

bug 2) if I create a SparkContext I get the following Python version mismatch error

```
sc = SparkContext("local", "Simple App")
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
```

I am able to run ipython notebooks on my local Mac as follows (by default you would get an error that the driver and workers are using different versions of python):

```
$ cat ~/bin/pySparkNotebook.sh
#!/bin/sh
set -x # turn debugging on
#set +x # turn debugging off
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*
```

I have spent a lot of time trying to debug the pyspark script, however I cannot figure out what the problem is.

Please let me know if there is something I can do to help

Andy

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
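For bug 1, a common notebook-side workaround (a sketch, not an official fix; the app name is invented) is to create the context only when the shell has not injected one:

```python
from pyspark import SparkContext

# Reuse the shell-provided context when it exists; otherwise build one.
try:
    sc  # defined by bin/pyspark's startup script when present
except NameError:
    sc = SparkContext(appName="notebook")  # invented app name

textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
print(textFile.take(3))
```

This only papers over the missing sc; the bug 2 version mismatch still has to be fixed in the environment.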
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168788#comment-14168788 ] Andrew Davidson commented on SPARK-922: ---

Also forgot to mention: there are a couple of steps on http://nbviewer.ipython.org/gist/JoshRosen/6856670 that are important in the upgrade process.

```
# restart spark
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
```

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
> Issue Type: Task
> Components: EC2, PySpark
> Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
> Reporter: Josh Rosen
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168787#comment-14168787 ] Andrew Davidson commented on SPARK-922: ---

Wow, upgrading matplotlib was a bear. The following worked for me; the trick was getting the correct version of the source code. The recipe below is not 100% correct: I have not figured out how to use pssh with yum, because yum prompts you y/n before downloading (yum's -y flag answers the prompt automatically).

```
pip2.7 install six
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install six

pip2.7 install python-dateutil
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install python-dateutil

pip2.7 install pyparsing
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install pyparsing

yum install yum-utils
wget https://github.com/matplotlib/matplotlib/archive/master.tar.gz
tar -zxvf master.tar.gz
cd matplotlib-master/
yum install freetype-devel
yum install libpng-devel
python2.7 setup.py build
python2.7 setup.py install
```

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
> Issue Type: Task
> Components: EC2, PySpark
> Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
> Reporter: Josh Rosen
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM:

Here is how I am launching iPython notebook. I am running as the ec2-user:

```
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
```

Below are all the upgrade commands I ran. I ran into a small problem: the ipython magic %matplotlib inline creates an error; you can work around this by commenting it out.

Andy

```
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy

pip2.7 install ipython[all]

printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
```

was (Author: aedwip): Here is how I am launching iPython notebook. I am running as the ec2-user:

```
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
```

Below are all the upgrade commands I ran. Any idea what I missed?

Andy

```
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy

pip2.7 install ipython[all]

printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
```

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
> Issue Type: Task
> Components: EC2, PySpark
> Affects Versions: 0.9.0, 0.9.1, 1.0.0
> Reporter: Josh Rosen
> Fix For: 1.2.0
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:03 PM:

Here is how I am launching iPython notebook. I am running as the ec2-user:

```
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
```

Below are all the upgrade commands I ran. Any idea what I missed?

Andy

```
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy

pip2.7 install ipython[all]

printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
```

was (Author: aedwip): I must have missed something. I am running iPython notebook over an ssh tunnel, and I am still running the old version. I made sure to export PYSPARK_PYTHON=python2.7. I also tried export PYSPARK_PYTHON=/usr/bin/python2.7

```
import IPython
print IPython.sys_info()
{'commit_hash': '858d539',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/usr/lib/python2.6/site-packages/ipython-0.13.2-py2.6.egg/IPython',
 'ipython_version': '0.13.2',
 'os_name': 'posix',
 'platform': 'Linux-3.4.37-40.44.amzn1.x86_64-x86_64-with-glibc2.2.5',
 'sys_executable': '/usr/bin/python2.6',
 'sys_platform': 'linux2',
 'sys_version': '2.6.9 (unknown, Sep 13 2014, 00:25:11) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]'}
```

Here is how I am launching iPython notebook. I am running as the ec2-user:

```
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
```

Below are all the upgrade commands I ran. Any idea what I missed?

Andy

```
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy

pip2.7 install ipython[all]

printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
```

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
> Issue Type: Task
> Components: EC2, PySpark
> Affects Versions: 0.9.0, 0.9.1, 1.0.0
> Reporter: Josh Rosen
> Fix For: 1.2.0
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150695#comment-14150695 ] Andrew Davidson commented on SPARK-922: ---

I must have missed something. I am running iPython notebook over an ssh tunnel, and I am still running the old version. I made sure to export PYSPARK_PYTHON=python2.7. I also tried export PYSPARK_PYTHON=/usr/bin/python2.7

```
import IPython
print IPython.sys_info()
{'commit_hash': '858d539',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/usr/lib/python2.6/site-packages/ipython-0.13.2-py2.6.egg/IPython',
 'ipython_version': '0.13.2',
 'os_name': 'posix',
 'platform': 'Linux-3.4.37-40.44.amzn1.x86_64-x86_64-with-glibc2.2.5',
 'sys_executable': '/usr/bin/python2.6',
 'sys_platform': 'linux2',
 'sys_version': '2.6.9 (unknown, Sep 13 2014, 00:25:11) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]'}
```

Here is how I am launching iPython notebook. I am running as the ec2-user:

```
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark
```

Below are all the upgrade commands I ran. Any idea what I missed?

Andy

```
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"

easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip

pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy

pip2.7 install ipython[all]

printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
```

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
> Issue Type: Task
> Components: EC2, PySpark
> Affects Versions: 0.9.0, 0.9.1, 1.0.0
> Reporter: Josh Rosen
> Fix For: 1.2.0
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
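The sys_info() output above is the tell: sys_executable is still /usr/bin/python2.6, because the notebook runs under whichever interpreter ipython itself was installed with, regardless of PYSPARK_PYTHON. A hedged sketch (the app name is arbitrary) for checking the driver and worker interpreters in one go:

```python
import sys
from pyspark import SparkContext

sc = SparkContext(appName="which-python")  # arbitrary app name

def worker_executable(_):
    import sys
    return sys.executable

print("driver : %s" % sys.executable)
print("worker : %s" % sc.parallelize([0], 1).map(worker_executable).first())
# If the driver line still shows python2.6, the notebook kernel is the 2.6
# ipython install; installing ipython under 2.7 (pip2.7 install ipython[all],
# as in the upgrade commands above) is what moves the driver side.
```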