[jira] [Created] (SPARK-21020) How to implement custom input source for creating streaming DataFrames?
Vijay created SPARK-21020: - Summary: How to implement custom input source for creating streaming DataFrames? Key: SPARK-21020 URL: https://issues.apache.org/jira/browse/SPARK-21020 Project: Spark Issue Type: Brainstorming Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Vijay Priority: Minor Can someone please explain how to implement a custom input source for creating streaming DataFrames, similar to a custom receiver in DStreams? Any references/suggestions are appreciated. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
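For context, Spark 2.1 has no stable public API for this; the built-in sources (socket, file, Kafka) implement internal traits, so any custom source is tied to unstable interfaces. A rough, non-runnable skeleton, with trait and method names taken from my reading of the Spark 2.1 internals (treat them as assumptions and check against the source tree before use):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.StructType

// Registered via Java's ServiceLoader (META-INF/services) so that
// spark.readStream.format("my-source") can find it.
class MySourceProvider extends StreamSourceProvider with DataSourceRegister {
  override def shortName(): String = "my-source"

  override def sourceSchema(sqlContext: SQLContext, schema: Option[StructType],
      providerName: String, parameters: Map[String, String]): (String, StructType) =
    (shortName(), ???)  // the fixed schema this source produces

  override def createSource(sqlContext: SQLContext, metadataPath: String,
      schema: Option[StructType], providerName: String,
      parameters: Map[String, String]): Source = new Source {
    override def schema: StructType = ???
    override def getOffset: Option[Offset] = ???  // latest available offset, None if no data yet
    override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???  // rows in (start, end]
    override def stop(): Unit = ???  // release connections/threads
  }
}
```

The sketch is the Structured Streaming counterpart of a DStream custom receiver, but pull-based: the engine repeatedly asks for the latest offset and then for the batch between two offsets, instead of the source pushing records.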
[jira] [Created] (SPARK-21182) Structured streaming on Spark-shell on windows
Vijay created SPARK-21182: - Summary: Structured streaming on Spark-shell on windows Key: SPARK-21182 URL: https://issues.apache.org/jira/browse/SPARK-21182 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.1 Environment: Windows 10 spark-2.1.1-bin-hadoop2.7 Reporter: Vijay Priority: Minor The structured streaming output operation is failing in the Windows shell. As the error message shows, the path is being prefixed with the Unix-style file separator, causing the IllegalArgumentException. The error message follows. scala> val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() java.lang.IllegalArgumentException: Pathname {color:red}*/*{color}C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets from C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets is not a valid DFS filename. at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268) ... 
52 elided
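The shape of the failure can be shown with plain string handling: the checkpoint path is a Windows drive-letter path that has gained a spurious leading "/", which the HDFS client then rejects as "not a valid DFS filename". The helper below is hypothetical (not part of Spark; the real fix has to happen where Spark builds the checkpoint path), just a sketch of the normalization that is missing. Passing an explicit checkpointLocation option to writeStream may also sidestep the temp-directory path, though I have not verified that on 2.1.1 on Windows.

```scala
// Hypothetical normalizer: strip a spurious leading "/" that precedes a
// Windows drive letter, and leave genuine absolute Unix paths untouched.
def stripLeadingSlashBeforeDrive(path: String): String =
  if (path.matches("/[A-Za-z]:/.*")) path.substring(1) else path

// The malformed checkpoint path from the stack trace:
println(stripLeadingSlashBeforeDrive("/C:/Users/Vijay/AppData/Local/Temp/offsets"))
// A Unix path passes through unchanged:
println(stripLeadingSlashBeforeDrive("/home/vijay/offsets"))
```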
[jira] [Commented] (SPARK-21182) Structured streaming on Spark-shell on windows
[ https://issues.apache.org/jira/browse/SPARK-21182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065921#comment-16065921 ] Vijay commented on SPARK-21182: --- I'm still facing the same issue. I have configured Hadoop on Windows along with Spark; could that be the cause?
[jira] [Created] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
Vijay created SPARK-4402: Summary: Output path validation of an action statement resulting in runtime exception Key: SPARK-4402 URL: https://issues.apache.org/jira/browse/SPARK-4402 Project: Spark Issue Type: Wish Reporter: Vijay Priority: Minor Output path validation happens at the time of statement execution, as part of the lazy evaluation of the action statement. If the path already exists, a runtime exception is thrown, so all the processing completed up to that point is lost, which wastes resources (processing time and CPU usage). If this I/O-related validation were done before the RDD action operations, the runtime exception could be avoided. I believe a similar validation feature is implemented in Hadoop as well. Example: SchemaRDD.saveAsTextFile() evaluates the path at runtime
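Pending any change in Spark itself, the requested behavior can be approximated on the user side with a pre-flight check before any transformations or actions run. ensureOutputPathFree below is a hypothetical helper (not a Spark API), sketched with java.io only:

```scala
import java.io.File

// Hypothetical pre-flight check: fail fast (or clean up) BEFORE any RDD work
// starts, instead of letting saveAsTextFile() hit FileAlreadyExistsException
// after all the processing is done.
def ensureOutputPathFree(path: String, overwrite: Boolean = false): Unit = {
  val dir = new File(path)
  if (dir.exists()) {
    if (overwrite) {
      // Recursively delete the stale output directory.
      def deleteRec(f: File): Unit = {
        Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRec)
        f.delete()
      }
      deleteRec(dir)
    } else {
      sys.error(s"Output directory $path already exists")
    }
  }
}
```

Called at the top of main(), this moves the failure to program start, before any cluster time is spent; it is of course racy if another job creates the directory in between, which is one reason Hadoop rechecks at commit time.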
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213729#comment-14213729 ] Vijay commented on SPARK-4402: -- Thanks for the reply [~srowen]. This is a different scenario from SPARK-1100, which is about the output directory being overwritten if it exists; I think that fix works fine. My concern is that Spark throws a runtime exception if the output directory exists. This happens after all the previous action statements have executed, resulting in abrupt termination of the program, and the results of those statements are lost. Please confirm whether this abrupt program termination is expected.
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214320#comment-14214320 ] Vijay commented on SPARK-4402: -- Yes, the output path is being validated in PairRDDFunctions.saveAsHadoopDataset; the exception details are below. So the output path is validated only during the execution of saveAsHadoopDataset, after all the preceding statements have completed. My question is whether this validation can be performed up front, when program execution starts. Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/HadoopUser/eclipse-scala/test/output1 already exists at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:968) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:792) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1159) at test.OutputTest$.main(OutputTest.scala:19) at test.OutputTest.main(OutputTest.scala)
[jira] [Commented] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214635#comment-14214635 ] Vijay commented on SPARK-4402: -- Thanks for the explanation. It is clear now.
[jira] [Resolved] (SPARK-4402) Output path validation of an action statement resulting in runtime exception
[ https://issues.apache.org/jira/browse/SPARK-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay resolved SPARK-4402. -- Resolution: Not a Problem
[jira] [Created] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
vijay created SPARK-6435: Summary: spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala> import com.google.common.base.Strings :19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code}
[jira] [Updated] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijay updated SPARK-6435: - Description: Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala> import com.google.common.base.Strings :19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} was: Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar 3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala> import com.google.common.base.Strings :19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code}
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371420#comment-14371420 ] vijay commented on SPARK-6435: -- It works when guava is the 1st or 2nd jar. Not sure at what point Spark starts dropping jars, but I had this issue with multiple 'real' jars (i.e. containing .class files) in the --jars option: if I move a jar to the front of the list, it works; move it to the back, it fails.
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375707#comment-14375707 ] vijay commented on SPARK-6435: -- I tested this on Linux with the 1.3.0 release; it works fine, so this is apparently a Windows-specific issue where only the 1st jar is picked up. It appears to be a problem with parsing the command line, introduced by the change in the Windows scripts between 1.2.0 and 1.3.0. A simple fix to bin\windows-utils.cmd resolves the issue. I ran this command to test with 'real' jars: {code} %SPARK_HOME%\bin\spark-shell --master local --jars c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar {code} Here are some snippets from the console - note that only the 1st jar is added; I can load classes from the 1st jar but not the 2nd: {code} 15/03/23 10:57:41 INFO SparkUI: Started SparkUI at http://vgarla-t440P.fritz.box:4040 15/03/23 10:57:41 INFO SparkContext: Added JAR file:/c:/code/elasticsearch-1.4.2/lib/lucene-core-4.10.2.jar at http://192.168.178.41:54601/jars/lucene-core-4.10.2.jar with timestamp 1427104661969 15/03/23 10:57:42 INFO Executor: Starting executor ID on host localhost ... 
scala> import org.apache.lucene.util.IOUtils import org.apache.lucene.util.IOUtils scala> import com.google.common.base.Strings :20: error: object Strings is not a member of package com.google.common.base {code} Looking at the command line in jvisualvm, I see that only the 1st jar is added: {code} Main class: org.apache.spark.deploy.SparkSubmit Arguments: --class org.apache.spark.repl.Main --master local --jars c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar spark-shell c:\temp\guava-14.0.1.jar {code} In Spark 1.2.0, spark-shell2.cmd just passed arguments "as is" to the java command line: {code} cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %* spark-shell {code} In Spark 1.3.0, spark-shell2.cmd calls windows-utils.cmd to parse arguments into SUBMISSION_OPTS and APPLICATION_OPTS. Only the first jar in the list passed to --jars makes it into SUBMISSION_OPTS; the later jars are added to APPLICATION_OPTS: {code} call %SPARK_HOME%\bin\windows-utils.cmd %* if %ERRORLEVEL% equ 1 ( call :usage exit /b 1 ) echo SUBMISSION_OPTS=%SUBMISSION_OPTS% echo APPLICATION_OPTS=%APPLICATION_OPTS% cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %SUBMISSION_OPTS% spark-shell %APPLICATION_OPTS% {code} The problem is that by the time the command-line arguments get to windows-utils.cmd, the Windows command-line processor has split the comma-separated list into distinct arguments. The Windows way of saying "treat this as a single arg" is to surround it in double quotes. However, when I surround the jars in quotes, I get an error: {code} %SPARK_HOME%\bin\spark-shell --master local --jars "c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar" c:\temp\guava-14.0.1.jar""=="x" was unexpected at this time. 
{code} Digging in, I see this is caused by this line from windows-utils.cmd: {code} if "x%2"=="x" ( {code} Replacing the quotes with square brackets does the trick: {code} if [x%2]==[x] ( {code} Now the command line is processed correctly.
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376501#comment-14376501 ] vijay commented on SPARK-6435: -- I came up with square brackets after 2 minutes of googling/stackoverflowing; a more thorough search/understanding of bat scripts might result in a better/different solution (I can rule myself out of the more thorough bat script understanding). That being said, this test is used to check for an empty string, and square brackets are the most upvoted solution: http://stackoverflow.com/questions/2541767/what-is-the-proper-way-to-test-if-variable-is-empty-in-a-batch-file-if-not-1
[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383566#comment-14383566 ] vijay commented on SPARK-6435: -- Strange - when I test it with multiple jars (with the fixed script), everything works.
[jira] [Comment Edited] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383566#comment-14383566 ] vijay edited comment on SPARK-6435 at 3/27/15 9:17 AM: --- Strange - when I test it with multiple jars (with the fixed script), everything works. Something has changed in some other script with respect to the released 1.3.0. was (Author: vjapache): Strange - when I test it with multiple jars (with the fixed script) everything works
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296775#comment-14296775 ] vijay commented on SPARK-2356: -- This is how I worked around this in Windows: * Download and extract https://codeload.github.com/srccodes/hadoop-common-2.2.0-bin/zip/master * Modify bin\spark-class2.cmd and add the hadoop.home.dir system property: {code} if not [%SPARK_SUBMIT_BOOTSTRAP_DRIVER%] == [] ( set SPARK_CLASS=1 "%RUNNER%" -Dhadoop.home.dir=C:\code\hadoop-common-2.2.0-bin-master org.apache.spark.deploy.SparkSubmitDriverBootstrapper %BOOTSTRAP_ARGS% ) else ( "%RUNNER%" -Dhadoop.home.dir=C:\code\hadoop-common-2.2.0-bin-master -cp "%CLASSPATH%" %JAVA_OPTS% %* ) {code} That being said, this is a workaround for what I consider a critical bug (if spark indeed is meant to support windows). > Exception: Could not locate executable null\bin\winutils.exe in the Hadoop > --- > > Key: SPARK-2356 > URL: https://issues.apache.org/jira/browse/SPARK-2356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Kostiantyn Kudriavtsev >Priority: Critical > > I'm trying to run some transformation on Spark, it works fine on cluster > (YARN, linux machines). However, when I'm trying to run it on local machine > (Windows 7) under unit test, I got errors (I don't use Hadoop, I'm read file > from local filesystem): > {code} > 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. 
> 	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
> 	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
> 	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
> 	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
> 	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
> 	at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
> 	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
> 	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
> 	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
> 	at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
> 	at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
> 	at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
> 	at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized every time a Spark context is created, regardless of whether Hadoop is required.
> I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually)
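For code run from a unit test (as in the report above), the same hadoop.home.dir property can be set programmatically instead of editing spark-class2.cmd, as long as it happens before any Hadoop class is loaded. A minimal sketch; the class name and the extraction path are assumptions, not anything Spark provides:

```java
public class WinutilsWorkaround {
    public static void main(String[] args) {
        // Must run before the first Hadoop class is touched: Shell reads
        // hadoop.home.dir in its static initializer, so setting it later
        // has no effect. The directory is assumed to contain bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\code\\hadoop-common-2.2.0-bin-master");

        // ... create the SparkContext / run the test afterwards ...
        System.out.println("hadoop.home.dir = " + System.getProperty("hadoop.home.dir"));
    }
}
```

Setting the HADOOP_HOME environment variable to the same directory before launching the JVM achieves the same thing without code changes.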
[jira] [Created] (SPARK-5481) JdbcRDD requires JDBC 4 APIs, limiting compatible JDBC Drivers
vijay created SPARK-5481:
--
             Summary: JdbcRDD requires JDBC 4 APIs, limiting compatible JDBC Drivers
                 Key: SPARK-5481
                 URL: https://issues.apache.org/jira/browse/SPARK-5481
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: vijay

JdbcRDD makes unnecessary use of JDBC 4 APIs. To maintain broad JDBC driver support, Spark should support JDBC 3.
The issue is calling isClosed() prior to closing a JDBC object. isClosed() is part of JDBC 4. It is perfectly safe to close something that is already closed - this may throw an exception (which is caught) but has no negative side effects.
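The JDBC 3-safe cleanup described above can skip the isClosed() probe entirely and just attempt the close. A sketch of the idea; the helper name is hypothetical, not Spark's actual code, and AutoCloseable is used for brevity (pre-Java 7 JDBC types would need per-type overloads for Connection, Statement, and ResultSet):

```java
final class JdbcCloser {
    // Closing without first calling isClosed() keeps JDBC 3 drivers working:
    // isClosed() is a JDBC 4 method, and invoking it against a driver built
    // for JDBC 3 raises java.lang.AbstractMethodError. Attempting to close an
    // already-closed object at worst throws an exception, which we swallow.
    static void closeQuietly(AutoCloseable resource) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception e) {
            // Ignore: the resource was already closed or is unusable anyway.
        }
    }
}
```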
[jira] [Commented] (SPARK-5481) JdbcRDD requires JDBC 4 APIs, limiting compatible JDBC Drivers
[ https://issues.apache.org/jira/browse/SPARK-5481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296789#comment-14296789 ] vijay commented on SPARK-5481:
--
JDBC 4 is an API. Drivers implement the API, or parts thereof. You can use JDBC 3-compliant drivers in Java 6; calls to JDBC 4 functions against such drivers cause java.lang.AbstractMethodError exceptions. Spark isn't doing anything fancy that requires any of the JDBC 4 features; the only JDBC 4 call, AFAICT, is isClosed(), which as mentioned above is superfluous.
[jira] [Commented] (SPARK-5481) JdbcRDD requires JDBC 4 APIs, limiting compatible JDBC Drivers
[ https://issues.apache.org/jira/browse/SPARK-5481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296838#comment-14296838 ] vijay commented on SPARK-5481:
--
Legacy databases that have tons of data and are still in use, e.g. DB2 v9.1 or lower: http://www-01.ibm.com/support/docview.wss?uid=swg21363866
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583290#comment-16583290 ] Vijay commented on SPARK-6305:
--
Hello, I have a question and need help. I am using Spark 2.x. My spark-submit application has the Log4j 2 jars shaded as part of the build, and the log4j.xml is placed in the resources folder. Can I write logs created using the Log4j 2 API to a new file? Could you please tell me what I need to do to make it work? Thanks

> Add support for log4j 2.x to Spark
> ----------------------------------
>
>                 Key: SPARK-6305
>                 URL: https://issues.apache.org/jira/browse/SPARK-6305
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>            Reporter: Tal Sliwowicz
>            Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the classpath. Since there are shaded jars, it must be done during the build.
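One likely snag in the setup described in the comment above: by default Log4j 2 looks for a file named log4j2.xml on the classpath, not log4j.xml, so a log4j.xml in the resources folder is silently ignored. A sketch of a log4j2.xml that routes the application's own loggers to a separate file; the file path and logger name are placeholders, not anything Spark mandates:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical log4j2.xml: sends loggers under com.mycompany.myapp to
     their own file, separate from whatever Spark itself logs. -->
<Configuration status="WARN">
  <Appenders>
    <File name="AppFile" fileName="logs/myapp.log">
      <PatternLayout pattern="%d{ISO8601} %-5level %logger{36} - %msg%n"/>
    </File>
  </Appenders>
  <Loggers>
    <!-- additivity=false keeps these events out of the root logger's appenders -->
    <Logger name="com.mycompany.myapp" level="info" additivity="false">
      <AppenderRef ref="AppFile"/>
    </Logger>
    <Root level="warn">
      <AppenderRef ref="AppFile"/>
    </Root>
  </Loggers>
</Configuration>
```

Note also that if the shade plugin relocates the Log4j 2 packages, the configuration file lookup and any slf4j binding must match the relocated names.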