Re: Read Parquet file from scala directly
Here's a Java version: https://github.com/cloudera/parquet-examples/tree/master/MapReduce It won't be that hard to port it to Scala. Thanks Best Regards On Mon, Mar 9, 2015 at 9:55 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I have a lot of Parquet files, and I want to open them directly instead of loading them into an RDD in the driver (so I can optimize some performance through special logic). But I did some research online and can't find any example of accessing Parquet directly from Scala. Has anyone done this before? Regards, Shuai
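If it helps, here is a minimal, hedged sketch of reading Parquet records directly from Scala with the parquet-avro reader (no RDD or SparkContext involved). It assumes parquet-avro and its Hadoop dependencies are on the classpath; the file path and the "value" column name are placeholders only.

import org.apache.hadoop.fs.Path
import org.apache.avro.generic.GenericRecord
import parquet.avro.AvroParquetReader   // the package is org.apache.parquet.avro in newer parquet-mr releases

object DirectParquetRead {
  def main(args: Array[String]): Unit = {
    // Open one Parquet file and iterate its records directly on the JVM
    val reader = new AvroParquetReader[GenericRecord](new Path("hdfs:///data/part-00000.parquet"))
    var record = reader.read()
    while (record != null) {
      println(record.get("value"))   // "value" is a hypothetical column name
      record = reader.read()
    }
    reader.close()
  }
}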
Re: saveAsTextFile extremely slow near finish
Don't you think 1000 partitions is too few for 160GB of data? You could also try using the KryoSerializer and enabling RDD compression. Thanks Best Regards On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x m...@spokeo.com wrote: I'm basically running a sort using Spark. The Spark program reads from HDFS, sorts on composite keys, and then saves the partitioned result back to HDFS. Pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values values.saveAsTextFile Input size is ~160G, and I specified 1000 partitions in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the job is split into two stages: saveAsTextFile and mapToPair. MapToPair finished in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress and the last few tasks just take forever and never finish. Cluster setup: 8 nodes, on each node: 15gb memory, 8 cores. Running parameters: --executor-memory 12G --conf spark.cores.max=60 Thank you for any help.
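For reference, a small sketch of the two settings suggested above (Kryo serialization and RDD compression). The app name and registered class are placeholders for whatever the job actually shuffles.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-sort")
  // use Kryo instead of Java serialization for shuffled/cached data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // compress serialized RDD partitions (helps when caching or spilling)
  .set("spark.rdd.compress", "true")
  // optional: register the classes that actually flow through the shuffle
  .registerKryoClasses(Array(classOf[String]))

val sc = new SparkContext(conf)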
Re: Spark-on-YARN architecture
Thanks for the quick reply. I am running the application in YARN client mode, and I want to run the AM on the same node as the RM in order to free up the node which would otherwise run the AM. How can I get the AM to run on the same node as the RM? On Tue, Mar 10, 2015 at 3:49 PM, Sean Owen so...@cloudera.com wrote: In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes, you could force your AM somehow to run on the same node as the RM, but why -- what do you think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have a Spark cluster set up on YARN with 4 nodes (1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is carried out on only two slaves. This decreases the performance of the cluster. 1. Is this the correct configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run the Spark master, YARN application master and resource manager on a single node? (so that I can use the three other nodes for the computation) Thanks Harika
Registering custom UDAFs with HiveContext in SparkSQL, how?
Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
Re: Spark-on-YARN architecture
I suppose you just provision enough resource to run both on that node... but it really shouldn't matter. The RM and your AM aren't communicating heavily. On Tue, Mar 10, 2015 at 10:23 AM, Harika Matha matha.har...@gmail.com wrote: Thanks for the quick reply. I am running the application in YARN client mode. And I want to run the AM on the same node as RM inorder use the node which otherwise would run AM. How can I get AM run on the same node as RM? On Tue, Mar 10, 2015 at 3:49 PM, Sean Owen so...@cloudera.com wrote: In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes you could force your AM somehow to run on the same node as the RM, but why -- what do think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[SparkSQL] Reuse HiveContext to different Hive warehouse?
I'm using Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance to different warehouse locations. Below are the steps for my test (Assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w) scala sqlContext.sql(SELECT * from src).saveAsTable(table1) scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w2) scala sqlContext.sql(SELECT * from src).saveAsTable(table2) == After these steps, the tables are stored in /test/w only. I expect table2 to be stored in /test/w2 folder. Another question is: if I set hive.metastore.warehouse.dir to a HDFS folder, I cannot use saveAsTable()? Is this by design? Exception stack trace is below: == 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74 java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.jav a:118) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a pply(newParquet.scala:252) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a pply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc ala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc ala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newP arquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:37 0) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca la:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca la:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.ru n(commands.scala:217) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompu te(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands .scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65 ) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLConte xt.scala:1088) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:10 88) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC.init(console:35) at $iwC.init(console:37) at init(console:39) Thank you very much!
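Until the warehouse-dir behaviour is sorted out, one hedged workaround (assuming Spark 1.3 and an HDFS path you control; paths below are placeholders) is to write each result to an explicit location instead of relying on hive.metastore.warehouse.dir:

// Write the query result as Parquet to an explicit folder rather than the
// metastore-managed warehouse location.
val result = sqlContext.sql("SELECT * FROM src")
result.saveAsParquetFile("hdfs://server:8020/test/w2/table2")

// It can then be registered back as a temp table for further queries.
sqlContext.parquetFile("hdfs://server:8020/test/w2/table2").registerTempTable("table2")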
Spark-on-YARN architecture
Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: saveAsTextFile extremely slow near finish
This is more of an aside, but why repartition this data instead of letting it define partitions naturally? You will end up with a similar number. On Mar 9, 2015 5:32 PM, mingweili0x m...@spokeo.com wrote: I'm basically running a sorting using spark. The spark program will read from HDFS, sort on composite keys, and then save the partitioned result back to HDFS. pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values values.saveAsTextFile Input size is ~ 160G, and I made 1000 partitions specified in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is splitted into two stages: saveAsTextFile and mapToPair. MapToPair finished in 8 mins. While saveAsTextFile took ~15mins to reach (2366/2373) progress and the last few jobs just took forever and never finishes. Cluster setup: 8 nodes on each node: 15gb memory, 8 cores running parameters: --executor-memory 12G --conf spark.cores.max=60 Thank you for any help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark-on-YARN architecture
In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes you could force your AM somehow to run on the same node as the RM, but why -- what do think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: error on training with logistic regression sgd
Hi, Can anyone give an idea about this? Just did some google search, it seems related to the 2gb limitation on block size, https://issues.apache.org/jira/browse/SPARK-1476. The whole process is that: 1. load the data 2. convert each line of data into labeled points using some feature hashing algorithm in python. 3. train a logistic regression model with the converted labeled points. Can any one give some advice for how to avoid the 2gb, if this is the cause? Thanks very much for the help. Best, Peng On Mon, Mar 9, 2015 at 3:54 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi, I was launching a spark cluster with 4 work nodes, each work nodes contains 8 cores and 56gb ram, and I was testing my logistic regression problem. The training set is around 1.2 million records.When I was using 2**10 (1024) features, the whole program works fine, but when I use 2**14 features, the program has encountered the error: Py4JJavaError: An error occurred while calling o84.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 9, workernode0.sparkexperience4a7.d5.internal.cloudapp.net): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93) at
RE: sc.textFile() on windows cannot access UNC path
I think the work around is clear. Using JDK 7, and implement your own saveAsRemoteWinText() using java.nio.path. Yong From: ningjun.w...@lexisnexis.com To: java8...@hotmail.com; user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path Date: Tue, 10 Mar 2015 03:02:37 + Hi Yong Thanks for the reply. Yes it works with local drive letter. But I really need to use UNC path because the path is input from at runtime. I cannot dynamically assign a drive letter to arbitrary UNC path at runtime. Is there any work around that I can use UNC path for sc.textFile(…)? Ningjun From: java8964 [mailto:java8...@hotmail.com] Sent: Monday, March 09, 2015 5:33 PM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path This is a Java problem, not really Spark. From this page: http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u You can see that using Java.nio.* on JDK 7, it will fix this issue. But Path class in Hadoop will use java.io.*, instead of java.nio. You need to manually mount your windows remote share a local driver, like Z:, then it should work. Yong From: ningjun.w...@lexisnexis.com To: user@spark.apache.org Subject: sc.textFile() on windows cannot access UNC path Date: Mon, 9 Mar 2015 21:09:38 + I am running Spark on windows 2008 R2. I use sc.textFile() to load text file using UNC path, it does not work. sc.textFile(rawfile:10.196.119.230/folder1/abc.txt, 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/10.196.119.230/tar/Enron/enron-207-short.load at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.count(RDD.scala:910) at ltn.analytics.tests.IndexTest$$anonfun$3.apply$mcV$sp(IndexTest.scala:104) at ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103) at ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at
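A hedged sketch of the java.nio workaround mentioned above: read the UNC share on the driver (java.nio understands UNC paths on JDK 7+) and then parallelize the lines. This only fits files small enough for driver memory, and the UNC path shown is a placeholder.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Read the remote Windows share with java.nio, then hand the lines to Spark from the driver.
val uncPath = Paths.get("""\\10.196.119.230\folder1\abc.txt""")
val lines = Files.readAllLines(uncPath, StandardCharsets.UTF_8).asScala

val rdd = sc.parallelize(lines, 4)
println(rdd.count())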
RE: Does anyone know how to deploy a custom UDAF jar file in SparkSQL?
You can add the additional jar when submitting your job, something like: ./bin/spark-submit --jars xx.jar … More options can be listed by just typing ./bin/spark-submit From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 8:48 PM To: user@spark.apache.org Subject: Does any one know how to deploy a custom UDAF jar file in SparkSQL? Hi, Does any one know how to deploy a custom UDAF jar file in SparkSQL? Where should i put the jar file so SparkSQL can pick it up and make it accessible for SparkSQL applications? I do not use spark-shell instead I want to use it in an spark application. best, /Shahab
RE: Registering custom UDAFs with HiveContext in SparkSQL, how?
Currently, Spark SQL doesn’t provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF interface soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
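Since HiveContext passes Hive DDL through, one hedged way to register a Hive-style UDAF from application code (rather than from spark-shell) looks like the sketch below. The class name and jar path are placeholders, the UDAF must follow Hive's UDAF/GenericUDAFResolver contract, and the jar itself still has to be reachable by the session (e.g. via ADD JAR or spark-submit --jars), so this only avoids manual registration, not jar deployment.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Make the jar visible to the session, then register the Hive UDAF by class name.
hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udaf AS 'com.example.MyUdaf'")

// It can now be used like any built-in aggregate.
hiveContext.sql("SELECT key, my_udaf(value) FROM src GROUP BY key").collect()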
Does anyone know how to deploy a custom UDAF jar file in SparkSQL?
Hi, Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where should I put the jar file so SparkSQL can pick it up and make it accessible to SparkSQL applications? I do not use spark-shell; instead I want to use it in a Spark application. best, /Shahab
Re: FW: RE: distribution of receivers in spark streaming
Thanks TD and Jerry for suggestions. I have done some experiments and worked out a reasonable solution to the problem of spreading receivers to a set of worker hosts. It would be a bit too tedious to document in email. So I discuss the solution in a blog: http://scala4fun.tumblr.com/post/113172936582/how-to-distribute-receivers-over-worker-hosts-in Please be free to give me feedback if you see any issue. Thanks,Du On Friday, March 6, 2015 4:10 PM, Tathagata Das t...@databricks.com wrote: Aaah, good point, about the same node. All right. Can you post this on the user mailing list for future reference to the community? Might be a good idea to post both methods with pros and cons, as different users may have different constraints. :)Thanks :) TD On Fri, Mar 6, 2015 at 4:07 PM, Du Li l...@yahoo-inc.com wrote: Yes but the caveat may not exist if we do this when the streaming app is launched, since we're trying the start receivers before any other tasks. Discovered in this way will only include the worker hosts. By using the API we may need some extra efforts to single out worker hosts. Sometimes we may run master and worker daemons on the same host while some other times we don't, depending on configuration. Du On Friday, March 6, 2015 3:59 PM, Tathagata Das t...@databricks.com wrote: That can definitely be done. In fact I have done that. Just one caveat. If one of the executors is fully occupied with a previous very-long job, then these fake tasks may not capture that worker even if there are lots of tasks. The executor storage status will work for sure, as long as you can filter out the master. TD On Fri, Mar 6, 2015 at 3:49 PM, Du Li l...@yahoo-inc.com wrote: Hi TD, Thanks for your response and the information. I just tried out the SparkContext getExecutorMemoryStatus and getExecutorStorageStatus methods. Due to their purposes, they do not differentiate master and worker nodes. However, for performance of my app, I prefer to distribute receivers only to the worker nodes. This morning I worked out another solution: Create an accumulator and a fake workload, parallelize the workload with a high level of parallelism which does nothing but adds its hostname to the accumulator. Repeat this until the accumulator value stops growing. In the end I get the set of worker hostnames. It worked pretty well! Thanks,Du On Friday, March 6, 2015 3:11 PM, Tathagata Das t...@databricks.com wrote: What Saisai said is correct. There is no good API. However there are jacky ways of finding out the current workers. See sparkContext.getExecutorStorageStatus() and you can get the host names of the current executors. You could use those. TD On Thu, Mar 5, 2015 at 6:55 AM, Du Li l...@yahoo-inc.com wrote: | Hi TD. Do you have any suggestion? Thanks /Du Sent from Yahoo Mail for iPhone Begin Forwarded Message From: Shao, Saisai'saisai.s...@intel.com' Date: Mar 4, 2015, 10:35:44 PM To: Du Li'l...@yahoo-inc.com', User'user@spark.apache.org' Subject: RE: distribution of receivers in spark streaming Yes, hostname is enough. I think currently it is hard for user code to get the worker list from standalone master. If you can get the Master object, you could get the worker list, but AFAIK may be it is difficult to get this object. All you could do is to manually get the worker list and assigned its hostname to each receiver. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com] Sent: Thursday, March 5, 2015 2:29 PM To: Shao, Saisai; User Subject: Re: distribution of receivers in spark streaming Hi Jerry, Thanks for your response. 
Is there a way to get the list of currently registered/live workers? Even in order to provide preferredLocation, it would be safer to know which workers are active. Guess I only need to provide the hostname, right? Thanks, Du On Wednesday, March 4, 2015 10:08 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Du, You could try to sleep for several seconds after creating streaming context to let all the executors registered, then all the receivers can distribute to the nodes more evenly. Also setting locality is another way as you mentioned. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, March 5, 2015 1:50 PM To: User Subject: Re: distribution of receivers in spark streaming Figured it out: I need to override method preferredLocation() in MyReceiver class. On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like the following, as suggested in the spark docs: val streams = (1 to numReceivers).map(_ = ssc.receiverStream(new MyKafkaReceiver())) ssc.union(streams) However, from the spark UI, I saw that some machines are not running any instance of the receiver while some get three. The
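For reference, a hedged sketch of the "fake workload" idea discussed above: run many trivial tasks to collect the executor hostnames, then spread the receivers over those hosts by overriding preferredLocation. MyKafkaReceiver stands in for the user's existing receiver class and is hypothetical here.

import java.net.InetAddress

// Run a cheap, highly parallel job and record which hosts executed tasks.
// Repeating it (or raising the parallelism) makes it more likely every worker is seen.
val workerHosts = sc.parallelize(1 to 1000, 1000)
  .map(_ => InetAddress.getLocalHost.getHostName)
  .distinct()
  .collect()

// A receiver that pins itself to a given host via preferredLocation.
class LocatedKafkaReceiver(host: String) extends MyKafkaReceiver {   // MyKafkaReceiver is hypothetical
  override def preferredLocation: Option[String] = Some(host)
}

// Round-robin the receivers over the discovered worker hosts.
val numReceivers = 8
val streams = (0 until numReceivers).map { i =>
  ssc.receiverStream(new LocatedKafkaReceiver(workerHosts(i % workerHosts.length)))
}
ssc.union(streams)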
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
Hi, On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide in using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference between the two approaches, besides using different syntax? Are they interchangeable? Which one has better performance? One difference is that the language integrated queries are method calls on the SchemaRDD you want to work on, which requires you have access to the object at hand. The SQL queries are passed to a method of the SQLContext and you have to call registerTempTable() on the SchemaRDD you want to use beforehand, which can basically happen at an arbitrary location of your program. (I don't know if I could express what I wanted to say.) That may have an influence on how you design your program and how the different parts work together. Tobias
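A small hedged sketch of the two styles side by side (Spark 1.2-era SchemaRDD API; exact imports can vary slightly by version, and the Person case class and data are made up):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // brings in createSchemaRDD and the symbol-based DSL implicits

val people = sc.parallelize(Seq(Person("Ann", 25), Person("Bob", 14)))

// 1) Language-integrated query: method calls directly on the SchemaRDD
val adultsDsl = people.where('age >= 18).select('name)

// 2) SQL: register a temp table first, then query it from anywhere the context is visible
people.registerTempTable("people")
val adultsSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

adultsDsl.collect().foreach(println)
adultsSql.collect().foreach(println)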
Re: Why spark master consumes 100% CPU when we kill a spark streaming app?
Probably the cleanup work, like cleaning shuffle files and tmp files, costs too much CPU: if we run Spark Streaming for a long time, lots of files are generated, so cleaning up these files before the app exits can be time-consuming. Thanks Jerry 2015-03-11 10:43 GMT+08:00 Tathagata Das t...@databricks.com: Do you have event logging enabled? That could be the problem. The Master tries to aggressively recreate the web UI of the completed job with the event logs (when it is enabled), causing the Master to stall. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6270 On Tue, Mar 10, 2015 at 7:10 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Has anyone had a similar problem before?
Re: Solve least square problem of the form min norm(Ax - b)^2 + lambda * n * norm(x)^2 ?
I'm trying to play with the implementation of least square solver (Ax = b) in mlmatrix.TSQR where A is a 5*1024 matrix and b a 5*10 matrix. It works but I notice that it's 8 times slower than the implementation given in the latest ampcamp : http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html . As far as I know these two implementations come from the same basis. What is the difference between these two codes ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala [2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao
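Aside from the distributed solvers being compared, the objective in the subject line has a closed-form normal-equations solution; a hedged local sketch with Breeze (sizes and data are illustrative only) may help clarify what is being minimised:

import breeze.linalg.{DenseMatrix, DenseVector, diag}

// Solve min ||A x - b||^2 + lambda * n * ||x||^2 via the normal equations:
//   (A^T A + lambda * n * I) x = A^T b
def ridgeSolve(a: DenseMatrix[Double], b: DenseVector[Double], lambda: Double): DenseVector[Double] = {
  val n = a.rows.toDouble
  val reg = diag(DenseVector.fill(a.cols)(lambda * n))
  val gram = (a.t * a) + reg
  gram \ (a.t * b)
}

// Tiny made-up example
val a = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val b = DenseVector(1.0, 2.0, 3.0)
println(ridgeSolve(a, b, lambda = 0.1))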
Re: Setting up Spark with YARN on EC2 cluster
Hi Harika, Did you get any solution for this? I want to use yarn , but the spark-ec2 script does not support it. Thanks -Roni -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818p21991.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ANSI Standard Supported by the Spark-SQL
Spark SQL supports a subset of HiveQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote: From the archives in this user list, It seems that Spark-SQL is yet to achieve SQL 92 level. But there are few things still not clear. 1. This is from an old post dated : Aug 09, 2014. 2. It clearly says that it doesn't support DDL and DML operations. Does that means, all reads (select) are sql 92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to spark and trying to understand what SQL Standard is supported by the Spark. I googled around a lot but didn't get clear answer. Some where I saw that Spark supports sql-92 and at other location I found that spark is not fully compliant with sql-92. I also noticed that using Hive Context I can execute almost all the queries supported by the Hive, so does that means that Spark is equivalent to Hive in terms of sql standards, given that HiveContext is used for querying. Please suggest. Regards, Ravi
Re: Compilation error
If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Registering custom UDAFs with HiveContext in SparkSQL, how?
Thanks Hao, but my question concerns UDAFs (user-defined aggregation functions), not UDTFs (user-defined table-generating functions). I would appreciate it if you could point me to some starting point for UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn’t provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF interface soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
Re: Compilation error
You have to include the Scala libraries in the Eclipse dependencies. TD On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out the streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get a compilation error, and eclipse is not able to recognize Tuple2. I also don't see any import for the scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStream<String> lines) { JavaDStream<String> words = lines.flatMap( new FlatMapFunction<String, String>() { @Override public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); } }); // Count each word in each batch JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } }); }
Re: Can't cache RDD of collaborative filtering on MLlib
Thank you for your reply. 1. Which version of Spark do you use now? I use Spark 1.2.0 (CDH 5.3.1). 2. Why don't you check with the Web UI whether `productJavaRDD` and `userJavaRDD` are cached or not? I checked the Spark UI. The task was stopped at 1/2 (Succeeded/Total tasks). Here is the stacktrace: org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:840) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendUsers(MatrixFactorizationModel.scala:126) ... 3. You can check whether an RDD is cached or not with the `getStorageLevel()` method, such as `model.userFeatures.getStorageLevel()`. I printed the return value of getStorageLevel() for userFeatures and productFeatures; both were Memory Deserialized 1x Replicated. I think the two variables were configured to be cached, but were not actually cached at that time (delayed?). Thanks, Yuichiro Sakamoto
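One thing that may help when checking this: persist/cache is lazy, so the feature RDDs are only materialised after an action runs over them. A hedged sketch (Scala API; model is assumed to be an already-trained MatrixFactorizationModel):

import org.apache.spark.storage.StorageLevel

// persist() only marks the RDDs; nothing is stored until an action touches them.
model.userFeatures.persist(StorageLevel.MEMORY_ONLY)
model.productFeatures.persist(StorageLevel.MEMORY_ONLY)

// Force materialisation once, so later calls (e.g. recommendUsers) hit the cache.
model.userFeatures.count()
model.productFeatures.count()

println(model.userFeatures.getStorageLevel)   // should now report the cached storage level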
Re: Compilation error
How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Why spark master consumes 100% CPU when we kill a spark streaming app?
Do you have event logging enabled? That could be the problem. The Master tries to aggressively recreate the web ui of the completed job with the event logs (when it is enabled) causing the Master to stall. I created a JIRA for this. https://issues.apache.org/jira/browse/SPARK-6270 On Tue, Mar 10, 2015 at 7:10 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hey, Recently, we found in our cluster, that when we kill a spark streaming app, the whole cluster cannot response for 10 minutes. And, we investigate the master node, and found the master process consumes 100% CPU when we kill the spark streaming app. How could it happen? Did anyone had the similar problem before?
Re: Is it possible to use windows service to start and stop spark standalone cluster
Have you tried Apache Commons Daemon? http://commons.apache.org/proper/commons-daemon/procrun.html From: Wang, Ningjun (LNG-NPV) Date: Tuesday, March 10, 2015 at 11:47 PM To: user@spark.apache.org Subject: Is it possible to use windows service to start and stop spark standalone cluster We are using a Spark standalone cluster on Windows 2008 R2. I can start the Spark cluster by opening a command prompt and running the following: bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop the Spark cluster by pressing Ctrl-C. The problem is that if the machine is rebooted, I have to manually start the Spark cluster again as above. Is it possible to use a Windows service to start the cluster? This way, when the machine is rebooted, the Windows service will automatically restart the Spark cluster. How to stop the Spark cluster using a Windows service is also a challenge. Please advise. Thanks Ningjun
S3 SubFolder Write Issues
Hi All, I am hoping someone has seen this issue before with S3, as I haven't been able to find a solution for this problem. When I try to save as a text file to S3 into a subfolder, it only ever writes out to the bucket-level folder and produces block-level generated file names, not the output folder I specified. Below is the sample code in Scala; I have also seen this behavior in the Java code. val out = inputRdd.map { ir => mapFunction(ir) }.groupByKey().mapValues { x => mapValuesFunction(x) }.saveAsTextFile("s3://BUCKET/SUB_FOLDER/output") Any ideas on how to get saveAsTextFile to write to an S3 subfolder? Thanks, Chris
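One hedged thing to check: the old Hadoop s3:// block filesystem stores raw blocks at the bucket level, while the native s3n:// connector writes real objects under the key prefix, which matches the "block level file names" symptom. A sketch, reusing the code above with placeholder bucket and credentials:

// Configure the native S3 filesystem credentials (placeholders).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val out = inputRdd.map(ir => mapFunction(ir))
  .groupByKey()
  .mapValues(x => mapValuesFunction(x))

// With s3n:// the "folder" is just a key prefix, so part files land under SUB_FOLDER/output/.
out.saveAsTextFile("s3n://BUCKET/SUB_FOLDER/output")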
Re: SQL with Spark Streaming
Hi, On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote: Intel has a prototype for doing this; SaiSai and Jason are the authors. Probably you can ask them for some materials. The github repository is here: https://github.com/intel-spark/stream-sql Also, what I did is write a wrapper class SchemaDStream that internally holds a DStream[Row] and a DStream[StructType] (the latter having just one element in every RDD) and then allows you to do: - operations SchemaRDD => SchemaRDD using `rowStream.transformWith(schemaStream, ...)` - in particular, you can register this stream's data as a table this way - and, via a companion object with a method `fromSQL(sql: String): SchemaDStream`, you can get a new stream from previously registered tables. However, you are limited to batch-internal operations, i.e., you can't aggregate across batches. I am not able to share the code at the moment, but will within the next months. It is not very advanced code, though, and should be easy to replicate. Also, I have no idea about the performance of transformWith. Tobias
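For anyone who only needs per-batch SQL without a dedicated streaming-SQL layer, a hedged sketch of the usual foreachRDD pattern (Spark 1.2-era API; Word is a made-up case class, and aggregation is per batch only):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Word(text: String)

def runBatchSql(sqlContext: SQLContext, stream: DStream[String]): Unit = {
  import sqlContext.createSchemaRDD   // implicit RDD[Product] => SchemaRDD

  stream.foreachRDD { rdd =>
    // Register this batch as a temp table and query it with plain SQL.
    rdd.map(Word(_)).registerTempTable("words")
    val counts = sqlContext.sql("SELECT text, COUNT(*) FROM words GROUP BY text")
    counts.collect().foreach(println)
  }
}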
Re: ANSI Standard Supported by the Spark-SQL
Thanks Michael, That helps. So just to summarise that we should not make any assumption about Spark being fully compliant with any SQL Standards until announced by the community and maintain the same status quo as you have suggested. Regards, Ravi. On Tue, Mar 10, 2015 at 11:14 PM Michael Armbrust mich...@databricks.com wrote: Spark SQL supports a subset of HiveQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote: From the archives in this user list, It seems that Spark-SQL is yet to achieve SQL 92 level. But there are few things still not clear. 1. This is from an old post dated : Aug 09, 2014. 2. It clearly says that it doesn't support DDL and DML operations. Does that means, all reads (select) are sql 92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to spark and trying to understand what SQL Standard is supported by the Spark. I googled around a lot but didn't get clear answer. Some where I saw that Spark supports sql-92 and at other location I found that spark is not fully compliant with sql-92. I also noticed that using Hive Context I can execute almost all the queries supported by the Hive, so does that means that Spark is equivalent to Hive in terms of sql standards, given that HiveContext is used for querying. Please suggest. Regards, Ravi
Is it possible to use windows service to start and stop spark standalone cluster
We are using a Spark standalone cluster on Windows 2008 R2. I can start the Spark cluster by opening a command prompt and running the following: bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop the Spark cluster by pressing Ctrl-C. The problem is that if the machine is rebooted, I have to manually start the Spark cluster again as above. Is it possible to use a Windows service to start the cluster? This way, when the machine is rebooted, the Windows service will automatically restart the Spark cluster. How to stop the Spark cluster using a Windows service is also a challenge. Please advise. Thanks Ningjun
Re: sparse vector operations in Python
There isn't a great way currently. The best option is probably to convert to scipy.sparse column vectors and add using scipy. Joseph On Mon, Mar 9, 2015 at 4:21 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Why spark master consumes 100% CPU when we kill a spark streaming app?
Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Has anyone had a similar problem before?
Pyspark not using all cores
Hi All, I need some help with a problem in pyspark which is causing a major issue. Recently I've noticed that the behaviour of the python.deamons on the worker nodes for compute-intensive tasks have changed from using all the avaliable cores to using only a single core. On each worker node, 8 python.deamons exist but they all seem to run on a single core. The remaining 7 cores idle. Our hardware consists of 9 hosts (1x driver node and 8x worker nodes) each with 8 cores and 64gb RAM - we are using Spark 1.2.0-SNAPSHOT (Python) in standalone mode via Cloudera 5.3.2. To give a better understanding of the problem I have made a quick script from one of the given examples which replicates the problem: When I run the calculating_pies.py script using the command spark-submit calculate_pies.py this is what I typically see on all my worker nodes: 1 [|||100.0%] 5 [||| 3.9%] 2 [ 0.0%] 6 [||| 2.6%] 3 [|| 1.3%] 7 [|8.4%] 4 [|| 1.3%] 8 [|| 8.7%] PID USER PRI NI VIRT RESSHR S CPU% MEM% TIME+ Command 30672 spark20 0 225M 112M 1156 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30681 spark20 0 225M 112M 1152 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30687 spark20 0 225M 112M 1152 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30678 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon 30693 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.08 python -m pyspark.daemon 30674 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon 30688 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.08 python -m pyspark.daemon 30684 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon Through the spark UI I do see 8 executor ids with 8 active tasks on each. I also see the same behaviour if I use the flag --total-executor-cores 64 in spark-submit. Strangly, If I run the same script in local mode everything seems to run fine. This is what I see 1 [||100.0%] 5 [||100.0%] 2 [||100.0%] 6 [||100.0%] 3 [||100.0%] 7 [||100.0%] 4 [|||99.3%] 8 [|||99.4%] PID USER PRI NI VIRT RESSHR S CPU% MEM% TIME+ Command 22519 data 20 0 225M 106M 1368 R 99.0 0.20:10.97 python -m pyspark.daemon 22508 data 20 0 225M 106M 1368 R 99.0 0.20:10.92 python -m pyspark.daemon 22513 data 20 0 225M 106M 1368 R 99.0 0.20:11.02 python -m pyspark.daemon 22526 data 20 0 225M 106M 1368 R 99.0 0.20:10.84 python -m pyspark.daemon 22522 data 20 0 225M 106M 1368 R 98.0 0.20:10.95 python -m pyspark.daemon 22523 data 20 0 225M 106M 1368 R 97.0 0.20:10.92 python -m pyspark.daemon 22507 data 20 0 225M 106M 1368 R 97.0 0.20:10.83 python -m pyspark.daemon 22516 data 20 0 225M 106M 1368 R 93.0 0.20:10.88 python -m pyspark.daemon == calculating_pies.py == #!/usr/bin/pyspark import random from pyspark.conf import SparkConf from pyspark.context import SparkContext from pyspark.storagelevel import StorageLevel def pi(NUM_SAMPLE = 100): count = 0. for i in xrange(NUM_SAMPLE): x, y = random.random(), random.random() if (x * x) + (y * y) 1: count += 1 return 4.0 * (count / NUM_SAMPLE) if __name__ == __main__: sconf = (SparkConf() .set('spark.default.parallelism','256') .set('spark.app.name', 'Calculating PI')) # local # sc = SparkContext(conf=sconf) # standalone sc = SparkContext(spark://driver_host:7077, conf=sconf) # yarn # sc = SparkContext(yarn-client, conf=sconf) rdd_pies = sc.parallelize(range(1), 1000) rdd_pies.map(lambda x: pi()).collect() sc.stop() = Does anyone have any suggestions, or know of any config we should be looking at that could solve this problem? Does anyone else see the same problem? Any help is appreciated, Thanks. -- View this message
Re: Spark Streaming testing strategies
Hi Holden Thanks Holden for pointing me the package. Indeed StreamingSuiteBase trait hides a lot, especially regarding clock manipulation. Did you encounter problems with concurrent tests execution from SBT (SPARK-2243)? I had to disable parallel execution and configure SBT to use separate JVM for tests execution (fork). BTW. I added samples for SparkSQL as well. I would expect base trait for testing purposes in spark distribution. ManualClock should be exposed as well. And some documentation how to configure SBT to avoid problems with multiple spark contexts. I'm going to create improvement proposal on Spark issue tracker about it. On 1 March 2015 at 18:49, Holden Karau hol...@pigscanfly.ca wrote: There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com wrote: I have started using Spark and Spark Streaming and I'm wondering how do you test your applications? Especially Spark Streaming application with window based transformations. After some digging I found ManualClock class to take full control over stream processing. Unfortunately the class is not available outside spark.streaming package. Are you going to expose the class for other developers as well? Now I have to use my custom wrapper under spark.streaming package. My Spark and Spark Streaming unit tests strategies are documented here: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Your feedback is more than appreciated. Marcin -- Cell : 425-233-8271 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
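For the SBT side mentioned above, a hedged build.sbt fragment that avoids the multiple-SparkContext problem by forking the test JVM and disabling parallel test execution (key names as of sbt 0.13; the heap settings are placeholders):

// build.sbt (fragment) -- run test suites in a forked JVM, one suite at a time,
// so only a single SparkContext/StreamingContext is live per JVM.
fork in Test := true
parallelExecution in Test := false

// Optional: give the forked test JVM a bit more room for Spark.
javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=256m")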
Compilation error on JavaPairDStream
I am getting the following error. When I look at the sources it seems to be a Scala source, but I am not sure why it's complaining about it. The method map(Function<String,R>) in the type JavaDStream<String> is not applicable for the arguments (new PairFunction<String,String,Integer>(){}) And my code has been taken from the spark examples site: JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } });
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc
Hi, I have a CDH5.3.2(Spark1.2) cluster. I am getting an local class incompatible exception for my spark application during an action. All my classes are case classes(To best of my knowledge) Appreciate any help. Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 346, datanode02): java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc serialVersionUID = 8789839749593513237, local class serialVersionUID = -4145741279224749316 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Thanks Manas
Writing wide parquet file in Spark SQL
Hi All, I am currently trying to write a very wide file into parquet using spark sql. I have 100K column records that I am trying to write out, but of course I am running into space issues(out of memory - heap space). I was wondering if there are any tweaks or work arounds for this. I am basically calling saveAsParquetFile on the schemaRDD. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-wide-parquet-file-in-Spark-SQL-tp21995.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Compilation error
I navigated to maven dependency and found scala library. I also found Tuple2.class and when I click on it in eclipse I get invalid LOC header (bad signature) java.util.zip.ZipException: invalid LOC header (bad signature) at java.util.zip.ZipFile.read(Native Method) I am wondering if I should delete that file from local repo and re-sync On Tue, Mar 10, 2015 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I ran the dependency command and see the following dependencies: I only see org.scala-lang. [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.2.0:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:co mpile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compil e [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile [INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile [INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile [INFO] +- com.twitter:chill_2.10:jar:0.5.0:compile [INFO] | \- com.esotericsoftware.kryo:kryo:jar:2.21:compile [INFO] | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:co mpile [INFO] | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile [INFO] | \- org.objenesis:objenesis:jar:1.2:compile [INFO] +- com.twitter:chill-java:jar:0.5.0:compile [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile [INFO] | | +- org.apache.commons:commons-math:jar:2.1:compile [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile [INFO] | | +- commons-io:commons-io:jar:2.1:compile [INFO] | | +- commons-logging:commons-logging:jar:1.1.1:compile [INFO] | | +- commons-lang:commons-lang:jar:2.5:compile [INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile [INFO] | | | +- commons-collections:commons-collections:jar:3.2.1:compile [INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile [INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile [INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile [INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile [INFO] | | +- org.apache.avro:avro:jar:1.7.4:compile [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | | +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile [INFO] | | \- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile [INFO] | | \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile [INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:co mpile [INFO] | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile [INFO] | | | | +- com.google.inject:guice:jar:3.0:compile [INFO] | | | | | +- javax.inject:javax.inject:jar:1:compile [INFO] | | | | | \- aopalliance:aopalliance:jar:1.0:compile [INFO] | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framew ork-grizzly2:jar:1.9:compile [INFO] | | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-fra mework-core:jar:1.9:compile [INFO] | | | | | | +- 
javax.servlet:javax.servlet-api:jar:3.0.1:compile [INFO] | | | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile [INFO] | | | | | \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:comp ile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-framework:jar:2. 1.2:compile [INFO] | | | | | | \- org.glassfish.gmbal:gmbal-api-only:jar:3.0. 0-b023:compile [INFO] | | | | | | \- org.glassfish.external:management-api:ja r:3.0.0-b012:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-server:jar:2.1 .2:compile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:co mpile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-servlet:jar:2. 1.2:compile [INFO] | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile [INFO] | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile [INFO] | | | | | +- asm:asm:jar:3.1:compile [INFO] | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile [INFO] | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile [INFO] | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile [INFO] | | | | | | \- stax:stax-api:jar:1.0.1:compile [INFO] | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile [INFO] | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] |
Re: Spark Streaming testing strategies
On Tue, Mar 10, 2015 at 1:18 PM, Marcin Kuthan marcin.kut...@gmail.com wrote: Hi Holden Thanks Holden for pointing me the package. Indeed StreamingSuiteBase trait hides a lot, especially regarding clock manipulation. Did you encounter problems with concurrent tests execution from SBT (SPARK-2243)? I had to disable parallel execution and configure SBT to use separate JVM for tests execution (fork). Yah, I haven't used parallel execution with this testing trait, I can look into it some more. BTW. I added samples for SparkSQL as well. Oh awesome :) I would expect base trait for testing purposes in spark distribution. ManualClock should be exposed as well. And some documentation how to configure SBT to avoid problems with multiple spark contexts. I'm going to create improvement proposal on Spark issue tracker about it. Right now I think a package is probably a good place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoughts). On 1 March 2015 at 18:49, Holden Karau hol...@pigscanfly.ca wrote: There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com wrote: I have started using Spark and Spark Streaming and I'm wondering how do you test your applications? Especially Spark Streaming application with window based transformations. After some digging I found ManualClock class to take full control over stream processing. Unfortunately the class is not available outside spark.streaming package. Are you going to expose the class for other developers as well? Now I have to use my custom wrapper under spark.streaming package. My Spark and Spark Streaming unit tests strategies are documented here: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Your feedback is more than appreciated. Marcin -- Cell : 425-233-8271 -- Cell : 425-233-8271
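As a concrete reference, the SBT settings being described (a forked test JVM, no parallel test execution) look roughly like this in sbt 0.13 syntax (a sketch only):

  // build.sbt: run tests sequentially in a separate JVM so that
  // multiple SparkContexts never coexist in the same JVM
  fork in Test := true
  parallelExecution in Test := false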
Re: Compilation error on JavaPairDStream
Ah, that's a typo in the example: use words.mapToPair. I can make a little PR to fix that. On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am getting the following error. When I look at the sources it seems to be a Scala source, but I'm not sure why it's complaining about it. The method map(Function<String,R>) in the type JavaDStream<String> is not applicable for the arguments (new PairFunction<String,String,Integer>(){}) And my code has been taken from the Spark examples site: JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } }); - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
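For clarity, this is roughly what the corrected snippet looks like once map is replaced with mapToPair (a sketch; variable names follow the example above):

  // Count each word in each batch, using mapToPair instead of map
  JavaPairDStream<String, Integer> pairs = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
          return new Tuple2<String, Integer>(s, 1);
        }
      });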
Spark 1.3 SQL Type Parser Changes?
In Spark 1.2 I used to be able to do this: scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>") res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>") org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``>'' expected but `int' found struct<int:bigint> ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
SchemaRDD: SQL Queries vs Language Integrated Queries
I am new to the SchemaRDD class, and I am trying to decide between using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me the main difference between the two approaches, besides the different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
RE: Compilation error
Or another option is to use Scala-IDE, which is built on top of Eclipse, instead of pure Eclipse, so Scala comes with it. Yong From: so...@cloudera.com Date: Tue, 10 Mar 2015 18:40:44 + Subject: Re: Compilation error To: mohitanch...@gmail.com CC: t...@databricks.com; user@spark.apache.org A couple points: You've got mismatched versions here -- 1.2.0 vs 1.2.1. You should fix that but it's not your problem. These are also supposed to be 'provided' scope dependencies in Maven. You should get the Scala deps transitively and can import scala.* classes. However, it would be a little bit more correct to depend directly on the scala library classes, but in practice, easiest not to in simple use cases. If you're still having trouble look at the output of mvn dependency:tree On Tue, Mar 10, 2015 at 6:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); } - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
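For reference, a rough sketch of what the POM looks like with Sean's two suggestions applied (aligned versions and provided scope); the exact version number is whatever release you actually target:

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

Note that provided scope assumes the application is submitted to a cluster that already ships the Spark jars; for running inside Eclipse you may want to keep them at compile scope.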
Re: Compilation error
I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Compilation error
See if you can import scala libraries in your project. On Tue, Mar 10, 2015 at 11:32 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
ec2 persistent-hdfs with ebs using spot instances
Hello, I'm new to ec2. I've set up a spark cluster on ec2 and am using persistent-hdfs with the data nodes mounting ebs. I launched my cluster using spot-instances ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.0321 --hadoop-major-version=2 --copy-aws-credentials --ebs-vol-size=100 launch mysparkcluster My question is, if the spot-instances get dropped, and I try and attach new slaves to existing master with --use-existing-master, can I mount those new slaves to the same ebs volumes? I'm guessing not. If somebody has experience with this, how is it done? Thanks. Sincerely, Deb
How to pass parameter to spark-shell when choose client mode --master yarn-client
Hi All, I am trying to pass parameters to spark-shell for some tests: spark-shell --driver-memory 512M --executor-memory 4G --master spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf spark.sql.parquet.binaryAsString=true This works fine on my local PC. But when I start EMR and pass the similar options: ~/spark/bin/spark-shell --executor-memory 40G --master yarn-client --num-executors 1 --executor-cores 32 --conf spark.sql.parquet.compression.codec=snappy --conf spark.sql.parquet.binaryAsString=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer the parameters don't seem to reach spark-shell. Does anyone know why? And what are the alternatives for doing this? Regards, Shuai
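One way to see whether the flags actually reached the driver, and to set the SQL options programmatically if they did not (a sketch; it assumes a SQLContext or HiveContext named sqlContext is available in the shell, otherwise create one from sc first):

  // check what the driver actually received
  sc.getConf.getOption("spark.serializer")                  // None if unset
  sc.getConf.getOption("spark.sql.parquet.binaryAsString")

  // SQL-specific options can also be set directly on the SQLContext
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

Another option is to put the settings in conf/spark-defaults.conf on the machine that launches the shell instead of passing them on the command line.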
Re: Setting up Spark with YARN on EC2 cluster
Harika, I think you can modify an existing spark-on-ec2 cluster to run YARN MapReduce; not sure if this is what you are looking for. To try: 1) log on to the master 2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/ and add this to mapred-site.xml:
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
3) use copy-dir to copy this file over to the slaves (don't know if this step is necessary) eg. ~/spark-ec2/copy-dir.sh ~/ephemeral-hdfs/conf/mapred-site.xml 4) stop and restart hdfs (for persistent-hdfs it wasn't started to begin with) ephemeral-hdfs]$ ./sbin/stop-all.sh ephemeral-hdfs]$ ./sbin/start-all.sh HTH Deb On Wed, Feb 25, 2015 at 11:46 PM, Harika matha.har...@gmail.com wrote: Hi, I want to set up a Spark cluster with a YARN dependency on Amazon EC2. I was reading this https://spark.apache.org/docs/1.2.0/running-on-yarn.html document and I understand that Hadoop has to be set up for running Spark with YARN. My questions - 1. Do we have to set up a Hadoop cluster on EC2 and then build Spark on it? 2. Is there a way to modify the existing Spark cluster to work with YARN? Thanks in advance. Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Compilation error
I ran the dependency command and see the following dependencies: I only see org.scala-lang. [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.2.0:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:co mpile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compil e [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile [INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile [INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile [INFO] +- com.twitter:chill_2.10:jar:0.5.0:compile [INFO] | \- com.esotericsoftware.kryo:kryo:jar:2.21:compile [INFO] | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:co mpile [INFO] | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile [INFO] | \- org.objenesis:objenesis:jar:1.2:compile [INFO] +- com.twitter:chill-java:jar:0.5.0:compile [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile [INFO] | | +- org.apache.commons:commons-math:jar:2.1:compile [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile [INFO] | | +- commons-io:commons-io:jar:2.1:compile [INFO] | | +- commons-logging:commons-logging:jar:1.1.1:compile [INFO] | | +- commons-lang:commons-lang:jar:2.5:compile [INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile [INFO] | | | +- commons-collections:commons-collections:jar:3.2.1:compile [INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile [INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile [INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile [INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile [INFO] | | +- org.apache.avro:avro:jar:1.7.4:compile [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | | +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile [INFO] | | \- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile [INFO] | | \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile [INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:co mpile [INFO] | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile [INFO] | | | | +- com.google.inject:guice:jar:3.0:compile [INFO] | | | | | +- javax.inject:javax.inject:jar:1:compile [INFO] | | | | | \- aopalliance:aopalliance:jar:1.0:compile [INFO] | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framew ork-grizzly2:jar:1.9:compile [INFO] | | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-fra mework-core:jar:1.9:compile [INFO] | | | | | | +- javax.servlet:javax.servlet-api:jar:3.0.1:compile [INFO] | | | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile [INFO] | | | | | \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:comp ile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-framework:jar:2. 1.2:compile [INFO] | | | | | | \- org.glassfish.gmbal:gmbal-api-only:jar:3.0. 
0-b023:compile [INFO] | | | | | | \- org.glassfish.external:management-api:ja r:3.0.0-b012:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-server:jar:2.1 .2:compile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:co mpile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-servlet:jar:2. 1.2:compile [INFO] | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile [INFO] | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile [INFO] | | | | | +- asm:asm:jar:3.1:compile [INFO] | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile [INFO] | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile [INFO] | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile [INFO] | | | | | | \- stax:stax-api:jar:1.0.1:compile [INFO] | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile [INFO] | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] | | | | | | \- javax.activation:activation:jar:1.1:compile [INFO] | | | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile [INFO] | | | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile [INFO] | | | | \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile [INFO] | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:comp ile [INFO] | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:c ompile [INFO] | +-
Re: ANSI Standard Supported by the Spark-SQL
From the archives of this user list, it seems that Spark SQL has yet to reach SQL-92 level. But a few things are still not clear. 1. This is from an old post dated Aug 09, 2014. 2. It clearly says that DDL and DML operations are not supported. Does that mean all reads (SELECT) are SQL-92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to Spark and trying to understand which SQL standard Spark supports. I googled around a lot but didn't get a clear answer. In some places I saw that Spark supports SQL-92, and in others that it is not fully compliant with SQL-92. I also noticed that using HiveContext I can execute almost all the queries supported by Hive, so does that mean Spark is equivalent to Hive in terms of SQL standards, given that HiveContext is used for querying? Please suggest. Regards, Ravi
Re: Spark with Spring
It will be good if you can explain the entire usecase like what kind of requests, what sort of processing etc. Thanks Best Regards On Mon, Mar 9, 2015 at 11:18 PM, Tarun Garg bigdat...@live.com wrote: Hi, I have a existing web base system which receives the request and process that. This framework uses Spring framework. Now i am planning to separate this business logic and out that in Spark Streaming. I am not sure using Spring framework in streaming is how much valuable. Any suggestion is welcome. Thanks Tarun
Re: Joining data using Latitude, Longitude
Are you using SparkSQL for the join? In that case I'm not quite sure you have many options for joining on the nearest co-ordinate. If you are using normal Spark code (by creating a key pair on lat,lon) you can apply logic such as trimming the lat,lon etc. If you need more precise computation then you are better off using the haversine formula. http://www.movable-type.co.uk/scripts/latlong.html
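For reference, a small Scala sketch of the haversine distance mentioned above (great-circle distance in kilometres between two lat/lon points given in degrees; names are illustrative):

  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val R = 6371.0 // mean Earth radius in km
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * R * math.asin(math.sqrt(a))
  }

You could use this inside a custom join or filter step once candidate pairs have been narrowed down by a cheaper bucketing scheme.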
Re: Spark History server default conf values
What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015): // Interval between each check for event log updates private val UPDATE_INTERVAL_MS = conf.getInt(spark.history.fs.updateInterval, conf.getInt(spark.history.updateInterval, 10)) * 1000 private val retainedApplications = conf.getInt(spark.history.retainedApplications, 50) On Tue, Mar 10, 2015 at 12:37 AM Srini Karri skarri@gmail.com wrote: Hi All, What are the default values for the following conf properities if we don't set in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.
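So if nothing is set, those properties fall back to 10 (seconds) and 50 retained applications. To change them, you would put something like this in conf/spark-defaults.conf on the history server host (the values here are only examples):

  spark.history.fs.updateInterval      10
  spark.history.retainedApplications   500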
Re: Spark Streaming testing strategies
I would expect a base trait for testing purposes in the Spark distribution. ManualClock should be exposed as well, along with some documentation on how to configure SBT to avoid problems with multiple Spark contexts. I'm going to create an improvement proposal on the Spark issue tracker about it. Right now I think a package is probably a good place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoughts). You are right, let's wait for community feedback. I'm quite new to Spark, but it is strange to me that good support for isolated testing is not a first-class citizen. In the Spark Programming Guide the chapter about unit testing is no more than three sentences :-( The Spring Framework taught me how important it is to be able to run tests from your IDE, and how important test execution performance is (there are also problems with parallelism under a single JVM). - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark History server default conf values
Thank you Charles and Meethu. On Tue, Mar 10, 2015 at 12:47 AM, Charles Feduke charles.fed...@gmail.com wrote: What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015): // Interval between each check for event log updates private val UPDATE_INTERVAL_MS = conf.getInt(spark.history.fs.updateInterval, conf.getInt(spark.history.updateInterval, 10)) * 1000 private val retainedApplications = conf.getInt(spark.history.retainedApplications, 50) On Tue, Mar 10, 2015 at 12:37 AM Srini Karri skarri@gmail.com wrote: Hi All, What are the default values for the following conf properities if we don't set in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.
Hadoop Map vs Spark stream Map
Hi, I am trying to understand the Hadoop map method compared to Spark's, and I noticed that Spark's map function only involves 3 types: 1) input value 2) output key 3) output value, however the Hadoop map has 4: 1) input key 2) input value 3) output key 4) output value. Is there any reason it was designed this way? Just trying to understand. Hadoop: public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter) Spark: // Count each word in each batch JavaPairDStream<String, Integer> pairs = words.mapToPair( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } });
Re: Compilation error on JavaPairDStream
works now. I should have checked :) On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen so...@cloudera.com wrote: Ah, that's a typo in the example: use words.mapToPair I can make a little PR to fix that. On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am getting following error. When I look at the sources it seems to be a scala source, but not sure why it's complaining about it. The method map(FunctionString,R) in the type JavaDStreamString is not applicable for the arguments (new PairFunctionString,String,Integer(){}) And my code has been taken from the spark examples site: JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } });
Re: Spark 1.3 SQL Type Parser Changes?
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filled a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote: In Spark 1.2 I used to be able to do this: scala org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(structint:bigint) res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell HiveTypeHelper.toDataType(structint:bigint) org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``'' expected but `int' found structint:bigint ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 https://app.relateiq.com/r?c=chrome_gmailurl=https%3A%2F%2Fgist.github.com%2Fnitay%2F460b41ed5fd7608507f5t=AFwhZf262cJFT8YSR54ZotvY2aTmpm_zHTSKNSd4jeT-a6b8q-yMXQ-BqEX9-Ym54J1bkDFiFOXyRKsNxXoDGIh7bhqbBVKsGGq6YTJIfLZxs375XXPdS13KHsE_3Lffk4UIFkRFZ_7c This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide in using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference between the two approaches, besides using different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
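To make the equivalence concrete, here is the same query written both ways (a sketch; it assumes a SchemaRDD named people with name and age columns, registered as a temp table, and the usual sqlContext implicits imported):

  people.registerTempTable("people")

  // SQL string query
  val adultsSql = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

  // language-integrated (DSL) query on the same SchemaRDD
  val adultsDsl = people.where('age >= 18).select('name, 'age)

Both produce a SchemaRDD and go through the same Catalyst optimizer, which is why the performance is the same.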
Re: Joining data using Latitude, Longitude
There are some techniques you can use if you geohash http://en.wikipedia.org/wiki/Geohash the lat-lngs. They will naturally be sorted by proximity (with some edge cases so watch out). If you go the join route, either by trimming the lat-lngs or geohashing them, you're essentially grouping nearby locations into buckets, but you have to consider the borders of the buckets since the nearest location may actually be in an adjacent bucket. Here's a paper that discusses an implementation: http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf On Mar 9, 2015, at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest co-ordinate. If you are using the normal Spark code (by creating key-pair on lat,lon) you can apply certain logic like trimming the lat,lon etc. If you want more specific computing then you are better off using haversine formula. http://www.movable-type.co.uk/scripts/latlong.html
RE: Spark SQL Stackoverflow error
import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/** Examine the collected tweets and train a model based on them. */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    val outputModelDir = "C:\\outputmode111"
    val tweetInput = "C:\\test"
    val numClusters = 10
    val numIterations = 20
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
      .setMaster("local[4]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val tweets = sc.textFile(tweetInput)
    val vectors = tweets.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.
    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)
    val some_tweets = tweets.take(2)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}

From: lovelylavs [via Apache Spark User List] Sent: Sunday, March 08, 2015 2:34 AM To: Jishnu Menath Prathap (WT01 - BAS) Subject: Re: Spark SQL Stackoverflow error Thank you so much for your reply. If it is possible can you please provide me with the code? Thank you so much. Lavanya.
From: Jishnu Prathap [via Apache Spark User List] Sent: Sunday, March 1, 2015 3:03 AM To: Nadikuda, Lavanya Subject: RE: Spark SQL Stackoverflow error Hi, the issue was not fixed. I removed the SQL layer in between and directly created features from the file. Regards Jishnu Prathap
From: lovelylavs [via Apache Spark User List] Sent: Sunday, March 01, 2015 4:44 AM To: Jishnu Menath Prathap (WT01 - BAS) Subject: Re: Spark SQL Stackoverflow error Hi, how was this issue fixed?
Re: Workaround for spark 1.2.X roaringbitmap kryo problem?
Does anyone know how to get the HighlyCompressedMapStatus to compile? I will try turning off kryo in 1.2.0 and hope things don't break. I want to benefit from the MapOutputTracker fix in 1.2.0. On Tue, Mar 3, 2015 at 5:41 AM, Imran Rashid iras...@cloudera.com wrote: the scala syntax for arrays is Array[T], not T[], so you want to use something: kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) kryo.register(classOf[Array[Short]]) nonetheless, the spark should take care of this itself. I'll look into it later today. On Mon, Mar 2, 2015 at 2:55 PM, Arun Luthra arun.lut...@gmail.com wrote: I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote: Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray]) kryo.register(classOf[org.roaringbitmap.ArrayContainer]) kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus]) kryo.register(classOf[org.roaringbitmap.RoaringArray$Element]) kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]]) kryo.register(classOf[short[]]) in build file: libraryDependencies += org.roaringbitmap % RoaringBitmap % 0.4.8 This fails to compile: ...:53: identifier expected but ']' found. [error] kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]]) also: :54: identifier expected but ']' found. [error] kryo.register(classOf[short[]]) also: :51: class HighlyCompressedMapStatus in package scheduler cannot be accessed in package org.apache.spark.scheduler [error] kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus]) Suggestions? Arun
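One possible way to sidestep the compile errors above (an untested sketch, not something confirmed in the thread): look the problematic classes up reflectively instead of with classOf, since Class.forName is not affected by Scala's array syntax or by the private[spark] modifier on HighlyCompressedMapStatus at compile time:

  kryo.register(Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"))
  kryo.register(Class.forName("org.roaringbitmap.RoaringArray$Element"))
  kryo.register(Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;")) // array of Element
  kryo.register(classOf[Array[Short]])

Whether this works around SPARK-5949 in practice is something you would have to verify.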
Re: Spark 1.3 SQL Type Parser Changes?
Hi Nitay, Can you try using backticks to quote the column name? Like org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType( struct`int`:bigint)? Thanks, Yin On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filled a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote: In Spark 1.2 I used to be able to do this: scala org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(structint:bigint) res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell HiveTypeHelper.toDataType(structint:bigint) org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``'' expected but `int' found structint:bigint ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 https://app.relateiq.com/r?c=chrome_gmailurl=https%3A%2F%2Fgist.github.com%2Fnitay%2F460b41ed5fd7608507f5t=AFwhZf262cJFT8YSR54ZotvY2aTmpm_zHTSKNSd4jeT-a6b8q-yMXQ-BqEX9-Ym54J1bkDFiFOXyRKsNxXoDGIh7bhqbBVKsGGq6YTJIfLZxs375XXPdS13KHsE_3Lffk4UIFkRFZ_7c This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
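In other words (with the quotes and angle brackets that the archive stripped out restored), the suggestion is roughly:

  org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")

or the equivalent call through the HiveTypeHelper wrapper mentioned above.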
SQL with Spark Streaming
Does Spark Streaming also support SQL? Something like how Esper does CEP.
RE: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Oh, sorry, my bad, currently Spark SQL doesn't provide the user interface for UDAF, but it can work seamlessly with Hive UDAF (via HiveContext). I am also working on the UDAF interface refactoring, after that we can provide the custom interface for extension. https://github.com/apache/spark/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function), not UDTF (user defined table function). I would appreciate it if you could point me to some starting point on UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn't provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, hopefully we will provide a Hive-independent UDTF soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class, and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
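For the Hive route, a UDAF that is already on the classpath can be registered from the HiveContext with plain HiveQL, roughly like this (a sketch; the jar path, class name and function name are made up for illustration):

  // the jar can also be shipped via --jars on spark-submit / spark-shell
  hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
  hiveContext.sql("CREATE TEMPORARY FUNCTION my_sum AS 'com.example.MySumUDAF'")
  hiveContext.sql("SELECT key, my_sum(value) FROM records GROUP BY key")

This still requires building the jar, but the registration itself is programmatic.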
RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
I am not so sure if Hive supports change the metastore after initialized, I guess not. Spark SQL totally rely on Hive Metastore in HiveContext, probably that's why it doesn't work as expected for Q1. BTW, in most of cases, people configure the metastore settings in hive-site.xml, and will not change that since then, is there any reason that you want to change that in runtime? For Q2, probably something wrong in configuration, seems the HDFS run into the pseudo/single node mode, can you double check that? Or can you run the DDL (like create a table) from the spark shell with HiveContext? From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 10, 2015 6:38 PM To: user; d...@spark.apache.org Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse? I'm using Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance to different warehouse locations. Below are the steps for my test (Assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w) scala sqlContext.sql(SELECT * from src).saveAsTable(table1) scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w2) scala sqlContext.sql(SELECT * from src).saveAsTable(table2) == After these steps, the tables are stored in /test/w only. I expect table2 to be stored in /test/w2 folder. Another question is: if I set hive.metastore.warehouse.dir to a HDFS folder, I cannot use saveAsTable()? Is this by design? Exception stack trace is below: == 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74 java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///file:///\\ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:370) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC.init(console:35) at $iwC.init(console:37) at init(console:39) Thank you very much!
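If the goal is simply a non-default warehouse location, the usual approach is to set it before the HiveContext is created, for example in conf/hive-site.xml on the driver's classpath (a sketch; the path is illustrative):

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://server:8020/test/w</value>
  </property>

Switching the warehouse per table at runtime is, as noted above, not something the Hive metastore is really designed for.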
Numbering RDD members Sequentially
I have a Hadoop InputFormat which reads records and produces a JavaPairRDD<String,String> locatedData where _1() is a formatted version of the file location - like 12690,, 24386 .27523 ... and _2() is the data to be processed. For historical reasons I want to convert _1() into an integer representing the record number, so keys become 0001, 0002 ... (Yes, I know this cannot be done in parallel.) The PairRDD may be too large to collect and work on in the driver, but small enough to handle on a single machine. I could use toLocalIterator to guarantee execution on one machine, but last time I tried this all kinds of jobs were launched to get the next element of the iterator and I was not convinced this approach was efficient. Any bright ideas?
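One option worth looking at (a sketch, in Scala for brevity; the same zipWithIndex/zipWithUniqueId methods exist on JavaRDD/JavaPairRDD): RDD.zipWithIndex numbers the elements in their current partition order without collecting anything to the driver, at the cost of one extra small job to compute per-partition offsets:

  // assumes locatedData: RDD[(String, String)] and that sorting by the
  // existing location key gives the record order you want
  val renumbered = locatedData
    .sortByKey()
    .zipWithIndex()                               // ((locationKey, value), index)
    .map { case ((_, value), index) => (f"$index%06d", value) }

If the numbers only need to be unique rather than consecutive, zipWithUniqueId avoids the extra job entirely.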
RE: SQL with Spark Streaming
Intel has a prototype for doing this, SaiSai and Jason are the authors. Probably you can ask them for some materials. From: Mohit Anchlia [mailto:mohitanch...@gmail.com] Sent: Wednesday, March 11, 2015 8:12 AM To: user@spark.apache.org Subject: SQL with Spark Streaming Does Spark Streaming also supports SQLs? Something like how Esper does CEP.
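Apart from that prototype, a common pattern is to run Spark SQL per micro-batch from inside foreachRDD; a rough sketch (Spark 1.2-style API; lines is assumed to be a DStream[String] of comma-separated records):

  case class Event(user: String, amount: Double)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD

  lines.foreachRDD { rdd =>
    val events = rdd.map { l => val p = l.split(","); Event(p(0), p(1).toDouble) }
    events.registerTempTable("events")
    sqlContext.sql("SELECT user, SUM(amount) FROM events GROUP BY user")
      .collect().foreach(println)
  }

Note this gives you per-batch SQL, not the window/pattern semantics of a CEP engine like Esper; windowed computations would still be expressed with the DStream API.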