Re: spark mesos deployment : starting workers based on attributes
Hi,

Created issue: https://issues.apache.org/jira/browse/SPARK-6707

I would really appreciate ideas/views/opinions on this feature.

-- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
Hi Ankur, There isn't a way to do that yet, but it's simple to add. Can you create a JIRA in Spark for this? Thanks! Tim

On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan achau...@brightcove.com wrote:
Hi, I am trying to figure out if there is a way to tell the Mesos scheduler in Spark to isolate the workers to a set of Mesos slaves that have a given attribute, such as `tachyon:true`. Does anyone know if that is possible, or how I could achieve such behavior? Thanks! -- Ankur Chauhan
Re: conversion from java collection type to scala JavaRDD<Object>
Hi, I have tried with parallelize but I got the below exception:

    java.io.NotSerializableException: pacific.dr.VendorRecord

Here is my code:

    List<VendorRecord> vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
    JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);

On 2 April 2015 at 21:11, Dean Wampler deanwamp...@gmail.com wrote:
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com wrote:
Hi All, Is there a way to make a JavaRDD<Object> from an existing Java collection of type List<Object>? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra
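For what it's worth, the usual fix for the NotSerializableException above is to make the element type serializable, since Spark must serialize every element it ships to executors. A minimal sketch under that assumption (the VendorRecord fields here are invented; in Java the equivalent is `implements java.io.Serializable`):

    import scala.collection.JavaConverters._
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical stand-in for pacific.dr.VendorRecord; the key point is
    // that the class must be serializable (Scala case classes are by default).
    case class VendorRecord(id: String, name: String)

    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    val vendorRecords: java.util.List[VendorRecord] =
      java.util.Arrays.asList(VendorRecord("1", "a"), VendorRecord("2", "b"))

    // JavaSparkContext.parallelize accepts the java.util.List directly;
    // from Scala we convert it to a Seq first.
    val lines = sc.parallelize(vendorRecords.asScala)
    println(lines.count())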
Spark Vs MR
How is Spark faster than MR when data is on disk in both cases?
Re: Migrating from Spark 0.8.0 to Spark 1.3.0
It shouldn't be too bad - the pertinent migration notes are here: http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark for pre-1.0, and here: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 for Spark SQL pre-1.3. Since you aren't using Spark SQL, the second link is probably not useful. Generally you should find very few changes in the core API, but things like MLlib will have changed a fair bit - though again, the API should have been relatively stable. Your biggest change is probably going to be running jobs through spark-submit rather than spark-class etc.: http://spark.apache.org/docs/latest/submitting-applications.html

On Sat, Apr 4, 2015 at 1:11 AM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:
Hi, Are there any tutorials that explain all the changelogs between Spark 0.8.0 and Spark 1.3.0, and how should we approach this migration?
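For reference, a typical spark-submit invocation replacing an old spark-class launch; the class name, master URL, jar path, and arguments below are placeholders:

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      path/to/my-app-assembly.jar arg1 arg2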
Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure
Hi Spark Users, I'm testing the new 1.3 feature of Parquet partition discovery. I have 2 sub-folders, each with 800 rows:

    /data/table1/key=1
    /data/table1/key=2

In spark-shell, I run:

    val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet")
    t.count

It shows 1600 successfully. But after that, I add a new folder /data/table1/key=3 and run t.count again; it still gives me 1600, not 2400. If I restart spark-shell and then run

    val t = sqlContext.table("table1")
    t.count

it's 2400 now. I'm wondering if there is a partition cache in the driver. I tried setting spark.sql.parquet.cacheMetadata to false and testing again, but unfortunately it doesn't help. How can I disable this partition cache or force a refresh of the cache? Thanks
Re: 4 seconds to count 13M lines. Does it make sense?
Reduce spark.sql.shuffle.partitions from the default of 200 to the total number of cores.
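A minimal sketch of applying that setting (the value 8 stands in for your actual core count):

    // In spark-shell or application code:
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")
    // or equivalently via SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=8")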
Re: Spark Vs MR
If data is on HDFS, it is not read any more or less quickly by either framework. Both in fact use the same logic to exploit locality, and read and deserialize the data either way. I don't think this is what anyone claims, though. Spark can be faster in a multi-stage operation that would require several MR jobs: the MR jobs must hit disk again after each reducer, whereas Spark might not, possibly by persisting outputs in memory. A similar but larger speedup can be had for iterative computations that access the same data repeatedly; caching it means reading it from disk once, then re-reading it from memory only. For a single operation that really is one map and one reduce, starting and ending on HDFS, I would expect MR to be a bit faster just because it is so optimized for this one pattern - though even that depends a lot, and the difference wouldn't be significant.

On Sat, Apr 4, 2015 at 11:19 AM, SamyaMaiti samya.maiti2...@gmail.com wrote:
How is Spark faster than MR when data is on disk in both cases?
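A small sketch of the iterative case described above, where caching turns repeated disk reads into memory reads (paths and predicates are illustrative):

    val data = sc.textFile("hdfs:///logs/*.log").cache() // read from HDFS once
    // Subsequent actions over the same data hit memory, not disk:
    val errors = data.filter(_.contains("ERROR")).count()
    val warnings = data.filter(_.contains("WARN")).count()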
Need help with ALS Recommendation code
Hi, I am trying to run the following command in the Movie Recommendation example provided by the ampcamp tutorial.

Command: sbt package run /movielens/medium

Exception:

    sbt.TrapExitSecurityException thrown from the UncaughtExceptionHandler in thread run-main-0
    java.lang.RuntimeException: Nonzero exit code: 1
        at scala.sys.package$.error(package.scala:27)
    [trace] Stack trace suppressed: run last compile:run for the full output.
    [error] (compile:run) Nonzero exit code: 1
    [error] Total time: 0 s, completed Apr 4, 2015 12:00:18 PM

I am unable to identify the error code. Can someone help me with this? Regards Phani Kumar
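One thing worth checking (an assumption, since the full output isn't shown): on the sbt command line, each whitespace-separated word is treated as a separate task, so the program arguments must be quoted together with run for the path to reach your main method:

    sbt package "run /movielens/medium"
    # or, from the interactive sbt shell:
    # > run /movielens/medium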
Re: newAPIHadoopRDD Multiple scan result return from HBase
Here is my conf object, passed as the first parameter of the API. But here I want to pass multiple scans; I have 4 criteria for STARTROW and STOPROW in the same table. Using the code below I can get the result for one STARTROW and ENDROW.

    Configuration conf = DBConfiguration.getConf();
    // int scannerTimeout = (int) conf.getLong(
    //     HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
    // System.out.println("lease timeout on server is " + scannerTimeout);
    int scannerTimeout = (int) conf.getLong("hbase.client.scanner.timeout.period", -1);
    // conf.setLong("hbase.client.scanner.timeout.period", 6L);
    conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);
    Scan scan = new Scan();
    scan.addFamily(FAMILY);
    FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
    filterList.addFilter(new KeyOnlyFilter());
    filterList.addFilter(new FirstKeyOnlyFilter());
    scan.setFilter(filterList);
    scan.setCacheBlocks(false);
    scan.setCaching(10);
    scan.setBatch(1000);
    scan.setSmall(false);
    conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan));
    return conf;

On 4 April 2015 at 20:54, Jeetendra Gangele gangele...@gmail.com wrote:
Hi All, Can we get the result of multiple scans from JavaSparkContext.newAPIHadoopRDD from HBase? This method's first parameter takes a configuration object, where I have added a filter. But how can I query multiple scans from the same table while calling this API only once? regards jeetendra
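Since TableInputFormat only honors the single Scan serialized into the configuration, one workaround (a sketch in Scala; the same pattern works from Java with JavaSparkContext) is to call newAPIHadoopRDD once per row-key range and union the results. The ranges below are illustrative, and DBConfiguration.getConf, TABLE_NAME, and DatabaseUtils.convertScanToString are the helpers from the code above:

    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val ranges = Seq(("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")) // 4 start/stop row pairs
    val rdds = ranges.map { case (start, stop) =>
      val scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(stop))
      val conf = DBConfiguration.getConf() // reuse the conf builder above
      conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
      conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan))
      sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
    }
    val all = sc.union(rdds) // one RDD covering all four ranges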
Re: Parquet timestamp support for Hive?
Avoiding having to maintain a separate Hive version was one of the initial goals of Spark SQL. (We had once done this for Shark.) The org.spark-project.hive:hive-0.13.1a artifact only cleans up some 3rd-party dependencies to avoid dependency hell in Spark; this artifact is exactly the same as Hive 0.13.1 at the source level. On the other hand, we're planning to add a Hive metastore adapter layer to Spark SQL so that in the future we can talk to arbitrary versions (greater than or equal to 0.13.1) of the Hive metastore, and then always stick to the most recent Hive version to provide the most recent Hive features. This will probably happen in Spark 1.4 or 1.5. Cheng

On 4/3/15 7:59 PM, Rex Xiong wrote:
Hi, I got this error when creating a Hive table from a Parquet file:

    DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384

I checked HIVE-6384; it's fixed in 0.14. The Hive in the Spark build is a customized version 0.13.1a (GroupId: org.spark-project.hive). Is it possible to get the source code for it and apply the patch from HIVE-6384? Thanks
Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError
I think this is a bug of Spark SQL dating back to at least 1.1.0. The json_tuple function is implemented as org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The ClassNotFoundException should complain with the class name rather than the UDTF function name. The problematic line should be this one: https://github.com/apache/spark/blob/9b40c17ab161b64933539abeefde443cb4f98673/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala#L1288 - HiveFunctionWrapper expects the fully qualified class name of the UDTF class that implements the function, but we pass in the function name. Thanks for reporting this! Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:
I have a feeling I'm missing a jar that provides the support, or could this be related to https://issues.apache.org/jira/browse/SPARK-5792? If it is a jar, where would I find that? I would have thought in the $HIVE/lib folder, but I'm not sure which jar contains it.

Error:

    Create Metric Temporary Table for querying
    15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implementation class: org.apache.hadoop.hive.metastore.ObjectStore
    15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
    15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
    15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
    15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
    15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
    15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
    15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
    15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
    15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
    15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
    15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
    15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
    15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
    15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
    15/04/01 14:41:46 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
    15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
    15/04/01 14:41:47 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
    15/04/01 14:41:47 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
    15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore
    15/04/01 14:41:47 INFO HiveMetaStore: Added admin role in metastore
    15/04/01 14:41:47 INFO HiveMetaStore: Added public role in metastore
    15/04/01 14:41:48 INFO HiveMetaStore: No user is added in admin role, since config is empty
    15/04/01 14:41:48 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
    15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
    15/04/01 14:41:49 INFO ParseDriver: Parse Completed
    Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:272)
        at
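To make the diagnosis above concrete, a simplified sketch of the difference (illustrative only, not the actual Spark source):

    // What HiveQl effectively does today: the SQL function name is passed,
    // so the class loader looks for a class literally named "json_tuple":
    //   HiveFunctionWrapper("json_tuple")  // -> ClassNotFoundException: json_tuple
    // What it should receive: the implementing UDTF class name:
    //   HiveFunctionWrapper(classOf[GenericUDTFJSONTuple].getName)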
Re: Spark Streaming FileStream Nested File Support
We have a custom version/build of Spark Streaming that does the nested S3 lookups faster (it uses native S3 APIs). You can find the source code over here: https://github.com/sigmoidanalytics/spark-modified - in particular, the changes from here: https://github.com/sigmoidanalytics/spark-modified/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206. And the binary jars here: https://github.com/sigmoidanalytics/spark-modified/tree/master/lib

Here are the instructions to use it. This is how you create your stream:

    val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")

You need ACCESS_KEY and SECRET_KEY in the environment for this to work. Also, by default it is recursive. You also need these jars (https://github.com/sigmoidanalytics/spark-modified/tree/master/lib) in the SPARK_CLASSPATH:

    aws-java-sdk-1.8.3.jar
    aws-java-sdk-1.9.24.jar
    aws-java-sdk-core-1.9.24.jar
    aws-java-sdk-s3-1.9.24.jar
    httpclient-4.2.5.jar
    httpcore-4.3.2.jar
    joda-time-2.6.jar
    spark-streaming_2.10-1.2.0.jar

Let me know if you need any more clarification/information on this; feel free to suggest changes. Thanks, Best Regards

On Sat, Apr 4, 2015 at 3:30 AM, Tathagata Das t...@databricks.com wrote:
Yes, it can definitely be added. I just haven't gotten around to doing it :) There is a proposal for this that you can try: https://github.com/apache/spark/pull/2765/files - have you reviewed it at some point?

On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote:
That doesn't seem like a good solution unfortunately, as I would need this to work in a production environment. Do you know why the limitation exists for FileInputDStream in the first place? Unless I'm missing something important about how some of the internals work, I don't see why this feature couldn't be added at some point.

On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com wrote:
A sort-of-hacky workaround is to use a queueStream, where you can manually create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note that this is for testing only, as queueStream does not work with driver fault recovery. TD

On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:
So after pulling my hair out for a bit trying to convert one of my standard Spark jobs to streaming, I found that FileInputDStream does not support nested folders (see the brief mention here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources - the fileStream method returns a FileInputDStream). Before, for my standard job, I was reading from, say, s3n://mybucket/2015/03/02/*log, and could also modify it to simply get an entire month's worth of logs. Since the logs are split up based on their date, when the batch ran for the day I simply passed in a date parameter to make sure I was reading the correct data. But since I want to turn this job into a streaming job, I need to do something like s3n://mybucket/*log. This would work fine if it were a standard Spark application, but it fails for streaming. Is there any way I can get around this limitation?
newAPIHadoopRDD Multiple scan result return from HBase
Hi All, Can we get the result of multiple scans from JavaSparkContext.newAPIHadoopRDD from HBase? This method's first parameter takes a configuration object, where I have added a filter. But how can I query multiple scans from the same table while calling this API only once? regards jeetendra
Re: Parquet Hive table become very slow on 1.3?
Hey Xudong, We had been digging into this issue for a while, and believe PR 5339 (http://github.com/apache/spark/pull/5339) and PR 5334 (http://github.com/apache/spark/pull/5334) should fix it. There are two problems:

1. Normally we cache Parquet table metadata for better performance, but when converting Hive metastore Parquet tables, the cache is not used. Thus heavy operations like schema discovery are done every time a metastore Parquet table is converted.

2. With Parquet task-side metadata reading (which is turned on by default), we can actually skip the row group information in the footer. However, we accidentally called a Parquet function which doesn't skip row group information.

For your question about schema merging: Parquet allows different part-files to have different but compatible schemas. For example, part-1.parquet has columns a and b, while part-2.parquet may have columns a and c. In some cases the summary files (_metadata and _common_metadata) contain the merged schema (a, b, and c), but it's not guaranteed. For example, when the user-defined metadata stored in different part-files contains different values for the same key, Parquet simply gives up writing summary files. That's why all part-files must be touched to get a precise merged schema. However, in scenarios where a centralized arbitrative schema is available (e.g. a Hive metastore schema, or the schema provided by the user via data source DDL), we don't need to do schema merging on the driver side, but can defer it to the executor side, where each task only needs to reconcile those part-files it touches. This is also what the Parquet developers did recently for parquet-hadoop: https://github.com/apache/incubator-parquet-mr/pull/45. Cheng

On 3/31/15 11:49 PM, Zheng, Xudong wrote:
Thanks Cheng! Setting 'spark.sql.parquet.useDataSourceApi' to false resolves my issue, but the PR 5231 patch does not. Not sure whether I did anything else wrong ... BTW, we are actually very interested in the schema merging feature in Spark 1.3, and both of these solutions will disable this feature, right? It seems that Parquet metadata is stored in a file named _metadata in the Parquet file folder (each folder is a partition, as we use partitioned tables), so why do we need to scan all Parquet part-files? Is there any other solution that could keep the schema merging feature at the same time? We really like this feature :)

On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian lian.cs@gmail.com wrote:
Hi Xudong, This is probably because Parquet schema merging is turned on by default. This is generally useful for Parquet files with different but compatible schemas, but it needs to read metadata from all Parquet part-files. This can be problematic when reading Parquet files with lots of part-files, especially when the user doesn't need schema merging. This issue is tracked by SPARK-6575, and here is a PR for it: https://github.com/apache/spark/pull/5231. This PR adds a configuration to disable schema merging by default when doing Hive metastore Parquet table conversion. Another workaround is to fall back to the old Parquet code by setting spark.sql.parquet.useDataSourceApi to false. Cheng

On 3/31/15 2:47 PM, Zheng, Xudong wrote:
Hi all, We are using Parquet Hive tables, and we are upgrading to Spark 1.3. We find that a simple COUNT(*) query is much slower (100x) than on Spark 1.2. Most of the time is spent on the driver getting HDFS blocks.
I find a large number of the logs below printed:

    15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
    15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
      fileLength=77153436
      underConstruction=false
      blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
      lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275; getBlockSize()=77153436; corrupt=false; offset=0; locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
      isLastBlockComplete=true}
    15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010

I compared the printed log with Spark 1.2: although the number of getBlockLocations calls is similar, each such operation only cost 20~30 ms there (vs. 2000ms~3000ms now), and it didn't print the detailed LocatedBlocks info. Another finding is, if I read the Parquet file via Scala code from spark-shell as below, it looks fine; the computation returns the result as quickly as before.
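For reference, a one-line sketch of the fallback mentioned earlier in this thread:

    // Fall back to the pre-1.3 Parquet code path while the linked fixes land:
    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")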
Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError
Filed https://issues.apache.org/jira/browse/SPARK-6708 to track this. Cheng

On 4/4/15 10:21 PM, Cheng Lian wrote:
I think this is a bug of Spark SQL dating back to at least 1.1.0. [...] Thanks for reporting this! Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:
I have a feeling I'm missing a jar that provides the support [...]
Re: conversion from java collection type to scala JavaRDD<Object>
Without the rest of your code, it's hard to know what might be unserializable. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 7:56 AM, Jeetendra Gangele gangele...@gmail.com wrote:
Hi, I have tried with parallelize but I got the below exception: java.io.NotSerializableException: pacific.dr.VendorRecord [...]
Re: Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure
You need to refresh the external table manually after updating the data source outside Spark SQL:

- via the Scala API: sqlContext.refreshTable("table1")
- via SQL: REFRESH TABLE table1;

Cheng

On 4/4/15 5:24 PM, Rex Xiong wrote:
Hi Spark Users, I'm testing the new 1.3 feature of Parquet partition discovery. [...] How can I disable this partition cache or force a refresh of the cache? Thanks
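Putting that together with the session from the question, a short sketch:

    // After adding /data/table1/key=3 outside of Spark SQL:
    sqlContext.refreshTable("table1")
    sqlContext.table("table1").count() // now 2400, picking up the new partition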
DataFrame groupBy MapType
Hello, I have a case class like this:

    case class A(
      m: Map[Long, Long],
      ...
    )

and constructed a DataFrame from Seq[A]. I would like to perform a groupBy on A.m(SomeKey). I can implement a UDF, create a new Column, then invoke groupBy on the new Column - but is that the idiomatic way of doing such an operation? I can't find much info about operating on MapType Columns in the docs. Thanks ahead! Justin
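For what it's worth, a minimal sketch of the UDF approach described above (Spark 1.3 DataFrame API; someKey and the default value are placeholders, and the UDF parameter is typed as scala.collection.Map, which is what Spark passes to UDFs for MapType columns):

    import org.apache.spark.sql.functions.udf

    val someKey = 42L // placeholder key
    // Extract m(someKey) into a new Column, then group on it:
    val valueAt = udf((m: scala.collection.Map[Long, Long]) => m.getOrElse(someKey, 0L))
    val grouped = df.groupBy(valueAt(df("m"))).count()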
Spark Streaming program questions
I have two questions:

1) In a Spark Streaming program, after the various DStream transformations have been set up, the ssc.start() method is called to start the computation. Can the underlying DAG change (i.e. add another map or maybe a join) after ssc.start() has been called (and maybe messages have already been received/processed for some batches)?

2) In a Spark Streaming program (one process), can I have multiple DStream transformations, each series belonging to its own StreamingContext (in the same thread or in different threads)? For example:

    val lines_A = ssc_A.socketTextStream(..)
    val words_A = lines_A.flatMap(_.split(" "))
    val wordCounts_A = words_A.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts_A.print()

    val lines_B = ssc_B.socketTextStream(..)
    val words_B = lines_B.flatMap(_.split(" "))
    val wordCounts_B = words_B.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts_B.print()

    ssc_A.start()
    ssc_B.start()

I think the answer is NO to both questions, but I am wondering what the reason for this behavior is. Thanks, Nickos
Re: Spark SQL Self join with aggregate
I am not sure whether this is possible, but I have tried something like

    SELECT time, src, dst, sum(val1), sum(val2) FROM table GROUP BY src, dst;

and it works. I think it will give the same answer as you are expecting.
Re: Spark + Kinesis
Hi all, More good news! I was able to use mergeStrategy to assemble my Kinesis consumer into an uber jar. Here's what I added to build.sbt:

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
      {
        case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
        case PathList("com", "google", "common", "base", xs @ _*) => MergeStrategy.first
        case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.last
        case PathList("org", "apache", "hadoop", xs @ _*) => MergeStrategy.first
        case PathList("org", "apache", "spark", "unused", xs @ _*) => MergeStrategy.first
        case x => old(x)
      }
    }

Everything appears to be working fine. Right now my producer is pushing simple strings through Kinesis, which my consumer is trying to print (using Spark's print() method for now). However, instead of displaying my strings, I get the following:

    15/04/04 18:57:32 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer(1428173848000 ms)

Any idea what might be going on? Thanks, Vadim

Here's my consumer code (adapted from the WordCount example):

    private object MyConsumer extends Logging {

      def main(args: Array[String]) {
        /* Check that all required args were passed in. */
        if (args.length < 2) {
          System.err.println(
            """
              |Usage: KinesisWordCount <stream-name> <endpoint-url>
              |    <stream-name> is the name of the Kinesis stream
              |    <endpoint-url> is the endpoint of the Kinesis service
              |        (e.g. https://kinesis.us-east-1.amazonaws.com)
            """.stripMargin)
          System.exit(1)
        }

        /* Populate the appropriate variables from the given args */
        val Array(streamName, endpointUrl) = args

        /* Determine the number of shards from the stream */
        val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
        kinesisClient.setEndpoint(endpointUrl)
        val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()
        System.out.println("Num shards: " + numShards)

        /* In this example, we're going to create 1 Kinesis Worker/Receiver/DStream for each shard. */
        val numStreams = numShards

        /* Setup the SparkConfig and StreamingContext */
        /* Spark Streaming batch interval */
        val batchInterval = Milliseconds(2000)
        val sparkConfig = new SparkConf().setAppName("MyConsumer")
        val ssc = new StreamingContext(sparkConfig, batchInterval)

        /* Kinesis checkpoint interval. Same as batchInterval for this example. */
        val kinesisCheckpointInterval = batchInterval

        /* Create the same number of Kinesis DStreams/Receivers as the Kinesis stream's shards */
        val kinesisStreams = (0 until numStreams).map { i =>
          KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
            InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
        }

        /* Union all the streams */
        val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
        unionStreams.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:
Just remove "provided" for spark-streaming-kinesis-asl:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote:
Thanks. So how do I fix it?

On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on your cluster, so you cannot have it be just a "provided" dependency. This is also why the KCL and its dependencies were not included in the assembly (but yes, they should be).
~ Jonathan Kelly

From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Date: Friday, April 3, 2015 at 12:26 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org
Subject: Re: Spark + Kinesis

Hi all, Good news! I was able to create a Kinesis consumer and assemble it into an uber jar following http://spark.apache.org/docs/latest/streaming-kinesis-integration.html and the example https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala. However, when I try to spark-submit it I get the following exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

Do I need to include the KCL dependency in build.sbt? Here's what it looks like currently:

    import AssemblyKeys._

    name := "Kinesis Consumer"
    version := "1.0"
    organization := "com.myconsumer"
    scalaVersion := "2.11.5"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
    libraryDependencies +=
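For completeness, a sketch of submitting the assembled jar (the class name, jar path, and stream/endpoint arguments are placeholders):

    spark-submit --class MyConsumer \
      target/scala-2.11/kinesis-consumer-assembly-1.0.jar \
      mystream https://kinesis.us-east-1.amazonaws.com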
Processing Time Spikes (Spark Streaming)
Hi all, I am running some benchmarks on a simple Spark application which consists of:

- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.

The processing time per batch gets slower as time passes and the number of states increases; that is expected. However, we also notice spikes occurring at rather regular intervals. What could cause those spikes? We first suspected the GC, but the logs/metrics don't seem to show any significant GC-related delays. Could this be related to checkpointing? Disk access latencies? I've attached a graph so you can visualize the problem (please ignore the first spike, which corresponds to system initialization): http://apache-spark-user-list.1001560.n3.nabble.com/file/n22375/Processing_Delay-page-001.jpg

Thanks!
UNRESOLVED DEPENDENCIES while building Spark 1.3.0
Hi All, I am trying to build Spark 1.3.0 on a standalone Ubuntu 14.04 machine. I am using the sbt/sbt assembly command to build it. This command works fine with Spark 1.1.0, but for Spark 1.3.0 it gives the following error. Any help or suggestions to resolve this problem would be highly appreciated.

    [info] Resolving org.fusesource.jansi#jansi;1.4 ...
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] ::          UNRESOLVED DEPENDENCIES         ::
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn] :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
    [warn] ::::::::::::::::::::::::::::::::::::::::::::::
    [warn]
    [warn] Note: Unresolved dependencies path:
    [warn]     org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
    [warn]       +- org.apache.spark:spark-network-shuffle_2.10:1.3.0
    sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
        at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
        at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
        at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
        at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
        at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
        at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
        at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
        at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
        at xsbt.boot.Using$.withResource(Using.scala:10)
        at xsbt.boot.Using$.apply(Using.scala:9)
        at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
        at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
        at xsbt.boot.Locks$.apply0(Locks.scala:31)
        at xsbt.boot.Locks$.apply(Locks.scala:28)
        at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
        at sbt.IvySbt.withIvy(Ivy.scala:123)
        at sbt.IvySbt.withIvy(Ivy.scala:120)
        at sbt.IvySbt$Module.withModule(Ivy.scala:151)
        at sbt.IvyActions$.updateEither(IvyActions.scala:157)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1318)
        at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1315)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1345)
        at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1343)
        at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
        at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
        at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
        at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
        at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
        at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
        at
Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0
Use the Maven build instead. From the README in the git repo (https://github.com/apache/spark):

    mvn -DskipTests clean package

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 4:39 PM, mas mas.ha...@gmail.com wrote:
Hi All, I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the sbt command, i.e. sbt/sbt assembly, to build it. This command works fine with Spark 1.1 but gives the following error with Spark 1.3.0:

    sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test [...]
CPU Usage for Spark Local Mode
Hi, I am currently testing my application with Spark in local mode, and I set the master to local[4]. One thing I notice is that when a groupBy/reduceBy operation is involved, the CPU usage can sometimes reach 600% to 800%. I am wondering whether this is expected? (As only 4 worker threads are assigned, together with the driver thread, shouldn't it be at most 500%?) Best, Wenlei