Re: spark mesos deployment : starting workers based on attributes

2015-04-04 Thread Ankur Chauhan

Hi,

Created issue: https://issues.apache.org/jira/browse/SPARK-6707
I would really appreciate ideas/views/opinions on this feature.

-- Ankur Chauhan

On 03/04/2015 13:23, Tim Chen wrote:
 Hi Ankur,
 
 There isn't a way to do that yet, but it's simple to add.
 
 Can you create a JIRA in Spark for this?
 
 Thanks!
 
 Tim
 
 On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan
 achau...@brightcove.com wrote:
 
 Hi,
 
 I am trying to figure out if there is a way to tell the mesos 
 scheduler in spark to isolate the workers to a set of mesos slaves 
 that have a given attribute such as `tachyon:true`.
 
 Does anyone know if that is possible, or how I could achieve such
 behavior?
 
 Thanks! -- Ankur Chauhan
 
 
 

--
Ankur Chauhan




Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-04 Thread Jeetendra Gangele
Hi, I have tried with parallelize but I got the below exception:

java.io.NotSerializableException: pacific.dr.VendorRecord

Here is my code:

List<VendorRecord> vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);


On 2 April 2015 at 21:11, Dean Wampler deanwamp...@gmail.com wrote:

 Use JavaSparkContext.parallelize.


 http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All
 Is there a way to make a JavaRDD<Object> from an existing java collection
 type List<Object>?
 I know this can be done using scala, but I am looking for how to do this
 using java.


 Regards
 Jeetendra





Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when the data is on disk in both cases?







Re: Migrating from Spark 0.8.0 to Spark 1.3.0

2015-04-04 Thread Nick Pentreath
It shouldn't be too bad - pertinent changes and migration notes are here: 
http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark
 for pre-1.0 and here: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13
 for SparkSQL pre-1.3




Since you aren't using SparkSQL the 2nd link is probably not useful. 




Generally you should find very few changes in the core API but things like 
MLlib would have changed a fair bit - though again the API should have been 
relatively stable.




Your biggest change is probably going to be running jobs through spark-submit 
rather than spark-class etc: 
http://spark.apache.org/docs/latest/submitting-applications.html










—
Sent from Mailbox

On Sat, Apr 4, 2015 at 1:11 AM, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:

 Hi,
 Are there any tutorials that explain all the changes between Spark
 0.8.0 and Spark 1.3.0, and how we can approach this migration?

Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Rex Xiong
Hi Spark Users,

I'm testing 1.3 new feature of parquet partition discovery.
I have 2 sub folders, each has 800 rows.
/data/table1/key=1
/data/table1/key=2

In spark-shell, run this command:

val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1",
"parquet")

t.count


It shows 1600 successfully.

But after that, I add a new folder /data/table1/key=3, then run t.count
again, it still gives me 1600, not 2400.


I try to restart spark-shell, then run

val t = sqlContext.table("table1")

t.count


It's 2400 now.


I'm wondering whether there is a partition cache in the driver. I tried
setting spark.sql.parquet.cacheMetadata to false and tested again;
unfortunately it doesn't help.


How can I disable this partition cache or force refresh the cache?


Thanks


Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce spark.sql.shuffle.partitions from the default of 200 to the total
number of cores.
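
For example, a minimal sketch (assuming a Spark 1.3 SQLContext named sqlContext
and an 8-core box):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// or, equivalently, via SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=8")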







Re: Spark Vs MR

2015-04-04 Thread Sean Owen
If data is on HDFS, it is not read any more or less quickly by either
framework. Both are in fact using the same logic to exploit locality,
and read and deserialize data anyway. I don't think this is what
anyone claims though.

Spark can be faster in a multi-stage operation, which would require
several MRs. The MRs must hit disk again after the reducer whereas
Spark might not, possibly by persisting outputs in memory. A similar
but larger speedup can be had for iterative computations that access
the same data in memory; caching it means reading it from disk once,
but then re-reading from memory only.
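
As a concrete sketch (the HDFS path is hypothetical), caching makes the disk
read happen once, while later passes over the same data hit memory:

val data = sc.textFile("hdfs:///data/events").cache()
data.count()  // first pass reads from HDFS and populates the cache
data.count()  // later passes read the cached partitions from memory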

For a single operation that really is a map and a reduce, starting and
ending on HDFS, I would expect MR to be a bit faster just because it
is so optimized for this one pattern. Even that depends a lot, and
wouldn't be significant.


On Sat, Apr 4, 2015 at 11:19 AM, SamyaMaiti samya.maiti2...@gmail.com wrote:
 How is Spark faster than MR when the data is on disk in both cases?









Need help with ALS Recommendation code

2015-04-04 Thread Phani Yadavilli -X (pyadavil)
Hi ,

I am trying to run the following command in the Movie Recommendation example 
provided by the ampcamp tutorial

Command:   sbt package run /movielens/medium

Exception: sbt.TrapExitSecurityException thrown from the 
UncaughtExceptionHandler in thread run-main-0
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 0 s, completed Apr 4, 2015 12:00:18 PM

I am unable to identify the cause of the error. Can someone help me with this?

Regards
Phani Kumar


Re: newAPIHadoopRDD multiple scan results returned from HBase

2015-04-04 Thread Jeetendra Gangele
Here is my conf object, passed as the first parameter of the API.
But here I want to pass multiple scans: I have 4 criteria for STARTROW
and STOPROW in the same table.
Using the code below I can only get results for one STARTROW/STOPROW range.

Configuration conf = DBConfiguration.getConf();

// int scannerTimeout = (int) conf.getLong(
//     HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
// System.out.println("lease timeout on server is " + scannerTimeout);

int scannerTimeout = (int) conf.getLong(
    "hbase.client.scanner.timeout.period", -1);
// conf.setLong("hbase.client.scanner.timeout.period", 6L);
conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME);
Scan scan = new Scan();
scan.addFamily(FAMILY);
FilterList filterList = new FilterList(Operator.MUST_PASS_ALL);
filterList.addFilter(new KeyOnlyFilter());
filterList.addFilter(new FirstKeyOnlyFilter());
scan.setFilter(filterList);

scan.setCacheBlocks(false);
scan.setCaching(10);
scan.setBatch(1000);
scan.setSmall(false);
conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan));
return conf;
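
One common workaround (a minimal, untested Scala sketch of the idea; the same
pattern applies with JavaSparkContext) is to build one scan per start/stop row
range, call newAPIHadoopRDD once per scan, and union the resulting RDDs. Here
rowRanges is a hypothetical Seq of (startRow, stopRow) byte arrays, sc is a
SparkContext, and DBConfiguration, DatabaseUtils, TABLE_NAME and FAMILY are the
same helpers as in the code above:

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val rdds = rowRanges.map { case (startRow, stopRow) =>
  val scan = new Scan(startRow, stopRow)          // one range per scan
  scan.addFamily(FAMILY)
  val conf = DBConfiguration.getConf()
  conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
  conf.set(TableInputFormat.SCAN, DatabaseUtils.convertScanToString(scan))
  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
}
val allMatches = sc.union(rdds)                   // combine the per-range results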

On 4 April 2015 at 20:54, Jeetendra Gangele gangele...@gmail.com wrote:

 Hi All,

 Can we get the result of the multiple scan
 from JavaSparkContext.newAPIHadoopRDD from Hbase.

 This method first parameter take configuration object where I have added
 filter. but how Can I query multiple scan from same table calling this API
 only once?

 regards
 jeetendra



Re: Parquet timestamp support for Hive?

2015-04-04 Thread Cheng Lian
Avoiding maintaining a separate Hive version is one of the initial 
purposes of Spark SQL. (We had once done this for Shark.) The 
org.spark-project.hive:hive-0.13.1a artifact only cleans up some 3rd-party 
dependencies to avoid dependency hell in Spark. This artifact is exactly 
the same as Hive 0.13.1 at the source level.


On the other hand, we're planning to add a Hive metastore adapter layer 
to Spark SQL so that in the future we can talk to arbitrary versions 
greater than or equal to 0.13.1 of Hive metastore, and then always stick 
to the most recent Hive versions to provide the most recent Hive 
features. This will probably happen in Spark 1.4 or 1.5.


Cheng

On 4/3/15 7:59 PM, Rex Xiong wrote:

Hi,

I got this error when creating a hive table from parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.UnsupportedOperationException: Parquet does not support 
timestamp. See HIVE-6384


I checked HIVE-6384; it's fixed in 0.14.
The Hive in the Spark build is a customized version 0.13.1a 
(GroupId: org.spark-project.hive). Is it possible to get the source 
code for it and apply the patch from HIVE-6384?


Thanks






Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian

I think this is a bug in Spark SQL that dates back to at least 1.1.0.

The json_tuple function is implemented as 
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The 
ClassNotFoundException should complain with the class name rather than 
the UDTF function name.


The problematic line should be this one 
https://github.com/apache/spark/blob/9b40c17ab161b64933539abeefde443cb4f98673/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala#L1288. 
HiveFunctionWrapper expects the fully qualified class name of the UDTF 
class that implements the function, but we pass in the function name.


Thanks for reporting this!

Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:


I have a feeling I'm missing a Jar that provides the support, or it could 
be related to 
https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar, where 
would I find it? I would have thought in the $HIVE/lib folder, but I'm 
not sure which jar contains it.


Error:

|Create  MetricTemporary  Table  for  querying
15/04/01  14:41:44  INFO HiveMetaStore:0: Opening raw storewith  implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/01  14:41:44  INFO ObjectStore: ObjectStore, initialize called
15/04/01  14:41:45  INFO Persistence: Property 
hive.metastore.integral.jdo.pushdownunknown  - will be ignored
15/04/01  14:41:45  INFO Persistence: Property datanucleus.cache.level2unknown  
- will be ignored
15/04/01  14:41:45  INFO BlockManager: Removing broadcast0
15/04/01  14:41:45  INFO BlockManager: Removing block broadcast_0
15/04/01  14:41:45  INFO MemoryStore: Block broadcast_0of  size  1272  
droppedfrom  memory (free278018571)
15/04/01  14:41:45  INFO BlockManager: Removing block broadcast_0_piece0
15/04/01  14:41:45  INFO MemoryStore: Block broadcast_0_piece0of  size  869  
droppedfrom  memory (free278019440)
15/04/01  14:41:45  INFO BlockManagerInfo: Removed broadcast_0_piece0on  
192.168.1.5:63230  in  memory (size:869.0  B, free:265.1  MB)
15/04/01  14:41:45  INFO BlockManagerMaster: Updated infoof  block 
broadcast_0_piece0
15/04/01  14:41:45  INFO BlockManagerInfo: Removed broadcast_0_piece0on  
192.168.1.5:63278  in  memory (size:869.0  B, free:530.0  MB)
15/04/01  14:41:45  INFO ContextCleaner: Cleaned broadcast0
15/04/01  14:41:46  INFO ObjectStore: Setting MetaStore object pin classeswith  
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/01  14:41:46  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MFieldSchema  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:46  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MOrder  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MFieldSchema  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MOrder  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Query: Readingin  resultsfor  
queryorg.datanucleus.store.rdbms.query.SQLQuery@0  since theconnection  
usedis  closing
15/04/01  14:41:47  INFO ObjectStore: Initialized ObjectStore
15/04/01  14:41:47  INFO HiveMetaStore: Added admin rolein  metastore
15/04/01  14:41:47  INFO HiveMetaStore: Addedpublic  rolein  metastore
15/04/01  14:41:48  INFO HiveMetaStore:No  user  is  addedin  admin role, since 
configis  empty
15/04/01  14:41:48  INFO SessionState:No  Tezsession  requiredat  this point. 
hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value,
v1.peValue, v1.peName
  FROM metric
  lateral view json_tuple(pathElements, 'name', 'value') v1
    as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
 at  java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at  java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at  java.security.AccessController.doPrivileged(Native Method)
 at  java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at  java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at  java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at  
org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
 at  org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:272)
 at  

Re: Spark Streaming FileStream Nested File Support

2015-04-04 Thread Akhil Das
We have a custom version/build of Spark Streaming that does the nested S3
lookups faster (it uses native S3 APIs). You can find the source code here:
https://github.com/sigmoidanalytics/spark-modified, In particular the
changes from here
https://github.com/sigmoidanalytics/spark-modified/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206.
And the binary jars here :
https://github.com/sigmoidanalytics/spark-modified/tree/master/lib

Here's the instructions to use it:

This is how you create your stream:

val lines = ssc.s3FileStream[LongWritable, Text, TextInputFormat]("bucketname/")


You need ACCESS_KEY and SECRET_KEY in the environment for this to work.
Also, by default it is recursive.

Also you need these jars
https://github.com/sigmoidanalytics/spark-modified/tree/master/lib in the
SPARK_CLASSPATH:


aws-java-sdk-1.8.3.jar
aws-java-sdk-1.9.24.jar
aws-java-sdk-core-1.9.24.jar
aws-java-sdk-s3-1.9.24.jar
httpclient-4.2.5.jar
httpcore-4.3.2.jar
joda-time-2.6.jar
spark-streaming_2.10-1.2.0.jar



Let me know if you need any more clarification/information on this, feel
free to suggest changes.




Thanks
Best Regards

On Sat, Apr 4, 2015 at 3:30 AM, Tathagata Das t...@databricks.com wrote:

 Yes, it definitely can be added; I just haven't gotten around to doing it :)
 There are proposals for this that you can try -
 https://github.com/apache/spark/pull/2765/files . You may want to review it
 at some point.

 On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote:

 That doesn't seem like a good solution, unfortunately, as I need this to
 work in a production environment. Do you know why the limitation exists for
 FileInputDStream in the first place? Unless I'm missing something important
 about how some of the internals work, I don't see why this feature couldn't
 be added at some point.

 On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
 wrote:

 A sort-of-hacky workaround is to use a queueStream, where you can manually
 create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
 that this is for testing only, as queueStream does not work with driver
 fault recovery.
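
 A minimal sketch of that workaround (untested; it assumes an existing
 SparkContext named sc and uses textFile in place of hadoopFile for brevity):

 import scala.collection.mutable
 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val ssc = new StreamingContext(sc, Seconds(10))
 val rddQueue = new mutable.Queue[RDD[String]]()
 val stream = ssc.queueStream(rddQueue)
 stream.print()
 ssc.start()
 // Manually create RDDs from the nested folders and push them into the queue.
 rddQueue += sc.textFile("s3n://mybucket/2015/03/02/*log")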

 TD

 On Fri, Apr 3, 2015 at 12:23 PM, adamgerst adamge...@gmail.com wrote:

 So after pulling my hair out for a bit trying to convert one of my
 standard
 spark jobs to streaming I found that FileInputDStream does not support
 nested folders (see the brief mention here

 http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
 the fileStream method returns a FileInputDStream).  So before, for my
 standard job, I was reading from say

 s3n://mybucket/2015/03/02/*log

 And could also modify it to simply get an entire month's worth of logs.
 Since the logs are split up based upon their date, when the batch ran
 for
 the day, I simply passed in a parameter of the date to make sure I was
 reading the correct data

 But since I want to turn this job into a streaming job I need to simply
 do
 something like

 s3n://mybucket/*log

 This would totally work fine if it were a standard spark application,
 but
 fails for streaming.  Is there anyway I can get around this limitation?











newAPIHadoopRDD multiple scan results returned from HBase

2015-04-04 Thread Jeetendra Gangele
Hi All,

Can we get the results of multiple scans
from JavaSparkContext.newAPIHadoopRDD from HBase?

This method's first parameter takes a configuration object to which I have added
a filter. But how can I query multiple scans from the same table while calling this API
only once?

regards
jeetendra


Re: Parquet Hive table become very slow on 1.3?

2015-04-04 Thread Cheng Lian

Hey Xudong,

We have been digging into this issue for a while, and believe PR 5339 
(https://github.com/apache/spark/pull/5339) and PR 5334 
(https://github.com/apache/spark/pull/5334) should fix this issue.


There are two problems:

1. Normally we cache Parquet table metadata for better performance, but 
when converting Hive metastore Parquet tables, the cache is not used. Thus 
heavy operations like schema discovery are done every time a metastore 
Parquet table is converted.
2. With Parquet task side metadata reading (which is turned on by 
default), we can actually skip the row group information in the footer. 
However, we accidentally called a Parquet function which doesn't skip 
row group information.


For your question about schema merging: Parquet allows different 
part-files to have different but compatible schemas. For example, 
part-1.parquet has columns a and b, while part-2.parquet may have 
columns a and c. In some cases, the summary files (_metadata and 
_common_metadata) contain the merged schema (a, b, and c), but that's not 
guaranteed. For example, when the user-defined metadata stored in different 
part-files contains different values for the same key, Parquet simply 
gives up writing summary files. That's why all part-files must be 
touched to get a precise merged schema.
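
As a small illustration, a spark-shell sketch under hypothetical /tmp paths
(assuming Spark 1.3, where parquetFile accepts multiple paths): two part-folders
with different but compatible schemas are merged when read together.

import sqlContext.implicits._
sc.parallelize(Seq((1, "x"))).toDF("a", "b").saveAsParquetFile("/tmp/t/part1")
sc.parallelize(Seq((2, 3.0))).toDF("a", "c").saveAsParquetFile("/tmp/t/part2")
// Reading both folders together yields the merged schema (a, b, c).
sqlContext.parquetFile("/tmp/t/part1", "/tmp/t/part2").printSchema()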


However, in scenarios where a centralized arbitrative schema is 
available (e.g. Hive metastore schema, or the schema provided by user 
via data source DDL), we don't need to do schema merging on driver side, 
but defer it to executor side and each task only needs to reconcile 
those part-files it needs to touch. This is also what the Parquet 
developers did recently for parquet-hadoop 
https://github.com/apache/incubator-parquet-mr/pull/45.


Cheng

On 3/31/15 11:49 PM, Zheng, Xudong wrote:

Thanks Cheng!

Setting 'spark.sql.parquet.useDataSourceApi' to false resolves my issues, 
but the PR 5231 does not seem to. Not sure what else I did wrong ...


BTW, actually, we are very interested in the schema merging feature in 
Spark 1.3, so both of these solutions will disable this feature, 
right? It seems that the Parquet metadata is stored in a file named 
_metadata in the Parquet file folder (each folder is a partition as we 
use a partitioned table), so why do we need to scan all Parquet part-files? 
Is there any other solution that could keep the schema merging feature at 
the same time? We really like this feature :)


On Tue, Mar 31, 2015 at 3:19 PM, Cheng Lian lian.cs@gmail.com wrote:


Hi Xudong,

This is probably because Parquet schema merging is turned on by
default. This is generally useful for Parquet files with different
but compatible schemas. But it needs to read metadata from all
Parquet part-files. This can be problematic when reading Parquet
files with lots of part-files, especially when the user doesn't
need schema merging.

This issue is tracked by SPARK-6575, and here is a PR for it:
https://github.com/apache/spark/pull/5231. This PR adds a
configuration to disable schema merging by default when doing Hive
metastore Parquet table conversion.

Another workaround is to fallback to the old Parquet code by
setting spark.sql.parquet.useDataSourceApi to false.
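
For example, from spark-shell that fallback is just (a minimal sketch):

sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")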

Cheng


On 3/31/15 2:47 PM, Zheng, Xudong wrote:

Hi all,

We are using a Parquet Hive table, and we are upgrading to Spark
1.3. But we find that just a simple COUNT(*) query is much
slower (100x) than on Spark 1.2.

I find that most of the time is spent on the driver getting HDFS blocks. I
see a large number of logs like the below printed:

15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
   fileLength=77153436
   underConstruction=false
   blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
getBlockSize()=77153436; corrupt=false; offset=0;
locs=[10.152.116.172:50010, 10.152.116.169:50010, 10.153.125.184:50010]}]
   lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
getBlockSize()=77153436; corrupt=false; offset=0;
locs=[10.152.116.169:50010, 10.153.125.184:50010, 10.152.116.172:50010]}
   isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010

I compared the printed logs with Spark 1.2: although the number of
getBlockLocations calls is similar, each such operation only
cost 20~30 ms (versus 2000~3000 ms now), and it didn't print
the detailed LocatedBlocks info.

Another finding is, if I read the Parquet file via Scala code
from spark-shell as below, it looks fine; the computation
returns the result as quickly as before.

   

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-04 Thread Cheng Lian

Filed https://issues.apache.org/jira/browse/SPARK-6708 to track this.

Cheng

On 4/4/15 10:21 PM, Cheng Lian wrote:

I think this is a bug of Spark SQL dates back to at least 1.1.0.

The json_tuple function is implemented as 
org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple. The 
ClassNotFoundException should complain with the class name rather than 
the UDTF function name.


The problematic line should be this one 
https://github.com/apache/spark/blob/9b40c17ab161b64933539abeefde443cb4f98673/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala#L1288. 
HiveFunctionWrapper expects the full qualified class name of the UDTF 
class that implements the function, but we pass in the function name.


Thanks for reporting this!

Cheng

On 4/2/15 3:19 AM, Todd Nist wrote:


I have a feeling I’m missing a Jar that provides the support or could 
this may be related to 
https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar 
where would I find that ? I would have thought in the $HIVE/lib 
folder, but not sure which jar contains it.


Error:

|Create  MetricTemporary  Table  for  querying
15/04/01  14:41:44  INFO HiveMetaStore:0: Opening raw storewith  implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/01  14:41:44  INFO ObjectStore: ObjectStore, initialize called
15/04/01  14:41:45  INFO Persistence: Property 
hive.metastore.integral.jdo.pushdownunknown  - will be ignored
15/04/01  14:41:45  INFO Persistence: Property datanucleus.cache.level2unknown  
- will be ignored
15/04/01  14:41:45  INFO BlockManager: Removing broadcast0
15/04/01  14:41:45  INFO BlockManager: Removing block broadcast_0
15/04/01  14:41:45  INFO MemoryStore: Block broadcast_0of  size  1272  
droppedfrom  memory (free278018571)
15/04/01  14:41:45  INFO BlockManager: Removing block broadcast_0_piece0
15/04/01  14:41:45  INFO MemoryStore: Block broadcast_0_piece0of  size  869  
droppedfrom  memory (free278019440)
15/04/01  14:41:45  INFO BlockManagerInfo: Removed broadcast_0_piece0on  
192.168.1.5:63230  in  memory (size:869.0  B, free:265.1  MB)
15/04/01  14:41:45  INFO BlockManagerMaster: Updated infoof  block 
broadcast_0_piece0
15/04/01  14:41:45  INFO BlockManagerInfo: Removed broadcast_0_piece0on  
192.168.1.5:63278  in  memory (size:869.0  B, free:530.0  MB)
15/04/01  14:41:45  INFO ContextCleaner: Cleaned broadcast0
15/04/01  14:41:46  INFO ObjectStore: Setting MetaStore object pin classeswith  
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/01  14:41:46  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MFieldSchema  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:46  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MOrder  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MFieldSchema  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Datastore: The 
classorg.apache.hadoop.hive.metastore.model.MOrder  is  taggedas  
embedded-only  so doesnot  have its own datastoretable.
15/04/01  14:41:47  INFO Query: Readingin  resultsfor  
queryorg.datanucleus.store.rdbms.query.SQLQuery@0  since theconnection  
usedis  closing
15/04/01  14:41:47  INFO ObjectStore: Initialized ObjectStore
15/04/01  14:41:47  INFO HiveMetaStore: Added admin rolein  metastore
15/04/01  14:41:47  INFO HiveMetaStore: Addedpublic  rolein  metastore
15/04/01  14:41:48  INFO HiveMetaStore:No  user  is  addedin  admin role, since 
configis  empty
15/04/01  14:41:48  INFO SessionState:No  Tezsession  requiredat  this point. 
hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value,
v1.peValue, v1.peName
  FROM metric
  lateral view json_tuple(pathElements, 'name', 'value') v1
    as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
 at  java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at  java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at  java.security.AccessController.doPrivileged(Native Method)
 at  java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at  java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at  java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at  
org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
 at  org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
 at  
org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
 at  

Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-04 Thread Dean Wampler
Without the rest of your code, it's hard to know what might be
unserializable.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 7:56 AM, Jeetendra Gangele gangele...@gmail.com
wrote:


 Hi, I have tried with parallelize but I got the below exception:

 java.io.NotSerializableException: pacific.dr.VendorRecord

 Here is my code:

 List<VendorRecord> vendorRecords = blockingKeys.getMatchingRecordsWithscan(matchKeysOutput);
 JavaRDD<VendorRecord> lines = sc.parallelize(vendorRecords);


 On 2 April 2015 at 21:11, Dean Wampler deanwamp...@gmail.com wrote:

 Use JavaSparkContext.parallelize.


 http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi All
 Is there a way to make a JavaRDD<Object> from an existing java
 collection type List<Object>?
 I know this can be done using scala , but i am looking how to do this
 using java.


 Regards
 Jeetendra








Re: Issue of sqlContext.createExternalTable with parquet partition discovery after changing folder structure

2015-04-04 Thread Cheng Lian
You need to refresh the external table manually after updating the data 
source outside Spark SQL:


- via Scala API: sqlContext.refreshTable("table1")
- via SQL: REFRESH TABLE table1;
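
For example, in spark-shell the full flow from the original message would look
roughly like this (a minimal sketch reusing the paths above):

val t = sqlContext.createExternalTable("table1", "hdfs:///data/table1", "parquet")
t.count                             // 1600
// ... /data/table1/key=3 is added outside Spark SQL ...
sqlContext.refreshTable("table1")
sqlContext.table("table1").count    // should now reflect the new partition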

Cheng

On 4/4/15 5:24 PM, Rex Xiong wrote:

Hi Spark Users,

I'm testing 1.3 new feature of parquet partition discovery.
I have 2 sub folders, each has 800 rows.
/data/table1/key=1
/data/table1/key=2

In spark-shell, run this command:

val t = sqlContext.createExternalTable("table1", 
"hdfs:///data/table1", "parquet")


t.count


It shows 1600 successfully.

But after that, I add a new folder /data/table1/key=3, then run 
t.count again, it still gives me 1600, not 2400.



I try to restart spark-shell, then run

val t = sqlContext.table("table1")

t.count


It's 2400 now.


I'm wondering there should be a partition cache in driver, I try to 
set spark.sql.parquet.cacheMetadata to false and test it 
again, unfortunately it doesn't help.



How can I disable this partition cache or force refresh the cache?


Thanks





DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
Hello,

I have a case class like this:

case class A(
  m: Map[Long, Long],
  ...
)

and constructed a DataFrame from Seq[A].

I would like to perform a groupBy on A.m(SomeKey). I can implement a UDF,
create a new Column, then invoke a groupBy on the new Column. But is that the
idiomatic way of doing such an operation?

Can't find much info about operating MapType on Column in the doc.
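
For what it's worth, here is a minimal, untested sketch of the UDF approach
described above (someKey is a placeholder, and df is the DataFrame built from
Seq[A]):

import org.apache.spark.sql.functions.udf

// Pull the value for one key out of the map column, then group on it.
val valueAt = udf((m: Map[Long, Long]) => m.getOrElse(someKey, 0L))
df.groupBy(valueAt(df("m"))).count()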

Thanks ahead!

Justin


Spark Streaming program questions

2015-04-04 Thread nickos168
I have two questions:

1) In a Spark Streaming program, after the various DStream transformations have 
been set up, 
the ssc.start() method is called to start the computation.

Can the underlying DAG change (i.e. add another map or maybe a join) after 
ssc.start() has been 
called (and maybe messages have already been received/processed for some 
batches)?


2) In a Spark Streaming program (one process), can I have multiple DStream 
transformation chains, 
each belonging to its own StreamingContext (in the same thread or in 
different threads)?

For example:
 val lines_A = ssc_A.socketTextStream(..)
 val words_A = lines_A.flatMap(_.split(" "))
 val wordCounts_A = words_A.map(x => (x, 1)).reduceByKey(_ + _)
 wordCounts_A.print()

 val lines_B = ssc_B.socketTextStream(..)
 val words_B = lines_B.flatMap(_.split(" "))
 val wordCounts_B = words_B.map(x => (x, 1)).reduceByKey(_ + _)
 wordCounts_B.print()

 ssc_A.start()
 ssc_B.start()

I think the answer is NO to both questions, but I am wondering what the 
reason for this behavior is.


Thanks,

Nickos



Re: Spark Streaming program questions

2015-04-04 Thread Aj K
UNSUBSCRIBE

On Sun, Apr 5, 2015 at 6:43 AM, nickos168 nickos...@yahoo.com.invalid
wrote:

 I have two questions:

 1) In a Spark Streaming program, after the various DStream transformations
 have being setup,
 the ssc.start() method is called to start the computation.

 Can the underlying DAG change (ie. add another map or maybe a join) after
 ssc.start() has been
 called (and maybe messages have already been received/processed for some
 batches)?


 2) In a Spark Streaming program (one process), can I have multiple DStream
 transformations,
 each series belonging to each own StreamingContext (in the same thread or
 in different threads)?

 For example:
  val lines_A = ssc_A.socketTextStream(..)val words_A =
 lines_A.flatMap(_.split( ))val wordCounts_A = words_A.map(x = (x, 
 1)).reduceByKey(_
 + _)
 wordCounts_A.print()

 val lines_B = ssc_B.socketTextStream(..)val words_B =
  lines_B.flatMap(_.split( ))val wordCounts_B = words_B.map(x = (x, 1
 )).reduceByKey(_ + _)

 wordCounts_B.print()

 ssc_A.start()
   ssc_B.start()

 I think the answer is NO to both questions but I am wondering what is the
 reason for this behavior.


 Thanks,

 Nickos




Re: Spark SQL Self join with agreegate

2015-04-04 Thread SachinJanani
I am not sure whether this is possible, but I have tried something like
SELECT time, src, dst, sum(val1), sum(val2) FROM table GROUP BY src, dst;
and it works. I think it will give the same answer as you are expecting.







Re: Spark + Kinesis

2015-04-04 Thread Vadim Bichutskiy
Hi all,

More good news! I was able to use mergeStrategy to assemble my Kinesis
consumer into an uber jar.

Here's what I added to build.sbt:

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
    case PathList("com", "google", "common", "base", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.last
    case PathList("org", "apache", "hadoop", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", "spark", "unused", xs @ _*) => MergeStrategy.first
    case x => old(x)
  }
}

Everything appears to be working fine. Right now my producer is pushing
simple strings through Kinesis,
which my consumer is trying to print (using Spark's print() method for now).

However, instead of displaying my strings, I get the following:

15/04/04 18:57:32 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer(1428173848000 ms)

Any idea on what might be going on?

Thanks,

Vadim

Here's my consumer code (adapted from the WordCount example):


private object MyConsumer extends Logging {
  def main(args: Array[String]) {
    /* Check that all required args were passed in. */
    if (args.length < 2) {
      System.err.println(
        """
          |Usage: KinesisWordCount <stream-name> <endpoint-url>
          |    <stream-name> is the name of the Kinesis stream
          |    <endpoint-url> is the endpoint of the Kinesis service
          |                   (e.g. https://kinesis.us-east-1.amazonaws.com)
        """.stripMargin)
      System.exit(1)
    }

    /* Populate the appropriate variables from the given args */
    val Array(streamName, endpointUrl) = args

    /* Determine the number of shards from the stream */
    val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
    kinesisClient.setEndpoint(endpointUrl)
    val numShards =
      kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()
    System.out.println("Num shards: " + numShards)

    /* In this example, we're going to create 1 Kinesis Worker/Receiver/DStream for each shard. */
    val numStreams = numShards

    /* Setup the SparkConfig and StreamingContext */
    /* Spark Streaming batch interval */
    val batchInterval = Milliseconds(2000)
    val sparkConfig = new SparkConf().setAppName("MyConsumer")
    val ssc = new StreamingContext(sparkConfig, batchInterval)

    /* Kinesis checkpoint interval.  Same as batchInterval for this example. */
    val kinesisCheckpointInterval = batchInterval

    /* Create the same number of Kinesis DStreams/Receivers as Kinesis stream's shards */
    val kinesisStreams = (0 until numStreams).map { i =>
      KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
        InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }

    /* Union all the streams */
    val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))

    unionStreams.print()

    ssc.start()
    ssc.awaitTermination()
  }
}


On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com wrote:

 Just remove "provided" for spark-streaming-kinesis-asl:

 libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution on
 your cluster, so you cannot have it be just a provided dependency.  This
 is also why the KCL and its dependencies were not included in the assembly
 (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 .

  However when I try to spark-submit it I get the following exception:

  Exception in thread "main" java.lang.NoClassDefFoundError:
  com/amazonaws/auth/AWSCredentialsProvider

  Do I need to include the KCL dependency in build.sbt? Here's what it
  looks like currently:

  import AssemblyKeys._
  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
 libraryDependencies += 

Processing Time Spikes (Spark Streaming)

2015-04-04 Thread t1ny
Hi all,

I am running some benchmarks on a simple Spark application which consists of
:
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.

The processing time per batch gets slower as time passes and the number of
states increases; that is expected.
However, we also notice spikes occurring at rather regular intervals. What
could cause those spikes? We first suspected the GC, but the logs/metrics
don't seem to show any significant GC-related delays. Could this be related
to checkpointing? Disk access latencies?

I've attached a graph so you can visualize the problem (please ignore the
first spike which corresponds to system initialization) :

http://apache-spark-user-list.1001560.n3.nabble.com/file/n22375/Processing_Delay-page-001.jpg
 

Thanks !







UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All,

I am trying to build Spark 1.3.0 on a standalone Ubuntu 14.04 machine. I am
using the sbt/sbt assembly command to build it. This command works
fine with Spark 1.1.0, but for Spark 1.3 it gives the following error.
Any help or suggestions to resolve this problem would be highly appreciated.

[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[warn]  ::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]          org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]            +- org.apache.spark:spark-network-shuffle_2.10:1.3.0
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRet   

 
ries$1(Locks.scala:78)
at
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:   

 
97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at
xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:123)
at sbt.IvySbt.withIvy(Ivy.scala:120)
at sbt.IvySbt$Module.withModule(Ivy.scala:151)
at sbt.IvyActions$.updateEither(IvyActions.scala:157)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala   

 
:1318)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala   

 
:1315)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1   

 
345)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1   

 
343)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at

UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread mas
Hi All,
I am trying to build Spark 1.3.0 on standalone Ubuntu 14.04. I am using the
sbt command, i.e. sbt/sbt assembly, to build it. This command works well
with Spark 1.1; however, it gives the following error with Spark 1.3.0.
Any help or suggestions to resolve this would be highly appreciated.

[info] Done updating.
[info] Updating {file:/home/roott/aamirTest/spark/}network-shuffle...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
[warn]  ::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]          org.apache.spark:spark-network-common_2.10:1.3.0 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]            +- org.apache.spark:spark-network-shuffle_2.10:1.3.0
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.3.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
at
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRet   

 
ries$1(Locks.scala:78)
at
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:   

 
97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at
xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
at sbt.IvySbt.withIvy(Ivy.scala:123)
at sbt.IvySbt.withIvy(Ivy.scala:120)
at sbt.IvySbt$Module.withModule(Ivy.scala:151)
at sbt.IvyActions$.updateEither(IvyActions.scala:157)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala   

 
:1318)
at
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala   

 
:1315)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1   

 
345)
at
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1   

 
343)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
at 

Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread Dean Wampler
Use the Maven build instead. From the README in the git repo
(https://github.com/apache/spark):

mvn -DskipTests clean package



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sat, Apr 4, 2015 at 4:39 PM, mas mas.ha...@gmail.com wrote:

 Hi All,
 I am trying to build spark 1.3.O on standalone Ubuntu 14.04. I am using the
 sbt command i.e. sbt/sbt assembly to build it. This command works pretty
 good with spark version 1.1 however, it gives following error with spark
 1.3.0. Any help or suggestions to resolve this would highly be appreciated.

 [info] Done updating.
 [info] Updating {file:/home/roott/aamirTest/spark/}network-shuffle...
 [info] Resolving org.fusesource.jansi#jansi;1.4 ...
 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.spark#spark-network-common_2.10;1.3.0: configuration
 not p
 ublic in org.apache.spark#spark-network-common_2.10;1.3.0: 'test'. It was
 requir
 ed from org.apache.spark#spark-network-shuffle_2.10;1.3.0 test
 [warn]  ::
 [warn]
 [warn]  Note: Unresolved dependencies path:
 [warn]  org.apache.spark:spark-network-common_2.10:1.3.0
 ((com.typesafe.
 sbt.pom.MavenHelper) MavenHelper.scala#L76)
 [warn]+- org.apache.spark:spark-network-shuffle_2.10:1.3.0
 sbt.ResolveException: unresolved dependency:
 org.apache.spark#spark-network-comm
 on_2.10;1.3.0: configuration not public in
 org.apache.spark#spark-network-common
 _2.10;1.3.0: 'test'. It was required from
 org.apache.spark#spark-network-shuffle
 _2.10;1.3.0 test
 at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:278)
 at
 sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:175)
 at
 sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:157)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
 at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:151)
 at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:128)
 at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:56)
 at sbt.IvySbt$$anon$4.call(Ivy.scala:64)
 at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
 at
 xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRet
 ries$1(Locks.scala:78)
 at
 xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:
 97)
 at xsbt.boot.Using$.withResource(Using.scala:10)
 at xsbt.boot.Using$.apply(Using.scala:9)
 at
 xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
 at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
 at xsbt.boot.Locks$.apply0(Locks.scala:31)
 at xsbt.boot.Locks$.apply(Locks.scala:28)
 at sbt.IvySbt.withDefaultLogger(Ivy.scala:64)
 at sbt.IvySbt.withIvy(Ivy.scala:123)
 at sbt.IvySbt.withIvy(Ivy.scala:120)
 at sbt.IvySbt$Module.withModule(Ivy.scala:151)
 at sbt.IvyActions$.updateEither(IvyActions.scala:157)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala
 :1318)
 at
 sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala
 :1315)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1
 345)
 at
 sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$85.apply(Defaults.scala:1
 343)
 at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1348)
 at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1342)
 at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
 at sbt.Classpaths$.cachedUpdate(Defaults.scala:1360)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1300)
 at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1275)
 at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
 at
 sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
 at sbt.std.Transform$$anon$4.work(System.scala:63)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:22
 6)
 at
 sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:22
 6)
 at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
 at sbt.Execute.work(Execute.scala:235)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
 at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
 at
 sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestric
 tions.scala:159)
 at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
 at 

CPU Usage for Spark Local Mode

2015-04-04 Thread Wenlei Xie
Hi,

I am currently testing my application with Spark in local mode, and I
set the master to local[4]. One thing I notice is that when a
groupBy/reduceBy operation is involved, the CPU usage can sometimes be around
600% to 800%. I am wondering if this is expected? (As only 4 worker threads
are assigned, together with the driver thread, shouldn't it be at most 500%?)

Best,
Wenlei