Re: Read Parquet file from scala directly
Here's a Java version: https://github.com/cloudera/parquet-examples/tree/master/MapReduce It won't be that hard to port it to Scala. Thanks Best Regards On Mon, Mar 9, 2015 at 9:55 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I have a lot of Parquet files, and I want to open them directly instead of loading them into an RDD in the driver (so I can optimize some performance through special logic). But I did some research online and can't find any example of accessing Parquet directly from Scala. Has anyone done this before? Regards, Shuai
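If it helps, here is a minimal, hedged sketch of reading Parquet records directly from Scala with the parquet-avro reader (no RDD or SparkContext involved). It assumes parquet-avro and its Hadoop dependencies are on the classpath; the file path and the "value" column name are placeholders only.

import org.apache.hadoop.fs.Path
import org.apache.avro.generic.GenericRecord
import parquet.avro.AvroParquetReader   // the package is org.apache.parquet.avro in newer parquet-mr releases

object DirectParquetRead {
  def main(args: Array[String]): Unit = {
    // Open one Parquet file and iterate its records directly on the JVM
    val reader = new AvroParquetReader[GenericRecord](new Path("hdfs:///data/part-00000.parquet"))
    var record = reader.read()
    while (record != null) {
      println(record.get("value"))   // "value" is a hypothetical column name
      record = reader.read()
    }
    reader.close()
  }
}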
Re: saveAsTextFile extremely slow near finish
Don't you think 1000 partitions is too few for 160GB of data? You could also try using the KryoSerializer and enabling RDD compression. Thanks Best Regards On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x m...@spokeo.com wrote: I'm basically running a sort using Spark. The Spark program reads from HDFS, sorts on composite keys, and then saves the partitioned result back to HDFS. Pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values values.saveAsTextFile Input size is ~160G, and I specified 1000 partitions in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the job is split into two stages: saveAsTextFile and mapToPair. MapToPair finished in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress and the last few tasks just take forever and never finish. Cluster setup: 8 nodes, on each node: 15gb memory, 8 cores. Running parameters: --executor-memory 12G --conf spark.cores.max=60 Thank you for any help.
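For reference, a small sketch of the two settings suggested above (Kryo serialization and RDD compression). The app name and registered class are placeholders for whatever the job actually shuffles.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hdfs-sort")
  // use Kryo instead of Java serialization for shuffled/cached data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // compress serialized RDD partitions (helps when caching or spilling)
  .set("spark.rdd.compress", "true")
  // optional: register the classes that actually flow through the shuffle
  .registerKryoClasses(Array(classOf[String]))

val sc = new SparkContext(conf)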
Re: Spark-on-YARN architecture
Thanks for the quick reply. I am running the application in YARN client mode, and I want to run the AM on the same node as the RM in order to free up the node which would otherwise run the AM. How can I get the AM to run on the same node as the RM? On Tue, Mar 10, 2015 at 3:49 PM, Sean Owen so...@cloudera.com wrote: In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes, you could force your AM somehow to run on the same node as the RM, but why -- what do you think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have a Spark cluster set up on YARN with 4 nodes (1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is carried out on only two slaves. This decreases the performance of the cluster. 1. Is this the correct configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run the Spark master, YARN application master and resource manager on a single node? (so that I can use the three other nodes for the computation) Thanks Harika
Registering custom UDAFs with HiveContext in SparkSQL, how?
Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
Re: Spark-on-YARN architecture
I suppose you just provision enough resource to run both on that node... but it really shouldn't matter. The RM and your AM aren't communicating heavily. On Tue, Mar 10, 2015 at 10:23 AM, Harika Matha matha.har...@gmail.com wrote: Thanks for the quick reply. I am running the application in YARN client mode. And I want to run the AM on the same node as RM inorder use the node which otherwise would run AM. How can I get AM run on the same node as RM? On Tue, Mar 10, 2015 at 3:49 PM, Sean Owen so...@cloudera.com wrote: In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes you could force your AM somehow to run on the same node as the RM, but why -- what do think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[SparkSQL] Reuse HiveContext to different Hive warehouse?
I'm using Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance to different warehouse locations. Below are the steps for my test (Assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w) scala sqlContext.sql(SELECT * from src).saveAsTable(table1) scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w2) scala sqlContext.sql(SELECT * from src).saveAsTable(table2) == After these steps, the tables are stored in /test/w only. I expect table2 to be stored in /test/w2 folder. Another question is: if I set hive.metastore.warehouse.dir to a HDFS folder, I cannot use saveAsTable()? Is this by design? Exception stack trace is below: == 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74 java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.jav a:118) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a pply(newParquet.scala:252) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.a pply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc ala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.sc ala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newP arquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:37 0) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca la:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.sca la:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.ru n(commands.scala:217) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompu te(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands .scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65 ) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLConte xt.scala:1088) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:10 88) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC.init(console:35) at $iwC.init(console:37) at init(console:39) Thank you very much!
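Until the warehouse-dir behaviour is sorted out, one hedged workaround (assuming Spark 1.3 and an HDFS path you control; paths below are placeholders) is to write each result to an explicit location instead of relying on hive.metastore.warehouse.dir:

// Write the query result as Parquet to an explicit folder rather than the
// metastore-managed warehouse location.
val result = sqlContext.sql("SELECT * FROM src")
result.saveAsParquetFile("hdfs://server:8020/test/w2/table2")

// It can then be registered back as a temp table for further queries.
sqlContext.parquetFile("hdfs://server:8020/test/w2/table2").registerTempTable("table2")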
Spark-on-YARN architecture
Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: saveAsTextFile extremely slow near finish
This is more of an aside, but why repartition this data instead of letting it define partitions naturally? You will end up with a similar number. On Mar 9, 2015 5:32 PM, mingweili0x m...@spokeo.com wrote: I'm basically running a sorting using spark. The spark program will read from HDFS, sort on composite keys, and then save the partitioned result back to HDFS. pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values values.saveAsTextFile Input size is ~ 160G, and I made 1000 partitions specified in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From WebUI, the job is splitted into two stages: saveAsTextFile and mapToPair. MapToPair finished in 8 mins. While saveAsTextFile took ~15mins to reach (2366/2373) progress and the last few jobs just took forever and never finishes. Cluster setup: 8 nodes on each node: 15gb memory, 8 cores running parameters: --executor-memory 12G --conf spark.cores.max=60 Thank you for any help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark-on-YARN architecture
In YARN cluster mode, there is no Spark master, since YARN is your resource manager. Yes you could force your AM somehow to run on the same node as the RM, but why -- what do think is faster about that? On Tue, Mar 10, 2015 at 10:06 AM, Harika matha.har...@gmail.com wrote: Hi all, I have Spark cluster setup on YARN with 4 nodes(1 master and 3 slaves). When I run an application, YARN chooses, at random, one Application Master from among the slaves. This means that my final computation is being carried only on two slaves. This decreases the performance of the cluster. 1. Is this the correct way of configuration? What is the architecture of Spark on YARN? 2. Is there a way in which I can run Spark master, YARN application master and resource manager on a single node?(so that I can use three other nodes for the computation) Thanks Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-architecture-tp21986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: error on training with logistic regression sgd
Hi, Can anyone give an idea about this? Just did some google search, it seems related to the 2gb limitation on block size, https://issues.apache.org/jira/browse/SPARK-1476. The whole process is that: 1. load the data 2. convert each line of data into labeled points using some feature hashing algorithm in python. 3. train a logistic regression model with the converted labeled points. Can any one give some advice for how to avoid the 2gb, if this is the cause? Thanks very much for the help. Best, Peng On Mon, Mar 9, 2015 at 3:54 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi, I was launching a spark cluster with 4 work nodes, each work nodes contains 8 cores and 56gb ram, and I was testing my logistic regression problem. The training set is around 1.2 million records.When I was using 2**10 (1024) features, the whole program works fine, but when I use 2**14 features, the program has encountered the error: Py4JJavaError: An error occurred while calling o84.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 9, workernode0.sparkexperience4a7.d5.internal.cloudapp.net): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93) at
RE: sc.textFile() on windows cannot access UNC path
I think the work around is clear. Using JDK 7, and implement your own saveAsRemoteWinText() using java.nio.path. Yong From: ningjun.w...@lexisnexis.com To: java8...@hotmail.com; user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path Date: Tue, 10 Mar 2015 03:02:37 + Hi Yong Thanks for the reply. Yes it works with local drive letter. But I really need to use UNC path because the path is input from at runtime. I cannot dynamically assign a drive letter to arbitrary UNC path at runtime. Is there any work around that I can use UNC path for sc.textFile(…)? Ningjun From: java8964 [mailto:java8...@hotmail.com] Sent: Monday, March 09, 2015 5:33 PM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: RE: sc.textFile() on windows cannot access UNC path This is a Java problem, not really Spark. From this page: http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u You can see that using Java.nio.* on JDK 7, it will fix this issue. But Path class in Hadoop will use java.io.*, instead of java.nio. You need to manually mount your windows remote share a local driver, like Z:, then it should work. Yong From: ningjun.w...@lexisnexis.com To: user@spark.apache.org Subject: sc.textFile() on windows cannot access UNC path Date: Mon, 9 Mar 2015 21:09:38 + I am running Spark on windows 2008 R2. I use sc.textFile() to load text file using UNC path, it does not work. sc.textFile(rawfile:10.196.119.230/folder1/abc.txt, 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/10.196.119.230/tar/Enron/enron-207-short.load at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.count(RDD.scala:910) at ltn.analytics.tests.IndexTest$$anonfun$3.apply$mcV$sp(IndexTest.scala:104) at ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103) at ltn.analytics.tests.IndexTest$$anonfun$3.apply(IndexTest.scala:103) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at
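A hedged sketch of the java.nio workaround mentioned above: read the UNC share on the driver (java.nio understands UNC paths on JDK 7+) and then parallelize the lines. This only fits files small enough for driver memory, and the UNC path shown is a placeholder.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Read the remote Windows share with java.nio, then hand the lines to Spark from the driver.
val uncPath = Paths.get("""\\10.196.119.230\folder1\abc.txt""")
val lines = Files.readAllLines(uncPath, StandardCharsets.UTF_8).asScala

val rdd = sc.parallelize(lines, 4)
println(rdd.count())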
RE: Does anyone know how to deploy a custom UDAF jar file in SparkSQL?
You can add the additional jar when submitting your job, something like: ./bin/spark-submit --jars xx.jar … More options can be listed by just typing ./bin/spark-submit From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 8:48 PM To: user@spark.apache.org Subject: Does any one know how to deploy a custom UDAF jar file in SparkSQL? Hi, Does any one know how to deploy a custom UDAF jar file in SparkSQL? Where should i put the jar file so SparkSQL can pick it up and make it accessible for SparkSQL applications? I do not use spark-shell instead I want to use it in an spark application. best, /Shahab
RE: Registering custom UDAFs with HiveContext in SparkSQL, how?
Currently, Spark SQL doesn’t provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF interface soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
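Since HiveContext passes Hive DDL through, one hedged way to register a Hive-style UDAF from application code (rather than from spark-shell) looks like the sketch below. The class name and jar path are placeholders, the UDAF must follow Hive's UDAF/GenericUDAFResolver contract, and the jar itself still has to be reachable by the session (e.g. via ADD JAR or spark-submit --jars), so this only avoids manual registration, not jar deployment.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Make the jar visible to the session, then register the Hive UDAF by class name.
hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_udaf AS 'com.example.MyUdaf'")

// It can now be used like any built-in aggregate.
hiveContext.sql("SELECT key, my_udaf(value) FROM src GROUP BY key").collect()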
Does anyone know how to deploy a custom UDAF jar file in SparkSQL?
Hi, Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where should I put the jar file so SparkSQL can pick it up and make it accessible to SparkSQL applications? I do not use spark-shell; instead I want to use it in a Spark application. best, /Shahab
Re: FW: RE: distribution of receivers in spark streaming
Thanks TD and Jerry for suggestions. I have done some experiments and worked out a reasonable solution to the problem of spreading receivers to a set of worker hosts. It would be a bit too tedious to document in email. So I discuss the solution in a blog: http://scala4fun.tumblr.com/post/113172936582/how-to-distribute-receivers-over-worker-hosts-in Please be free to give me feedback if you see any issue. Thanks,Du On Friday, March 6, 2015 4:10 PM, Tathagata Das t...@databricks.com wrote: Aaah, good point, about the same node. All right. Can you post this on the user mailing list for future reference to the community? Might be a good idea to post both methods with pros and cons, as different users may have different constraints. :)Thanks :) TD On Fri, Mar 6, 2015 at 4:07 PM, Du Li l...@yahoo-inc.com wrote: Yes but the caveat may not exist if we do this when the streaming app is launched, since we're trying the start receivers before any other tasks. Discovered in this way will only include the worker hosts. By using the API we may need some extra efforts to single out worker hosts. Sometimes we may run master and worker daemons on the same host while some other times we don't, depending on configuration. Du On Friday, March 6, 2015 3:59 PM, Tathagata Das t...@databricks.com wrote: That can definitely be done. In fact I have done that. Just one caveat. If one of the executors is fully occupied with a previous very-long job, then these fake tasks may not capture that worker even if there are lots of tasks. The executor storage status will work for sure, as long as you can filter out the master. TD On Fri, Mar 6, 2015 at 3:49 PM, Du Li l...@yahoo-inc.com wrote: Hi TD, Thanks for your response and the information. I just tried out the SparkContext getExecutorMemoryStatus and getExecutorStorageStatus methods. Due to their purposes, they do not differentiate master and worker nodes. However, for performance of my app, I prefer to distribute receivers only to the worker nodes. This morning I worked out another solution: Create an accumulator and a fake workload, parallelize the workload with a high level of parallelism which does nothing but adds its hostname to the accumulator. Repeat this until the accumulator value stops growing. In the end I get the set of worker hostnames. It worked pretty well! Thanks,Du On Friday, March 6, 2015 3:11 PM, Tathagata Das t...@databricks.com wrote: What Saisai said is correct. There is no good API. However there are jacky ways of finding out the current workers. See sparkContext.getExecutorStorageStatus() and you can get the host names of the current executors. You could use those. TD On Thu, Mar 5, 2015 at 6:55 AM, Du Li l...@yahoo-inc.com wrote: | Hi TD. Do you have any suggestion? Thanks /Du Sent from Yahoo Mail for iPhone Begin Forwarded Message From: Shao, Saisai'saisai.s...@intel.com' Date: Mar 4, 2015, 10:35:44 PM To: Du Li'l...@yahoo-inc.com', User'user@spark.apache.org' Subject: RE: distribution of receivers in spark streaming Yes, hostname is enough. I think currently it is hard for user code to get the worker list from standalone master. If you can get the Master object, you could get the worker list, but AFAIK may be it is difficult to get this object. All you could do is to manually get the worker list and assigned its hostname to each receiver. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com] Sent: Thursday, March 5, 2015 2:29 PM To: Shao, Saisai; User Subject: Re: distribution of receivers in spark streaming Hi Jerry, Thanks for your response. 
Is there a way to get the list of currently registered/live workers? Even in order to provide preferredLocation, it would be safer to know which workers are active. Guess I only need to provide the hostname, right? Thanks, Du On Wednesday, March 4, 2015 10:08 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Du, You could try to sleep for several seconds after creating streaming context to let all the executors registered, then all the receivers can distribute to the nodes more evenly. Also setting locality is another way as you mentioned. Thanks Jerry From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, March 5, 2015 1:50 PM To: User Subject: Re: distribution of receivers in spark streaming Figured it out: I need to override method preferredLocation() in MyReceiver class. On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like the following, as suggested in the spark docs: val streams = (1 to numReceivers).map(_ = ssc.receiverStream(new MyKafkaReceiver())) ssc.union(streams) However, from the spark UI, I saw that some machines are not running any instance of the receiver while some get three. The
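For reference, a hedged sketch of the "fake workload" idea discussed above: run many trivial tasks to collect the executor hostnames, then spread the receivers over those hosts by overriding preferredLocation. MyKafkaReceiver stands in for the user's existing receiver class and is hypothetical here.

import java.net.InetAddress

// Run a cheap, highly parallel job and record which hosts executed tasks.
// Repeating it (or raising the parallelism) makes it more likely every worker is seen.
val workerHosts = sc.parallelize(1 to 1000, 1000)
  .map(_ => InetAddress.getLocalHost.getHostName)
  .distinct()
  .collect()

// A receiver that pins itself to a given host via preferredLocation.
class LocatedKafkaReceiver(host: String) extends MyKafkaReceiver {   // MyKafkaReceiver is hypothetical
  override def preferredLocation: Option[String] = Some(host)
}

// Round-robin the receivers over the discovered worker hosts.
val numReceivers = 8
val streams = (0 until numReceivers).map { i =>
  ssc.receiverStream(new LocatedKafkaReceiver(workerHosts(i % workerHosts.length)))
}
ssc.union(streams)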
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
Hi, On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide in using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference between the two approaches, besides using different syntax? Are they interchangeable? Which one has better performance? One difference is that the language integrated queries are method calls on the SchemaRDD you want to work on, which requires you have access to the object at hand. The SQL queries are passed to a method of the SQLContext and you have to call registerTempTable() on the SchemaRDD you want to use beforehand, which can basically happen at an arbitrary location of your program. (I don't know if I could express what I wanted to say.) That may have an influence on how you design your program and how the different parts work together. Tobias
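A small hedged sketch of the two styles side by side (Spark 1.2-era SchemaRDD API; exact imports can vary slightly by version, and the Person case class and data are made up):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // brings in createSchemaRDD and the symbol-based DSL implicits

val people = sc.parallelize(Seq(Person("Ann", 25), Person("Bob", 14)))

// 1) Language-integrated query: method calls directly on the SchemaRDD
val adultsDsl = people.where('age >= 18).select('name)

// 2) SQL: register a temp table first, then query it from anywhere the context is visible
people.registerTempTable("people")
val adultsSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

adultsDsl.collect().foreach(println)
adultsSql.collect().foreach(println)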
Re: Why spark master consumes 100% CPU when we kill a spark streaming app?
Probably the cleanup work, like cleaning shuffle files and tmp files, costs too much CPU: if we run Spark Streaming for a long time, lots of files are generated, so cleaning up these files before the app exits can be time-consuming. Thanks Jerry 2015-03-11 10:43 GMT+08:00 Tathagata Das t...@databricks.com: Do you have event logging enabled? That could be the problem. The Master tries to aggressively recreate the web UI of the completed job with the event logs (when it is enabled), causing the Master to stall. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6270 On Tue, Mar 10, 2015 at 7:10 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Has anyone had a similar problem before?
Re: Solve least square problem of the form min norm(Ax - b)^2 + lambda * n * norm(x)^2 ?
I'm trying to play with the implementation of least square solver (Ax = b) in mlmatrix.TSQR where A is a 5*1024 matrix and b a 5*10 matrix. It works but I notice that it's 8 times slower than the implementation given in the latest ampcamp : http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html . As far as I know these two implementations come from the same basis. What is the difference between these two codes ? On Tue, Mar 3, 2015 at 8:02 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There are couple of solvers that I've written that is part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are interested in porting them I'd be happy to review it Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala [2] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao
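Aside from the distributed solvers being compared, the objective in the subject line has a closed-form normal-equations solution; a hedged local sketch with Breeze (sizes and data are illustrative only) may help clarify what is being minimised:

import breeze.linalg.{DenseMatrix, DenseVector, diag}

// Solve min ||A x - b||^2 + lambda * n * ||x||^2 via the normal equations:
//   (A^T A + lambda * n * I) x = A^T b
def ridgeSolve(a: DenseMatrix[Double], b: DenseVector[Double], lambda: Double): DenseVector[Double] = {
  val n = a.rows.toDouble
  val reg = diag(DenseVector.fill(a.cols)(lambda * n))
  val gram = (a.t * a) + reg
  gram \ (a.t * b)
}

// Tiny made-up example
val a = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val b = DenseVector(1.0, 2.0, 3.0)
println(ridgeSolve(a, b, lambda = 0.1))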
Re: Setting up Spark with YARN on EC2 cluster
Hi Harika, Did you get any solution for this? I want to use yarn , but the spark-ec2 script does not support it. Thanks -Roni -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818p21991.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ANSI Standard Supported by the Spark-SQL
Spark SQL supports a subset of HiveQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote: From the archives in this user list, It seems that Spark-SQL is yet to achieve SQL 92 level. But there are few things still not clear. 1. This is from an old post dated : Aug 09, 2014. 2. It clearly says that it doesn't support DDL and DML operations. Does that means, all reads (select) are sql 92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to spark and trying to understand what SQL Standard is supported by the Spark. I googled around a lot but didn't get clear answer. Some where I saw that Spark supports sql-92 and at other location I found that spark is not fully compliant with sql-92. I also noticed that using Hive Context I can execute almost all the queries supported by the Hive, so does that means that Spark is equivalent to Hive in terms of sql standards, given that HiveContext is used for querying. Please suggest. Regards, Ravi
Re: Compilation error
If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Registering custom UDAFs with HiveContext in SparkSQL, how?
Thanks Hao, but my question concerns UDAFs (user-defined aggregation functions), not UDTFs (user-defined table-generating functions). I would appreciate it if you could point me to some starting point for UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn’t provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF interface soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveContext in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as functions in HiveContext, I could not find any documentation on how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
Re: Compilation error
You have to include the Scala libraries in the Eclipse dependencies. TD On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out the streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get a compilation error, and eclipse is not able to recognize Tuple2. I also don't see any import for the scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStream<String> lines) { JavaDStream<String> words = lines.flatMap( new FlatMapFunction<String, String>() { @Override public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); } }); // Count each word in each batch JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } }); }
Re: Can't cache RDD of collaborative filtering on MLlib
Thank you for your reply. 1. Which version of Spark do you use now? I use Spark 1.2.0 (CDH 5.3.1). 2. Why don't you check with the Web UI whether `productJavaRDD` and `userJavaRDD` are cached or not? I checked the Spark UI. The task was stopped at 1/2 (Succeeded/Total tasks). Here is the stacktrace: org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:840) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendUsers(MatrixFactorizationModel.scala:126) ... 3. You can check whether an RDD is cached or not with the `getStorageLevel()` method, such as `model.userFeatures.getStorageLevel()`. I printed the return value of getStorageLevel() for userFeatures and productFeatures; both were Memory Deserialized 1x Replicated. I think the two variables were configured to be cached, but were not actually cached at that time (delayed?). Thanks, Yuichiro Sakamoto
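One thing that may help when checking this: persist/cache is lazy, so the feature RDDs are only materialised after an action runs over them. A hedged sketch (Scala API; model is assumed to be an already-trained MatrixFactorizationModel):

import org.apache.spark.storage.StorageLevel

// persist() only marks the RDDs; nothing is stored until an action touches them.
model.userFeatures.persist(StorageLevel.MEMORY_ONLY)
model.productFeatures.persist(StorageLevel.MEMORY_ONLY)

// Force materialisation once, so later calls (e.g. recommendUsers) hit the cache.
model.userFeatures.count()
model.productFeatures.count()

println(model.userFeatures.getStorageLevel)   // should now report the cached storage level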
Re: Compilation error
How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Why spark master consumes 100% CPU when we kill a spark streaming app?
Do you have event logging enabled? That could be the problem. The Master tries to aggressively recreate the web ui of the completed job with the event logs (when it is enabled) causing the Master to stall. I created a JIRA for this. https://issues.apache.org/jira/browse/SPARK-6270 On Tue, Mar 10, 2015 at 7:10 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hey, Recently, we found in our cluster, that when we kill a spark streaming app, the whole cluster cannot response for 10 minutes. And, we investigate the master node, and found the master process consumes 100% CPU when we kill the spark streaming app. How could it happen? Did anyone had the similar problem before?
Re: Is it possible to use windows service to start and stop spark standalone cluster
Have you tried Apache Commons Daemon? http://commons.apache.org/proper/commons-daemon/procrun.html From: Wang, Ningjun (LNG-NPV) Date: Tuesday, March 10, 2015 at 11:47 PM To: user@spark.apache.org Subject: Is it possible to use windows service to start and stop spark standalone cluster We are using a Spark standalone cluster on Windows 2008 R2. I can start the Spark cluster by opening a command prompt and running the following: bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop the Spark cluster by pressing Ctrl-C. The problem is that if the machine is rebooted, I have to manually start the Spark cluster again as above. Is it possible to use a Windows service to start the cluster? This way, when the machine is rebooted, the Windows service will automatically restart the Spark cluster. How to stop the Spark cluster using a Windows service is also a challenge. Please advise. Thanks Ningjun
S3 SubFolder Write Issues
Hi All, I am hoping someone has seen this issue before with S3, as I haven't been able to find a solution for this problem. When I try to save as a text file to S3 into a subfolder, it only ever writes out to the bucket-level folder and produces block-level generated file names, not the output folder I specified. Below is the sample code in Scala; I have also seen this behavior in the Java code. val out = inputRdd.map { ir => mapFunction(ir) }.groupByKey().mapValues { x => mapValuesFunction(x) }.saveAsTextFile("s3://BUCKET/SUB_FOLDER/output") Any ideas on how to get saveAsTextFile to write to an S3 subfolder? Thanks, Chris
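One hedged thing to check: the old Hadoop s3:// block filesystem stores raw blocks at the bucket level, while the native s3n:// connector writes real objects under the key prefix, which matches the "block level file names" symptom. A sketch, reusing the code above with placeholder bucket and credentials:

// Configure the native S3 filesystem credentials (placeholders).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val out = inputRdd.map(ir => mapFunction(ir))
  .groupByKey()
  .mapValues(x => mapValuesFunction(x))

// With s3n:// the "folder" is just a key prefix, so part files land under SUB_FOLDER/output/.
out.saveAsTextFile("s3n://BUCKET/SUB_FOLDER/output")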
Re: SQL with Spark Streaming
Hi, On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote: Intel has a prototype for doing this; SaiSai and Jason are the authors. Probably you can ask them for some materials. The github repository is here: https://github.com/intel-spark/stream-sql Also, what I did is write a wrapper class SchemaDStream that internally holds a DStream[Row] and a DStream[StructType] (the latter having just one element in every RDD) and then allows you to do: - operations SchemaRDD => SchemaRDD using `rowStream.transformWith(schemaStream, ...)` - in particular, you can register this stream's data as a table this way - and, via a companion object with a method `fromSQL(sql: String): SchemaDStream`, you can get a new stream from previously registered tables. However, you are limited to batch-internal operations, i.e., you can't aggregate across batches. I am not able to share the code at the moment, but will within the next months. It is not very advanced code, though, and should be easy to replicate. Also, I have no idea about the performance of transformWith. Tobias
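For anyone who only needs per-batch SQL without a dedicated streaming-SQL layer, a hedged sketch of the usual foreachRDD pattern (Spark 1.2-era API; Word is a made-up case class, and aggregation is per batch only):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class Word(text: String)

def runBatchSql(sqlContext: SQLContext, stream: DStream[String]): Unit = {
  import sqlContext.createSchemaRDD   // implicit RDD[Product] => SchemaRDD

  stream.foreachRDD { rdd =>
    // Register this batch as a temp table and query it with plain SQL.
    rdd.map(Word(_)).registerTempTable("words")
    val counts = sqlContext.sql("SELECT text, COUNT(*) FROM words GROUP BY text")
    counts.collect().foreach(println)
  }
}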
Re: ANSI Standard Supported by the Spark-SQL
Thanks Michael, That helps. So just to summarise that we should not make any assumption about Spark being fully compliant with any SQL Standards until announced by the community and maintain the same status quo as you have suggested. Regards, Ravi. On Tue, Mar 10, 2015 at 11:14 PM Michael Armbrust mich...@databricks.com wrote: Spark SQL supports a subset of HiveQL: http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote: From the archives in this user list, It seems that Spark-SQL is yet to achieve SQL 92 level. But there are few things still not clear. 1. This is from an old post dated : Aug 09, 2014. 2. It clearly says that it doesn't support DDL and DML operations. Does that means, all reads (select) are sql 92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to spark and trying to understand what SQL Standard is supported by the Spark. I googled around a lot but didn't get clear answer. Some where I saw that Spark supports sql-92 and at other location I found that spark is not fully compliant with sql-92. I also noticed that using Hive Context I can execute almost all the queries supported by the Hive, so does that means that Spark is equivalent to Hive in terms of sql standards, given that HiveContext is used for querying. Please suggest. Regards, Ravi
Is it possible to use windows service to start and stop spark standalone cluster
We are using a Spark standalone cluster on Windows 2008 R2. I can start the Spark cluster by opening a command prompt and running the following: bin\spark-class.cmd org.apache.spark.deploy.master.Master bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://mywin.mydomain.com:7077 I can stop the Spark cluster by pressing Ctrl-C. The problem is that if the machine is rebooted, I have to manually start the Spark cluster again as above. Is it possible to use a Windows service to start the cluster? This way, when the machine is rebooted, the Windows service will automatically restart the Spark cluster. How to stop the Spark cluster using a Windows service is also a challenge. Please advise. Thanks Ningjun
Re: sparse vector operations in Python
There isn't a great way currently. The best option is probably to convert to scipy.sparse column vectors and add using scipy. Joseph On Mon, Mar 9, 2015 at 4:21 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Why spark master consumes 100% CPU when we kill a spark streaming app?
Hey, recently we found in our cluster that when we kill a Spark Streaming app, the whole cluster cannot respond for 10 minutes. We investigated the master node and found that the master process consumes 100% CPU when we kill the Spark Streaming app. How could this happen? Has anyone had a similar problem before?
Pyspark not using all cores
Hi All, I need some help with a problem in pyspark which is causing a major issue. Recently I've noticed that the behaviour of the python.deamons on the worker nodes for compute-intensive tasks have changed from using all the avaliable cores to using only a single core. On each worker node, 8 python.deamons exist but they all seem to run on a single core. The remaining 7 cores idle. Our hardware consists of 9 hosts (1x driver node and 8x worker nodes) each with 8 cores and 64gb RAM - we are using Spark 1.2.0-SNAPSHOT (Python) in standalone mode via Cloudera 5.3.2. To give a better understanding of the problem I have made a quick script from one of the given examples which replicates the problem: When I run the calculating_pies.py script using the command spark-submit calculate_pies.py this is what I typically see on all my worker nodes: 1 [|||100.0%] 5 [||| 3.9%] 2 [ 0.0%] 6 [||| 2.6%] 3 [|| 1.3%] 7 [|8.4%] 4 [|| 1.3%] 8 [|| 8.7%] PID USER PRI NI VIRT RESSHR S CPU% MEM% TIME+ Command 30672 spark20 0 225M 112M 1156 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30681 spark20 0 225M 112M 1152 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30687 spark20 0 225M 112M 1152 R 13.0 0.2 0:03.10 python -m pyspark.daemon 30678 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon 30693 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.08 python -m pyspark.daemon 30674 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon 30688 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.08 python -m pyspark.daemon 30684 spark20 0 225M 112M 1152 R 12.0 0.2 0:03.10 python -m pyspark.daemon Through the spark UI I do see 8 executor ids with 8 active tasks on each. I also see the same behaviour if I use the flag --total-executor-cores 64 in spark-submit. Strangly, If I run the same script in local mode everything seems to run fine. This is what I see 1 [||100.0%] 5 [||100.0%] 2 [||100.0%] 6 [||100.0%] 3 [||100.0%] 7 [||100.0%] 4 [|||99.3%] 8 [|||99.4%] PID USER PRI NI VIRT RESSHR S CPU% MEM% TIME+ Command 22519 data 20 0 225M 106M 1368 R 99.0 0.20:10.97 python -m pyspark.daemon 22508 data 20 0 225M 106M 1368 R 99.0 0.20:10.92 python -m pyspark.daemon 22513 data 20 0 225M 106M 1368 R 99.0 0.20:11.02 python -m pyspark.daemon 22526 data 20 0 225M 106M 1368 R 99.0 0.20:10.84 python -m pyspark.daemon 22522 data 20 0 225M 106M 1368 R 98.0 0.20:10.95 python -m pyspark.daemon 22523 data 20 0 225M 106M 1368 R 97.0 0.20:10.92 python -m pyspark.daemon 22507 data 20 0 225M 106M 1368 R 97.0 0.20:10.83 python -m pyspark.daemon 22516 data 20 0 225M 106M 1368 R 93.0 0.20:10.88 python -m pyspark.daemon == calculating_pies.py == #!/usr/bin/pyspark import random from pyspark.conf import SparkConf from pyspark.context import SparkContext from pyspark.storagelevel import StorageLevel def pi(NUM_SAMPLE = 100): count = 0. for i in xrange(NUM_SAMPLE): x, y = random.random(), random.random() if (x * x) + (y * y) 1: count += 1 return 4.0 * (count / NUM_SAMPLE) if __name__ == __main__: sconf = (SparkConf() .set('spark.default.parallelism','256') .set('spark.app.name', 'Calculating PI')) # local # sc = SparkContext(conf=sconf) # standalone sc = SparkContext(spark://driver_host:7077, conf=sconf) # yarn # sc = SparkContext(yarn-client, conf=sconf) rdd_pies = sc.parallelize(range(1), 1000) rdd_pies.map(lambda x: pi()).collect() sc.stop() = Does anyone have any suggestions, or know of any config we should be looking at that could solve this problem? Does anyone else see the same problem? Any help is appreciated, Thanks. -- View this message
Re: Spark Streaming testing strategies
Hi Holden Thanks Holden for pointing me the package. Indeed StreamingSuiteBase trait hides a lot, especially regarding clock manipulation. Did you encounter problems with concurrent tests execution from SBT (SPARK-2243)? I had to disable parallel execution and configure SBT to use separate JVM for tests execution (fork). BTW. I added samples for SparkSQL as well. I would expect base trait for testing purposes in spark distribution. ManualClock should be exposed as well. And some documentation how to configure SBT to avoid problems with multiple spark contexts. I'm going to create improvement proposal on Spark issue tracker about it. On 1 March 2015 at 18:49, Holden Karau hol...@pigscanfly.ca wrote: There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com wrote: I have started using Spark and Spark Streaming and I'm wondering how do you test your applications? Especially Spark Streaming application with window based transformations. After some digging I found ManualClock class to take full control over stream processing. Unfortunately the class is not available outside spark.streaming package. Are you going to expose the class for other developers as well? Now I have to use my custom wrapper under spark.streaming package. My Spark and Spark Streaming unit tests strategies are documented here: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Your feedback is more than appreciated. Marcin -- Cell : 425-233-8271 - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
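For the SBT side mentioned above, a hedged build.sbt fragment that avoids the multiple-SparkContext problem by forking the test JVM and disabling parallel test execution (key names as of sbt 0.13; the heap settings are placeholders):

// build.sbt (fragment) -- run test suites in a forked JVM, one suite at a time,
// so only a single SparkContext/StreamingContext is live per JVM.
fork in Test := true
parallelExecution in Test := false

// Optional: give the forked test JVM a bit more room for Spark.
javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=256m")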
Compilation error on JavaPairDStream
I am getting the following error. When I look at the sources it seems to be a Scala source, but I am not sure why it's complaining about it. The method map(Function<String,R>) in the type JavaDStream<String> is not applicable for the arguments (new PairFunction<String,String,Integer>(){}) And my code has been taken from the spark examples site: JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } });
java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc
Hi, I have a CDH5.3.2(Spark1.2) cluster. I am getting an local class incompatible exception for my spark application during an action. All my classes are case classes(To best of my knowledge) Appreciate any help. Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 346, datanode02): java.io.InvalidClassException: org.apache.spark.rdd.PairRDDFunctions; local class incompatible: stream classdesc serialVersionUID = 8789839749593513237, local class serialVersionUID = -4145741279224749316 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Thanks Manas
Writing wide parquet file in Spark SQL
Hi All, I am currently trying to write a very wide file into parquet using spark sql. I have 100K column records that I am trying to write out, but of course I am running into space issues(out of memory - heap space). I was wondering if there are any tweaks or work arounds for this. I am basically calling saveAsParquetFile on the schemaRDD. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-wide-parquet-file-in-Spark-SQL-tp21995.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Compilation error
I navigated to maven dependency and found scala library. I also found Tuple2.class and when I click on it in eclipse I get invalid LOC header (bad signature) java.util.zip.ZipException: invalid LOC header (bad signature) at java.util.zip.ZipFile.read(Native Method) I am wondering if I should delete that file from local repo and re-sync On Tue, Mar 10, 2015 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I ran the dependency command and see the following dependencies: I only see org.scala-lang. [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.2.0:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:co mpile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compil e [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile [INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile [INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile [INFO] +- com.twitter:chill_2.10:jar:0.5.0:compile [INFO] | \- com.esotericsoftware.kryo:kryo:jar:2.21:compile [INFO] | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:co mpile [INFO] | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile [INFO] | \- org.objenesis:objenesis:jar:1.2:compile [INFO] +- com.twitter:chill-java:jar:0.5.0:compile [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile [INFO] | | +- org.apache.commons:commons-math:jar:2.1:compile [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile [INFO] | | +- commons-io:commons-io:jar:2.1:compile [INFO] | | +- commons-logging:commons-logging:jar:1.1.1:compile [INFO] | | +- commons-lang:commons-lang:jar:2.5:compile [INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile [INFO] | | | +- commons-collections:commons-collections:jar:3.2.1:compile [INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile [INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile [INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile [INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile [INFO] | | +- org.apache.avro:avro:jar:1.7.4:compile [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | | +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile [INFO] | | \- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile [INFO] | | \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile [INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:co mpile [INFO] | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile [INFO] | | | | +- com.google.inject:guice:jar:3.0:compile [INFO] | | | | | +- javax.inject:javax.inject:jar:1:compile [INFO] | | | | | \- aopalliance:aopalliance:jar:1.0:compile [INFO] | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framew ork-grizzly2:jar:1.9:compile [INFO] | | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-fra mework-core:jar:1.9:compile [INFO] | | | | | | +- 
javax.servlet:javax.servlet-api:jar:3.0.1:compile [INFO] | | | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile [INFO] | | | | | \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:comp ile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-framework:jar:2. 1.2:compile [INFO] | | | | | | \- org.glassfish.gmbal:gmbal-api-only:jar:3.0. 0-b023:compile [INFO] | | | | | | \- org.glassfish.external:management-api:ja r:3.0.0-b012:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-server:jar:2.1 .2:compile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:co mpile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-servlet:jar:2. 1.2:compile [INFO] | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile [INFO] | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile [INFO] | | | | | +- asm:asm:jar:3.1:compile [INFO] | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile [INFO] | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile [INFO] | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile [INFO] | | | | | | \- stax:stax-api:jar:1.0.1:compile [INFO] | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile [INFO] | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] |
Re: Spark Streaming testing strategies
On Tue, Mar 10, 2015 at 1:18 PM, Marcin Kuthan marcin.kut...@gmail.com wrote: Hi Holden Thanks Holden for pointing me the package. Indeed StreamingSuiteBase trait hides a lot, especially regarding clock manipulation. Did you encounter problems with concurrent tests execution from SBT (SPARK-2243)? I had to disable parallel execution and configure SBT to use separate JVM for tests execution (fork). Yah, I haven't used parallel execution with this testing trait, I can look into it some more. BTW. I added samples for SparkSQL as well. Oh awesome :) I would expect base trait for testing purposes in spark distribution. ManualClock should be exposed as well. And some documentation how to configure SBT to avoid problems with multiple spark contexts. I'm going to create improvement proposal on Spark issue tracker about it. Right now I think a package is probably a good place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoughts). On 1 March 2015 at 18:49, Holden Karau hol...@pigscanfly.ca wrote: There is also the Spark Testing Base package which is on spark-packages.org and hides the ugly bits (it's based on the existing streaming test code but I cleaned it up a bit to try and limit the number of internals it was touching). On Sunday, March 1, 2015, Marcin Kuthan marcin.kut...@gmail.com wrote: I have started using Spark and Spark Streaming and I'm wondering how do you test your applications? Especially Spark Streaming application with window based transformations. After some digging I found ManualClock class to take full control over stream processing. Unfortunately the class is not available outside spark.streaming package. Are you going to expose the class for other developers as well? Now I have to use my custom wrapper under spark.streaming package. My Spark and Spark Streaming unit tests strategies are documented here: http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ Your feedback is more than appreciated. Marcin -- Cell : 425-233-8271 -- Cell : 425-233-8271
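As a concrete reference, the SBT settings being described (a forked test JVM, no parallel test execution) look roughly like this in sbt 0.13 syntax (a sketch only):

  // build.sbt: run tests sequentially in a separate JVM so that
  // multiple SparkContexts never coexist in the same JVM
  fork in Test := true
  parallelExecution in Test := false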
Re: Compilation error on JavaPairDStream
Ah, that's a typo in the example: use words.mapToPair. I can make a little PR to fix that. On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am getting the following error. When I look at the sources it seems to be a Scala source, but I'm not sure why it's complaining about it. The method map(Function<String,R>) in the type JavaDStream<String> is not applicable for the arguments (new PairFunction<String,String,Integer>(){}) And my code has been taken from the Spark examples site: JavaPairDStream<String, Integer> pairs = words.map( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } }); - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
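For clarity, this is roughly what the corrected snippet looks like once map is replaced with mapToPair (a sketch; variable names follow the example above):

  // Count each word in each batch, using mapToPair instead of map
  JavaPairDStream<String, Integer> pairs = words.mapToPair(
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
          return new Tuple2<String, Integer>(s, 1);
        }
      });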
Spark 1.3 SQL Type Parser Changes?
In Spark 1.2 I used to be able to do this: scala> org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<int:bigint>") res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell> HiveTypeHelper.toDataType("struct<int:bigint>") org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``>'' expected but `int' found struct<int:bigint> ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
SchemaRDD: SQL Queries vs Language Integrated Queries
I am new to the SchemaRDD class, and I am trying to decide between using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me the main difference between the two approaches, besides the different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
RE: Compilation error
Or another option is to use Scala-IDE, which is built on top of Eclipse, instead of pure Eclipse, so Scala comes with it. Yong From: so...@cloudera.com Date: Tue, 10 Mar 2015 18:40:44 + Subject: Re: Compilation error To: mohitanch...@gmail.com CC: t...@databricks.com; user@spark.apache.org A couple points: You've got mismatched versions here -- 1.2.0 vs 1.2.1. You should fix that but it's not your problem. These are also supposed to be 'provided' scope dependencies in Maven. You should get the Scala deps transitively and can import scala.* classes. However, it would be a little bit more correct to depend directly on the scala library classes, but in practice, easiest not to in simple use cases. If you're still having trouble look at the output of mvn dependency:tree On Tue, Mar 10, 2015 at 6:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); } - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
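For reference, a rough sketch of what the POM looks like with Sean's two suggestions applied (aligned versions and provided scope); the exact version number is whatever release you actually target:

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

Note that provided scope assumes the application is submitted to a cluster that already ships the Spark jars; for running inside Eclipse you may want to keep them at compile scope.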
Re: Compilation error
I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
Re: Compilation error
See if you can import scala libraries in your project. On Tue, Mar 10, 2015 at 11:32 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using maven and my dependency looks like this, but this doesn't seem to be working dependencies dependency groupIdorg.apache.spark/groupId artifactIdspark-streaming_2.10/artifactId version1.2.0/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.10/artifactId version1.2.1/version /dependency /dependencies On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com wrote: If you are using tools like SBT/Maven/Gradle/etc, they figure out all the recursive dependencies and includes them in the class path. I haven't touched Eclipse in years so I am not sure off the top of my head what's going on instead. Just in case you only downloaded the spark-streaming_2.10.jar then that is indeed insufficient and you have to download all the recursive dependencies. May be you should create a Maven project inside Eclipse? TD On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How do I do that? I haven't used Scala before. Also, linking page doesn't mention that: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#linking On Tue, Mar 10, 2015 at 10:57 AM, Sean Owen so...@cloudera.com wrote: It means you do not have Scala library classes in your project classpath. On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying out streaming example as documented and I am using spark 1.2.1 streaming from maven for Java. When I add this code I get compilation error on and eclipse is not able to recognize Tuple2. I also don't see any import scala.Tuple2 class. http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#a-quick-example private void map(JavaReceiverInputDStreamString lines) { JavaDStreamString words = lines.flatMap( new FlatMapFunctionString, String() { @Override public IterableString call(String x) { return Arrays.asList(x.split( )); } }); // Count each word in each batch JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } }); }
ec2 persistent-hdfs with ebs using spot instances
Hello, I'm new to ec2. I've set up a spark cluster on ec2 and am using persistent-hdfs with the data nodes mounting ebs. I launched my cluster using spot-instances ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.0321 --hadoop-major-version=2 --copy-aws-credentials --ebs-vol-size=100 launch mysparkcluster My question is, if the spot-instances get dropped, and I try and attach new slaves to existing master with --use-existing-master, can I mount those new slaves to the same ebs volumes? I'm guessing not. If somebody has experience with this, how is it done? Thanks. Sincerely, Deb
How to pass parameter to spark-shell when choose client mode --master yarn-client
Hi All, I am trying to pass parameters to spark-shell for some tests: spark-shell --driver-memory 512M --executor-memory 4G --master spark://:7077 --conf spark.sql.parquet.compression.codec=snappy --conf spark.sql.parquet.binaryAsString=true This works fine on my local PC. But when I start EMR and pass the similar options: ~/spark/bin/spark-shell --executor-memory 40G --master yarn-client --num-executors 1 --executor-cores 32 --conf spark.sql.parquet.compression.codec=snappy --conf spark.sql.parquet.binaryAsString=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer the parameters don't seem to reach spark-shell. Does anyone know why? And what are the alternatives for doing this? Regards, Shuai
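One way to see whether the flags actually reached the driver, and to set the SQL options programmatically if they did not (a sketch; it assumes a SQLContext or HiveContext named sqlContext is available in the shell, otherwise create one from sc first):

  // check what the driver actually received
  sc.getConf.getOption("spark.serializer")                  // None if unset
  sc.getConf.getOption("spark.sql.parquet.binaryAsString")

  // SQL-specific options can also be set directly on the SQLContext
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

Another option is to put the settings in conf/spark-defaults.conf on the machine that launches the shell instead of passing them on the command line.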
Re: Setting up Spark with YARN on EC2 cluster
Harika, I think you can modify an existing spark-on-ec2 cluster to run YARN MapReduce; not sure if this is what you are looking for. To try: 1) log on to the master 2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/ and add this to mapred-site.xml:
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
3) use copy-dir to copy this file over to the slaves (don't know if this step is necessary) eg. ~/spark-ec2/copy-dir.sh ~/ephemeral-hdfs/conf/mapred-site.xml 4) stop and restart hdfs (for persistent-hdfs it wasn't started to begin with) ephemeral-hdfs]$ ./sbin/stop-all.sh ephemeral-hdfs]$ ./sbin/start-all.sh HTH Deb On Wed, Feb 25, 2015 at 11:46 PM, Harika matha.har...@gmail.com wrote: Hi, I want to set up a Spark cluster with a YARN dependency on Amazon EC2. I was reading this https://spark.apache.org/docs/1.2.0/running-on-yarn.html document and I understand that Hadoop has to be set up for running Spark with YARN. My questions - 1. Do we have to set up a Hadoop cluster on EC2 and then build Spark on it? 2. Is there a way to modify the existing Spark cluster to work with YARN? Thanks in advance. Harika -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Compilation error
I ran the dependency command and see the following dependencies: I only see org.scala-lang. [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.2.0:compile [INFO] | +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile [INFO] | | +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:co mpile [INFO] | | +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compil e [INFO] | | \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile [INFO] | | \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile [INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile [INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.1:compile [INFO] +- com.twitter:chill_2.10:jar:0.5.0:compile [INFO] | \- com.esotericsoftware.kryo:kryo:jar:2.21:compile [INFO] | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:co mpile [INFO] | +- com.esotericsoftware.minlog:minlog:jar:1.2:compile [INFO] | \- org.objenesis:objenesis:jar:1.2:compile [INFO] +- com.twitter:chill-java:jar:0.5.0:compile [INFO] +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile [INFO] | +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile [INFO] | | +- commons-cli:commons-cli:jar:1.2:compile [INFO] | | +- org.apache.commons:commons-math:jar:2.1:compile [INFO] | | +- xmlenc:xmlenc:jar:0.52:compile [INFO] | | +- commons-io:commons-io:jar:2.1:compile [INFO] | | +- commons-logging:commons-logging:jar:1.1.1:compile [INFO] | | +- commons-lang:commons-lang:jar:2.5:compile [INFO] | | +- commons-configuration:commons-configuration:jar:1.6:compile [INFO] | | | +- commons-collections:commons-collections:jar:3.2.1:compile [INFO] | | | +- commons-digester:commons-digester:jar:1.8:compile [INFO] | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile [INFO] | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile [INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile [INFO] | | +- org.apache.avro:avro:jar:1.7.4:compile [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [INFO] | | +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile [INFO] | | \- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile [INFO] | | \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile [INFO] | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:co mpile [INFO] | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile [INFO] | | | | +- com.google.inject:guice:jar:3.0:compile [INFO] | | | | | +- javax.inject:javax.inject:jar:1:compile [INFO] | | | | | \- aopalliance:aopalliance:jar:1.0:compile [INFO] | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-framew ork-grizzly2:jar:1.9:compile [INFO] | | | | | +- com.sun.jersey.jersey-test-framework:jersey-test-fra mework-core:jar:1.9:compile [INFO] | | | | | | +- javax.servlet:javax.servlet-api:jar:3.0.1:compile [INFO] | | | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile [INFO] | | | | | \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:comp ile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-framework:jar:2. 1.2:compile [INFO] | | | | | | \- org.glassfish.gmbal:gmbal-api-only:jar:3.0. 
0-b023:compile [INFO] | | | | | | \- org.glassfish.external:management-api:ja r:3.0.0-b012:compile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-server:jar:2.1 .2:compile [INFO] | | | | | | \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:co mpile [INFO] | | | | | +- org.glassfish.grizzly:grizzly-http-servlet:jar:2. 1.2:compile [INFO] | | | | | \- org.glassfish:javax.servlet:jar:3.1:compile [INFO] | | | | +- com.sun.jersey:jersey-server:jar:1.9:compile [INFO] | | | | | +- asm:asm:jar:3.1:compile [INFO] | | | | | \- com.sun.jersey:jersey-core:jar:1.9:compile [INFO] | | | | +- com.sun.jersey:jersey-json:jar:1.9:compile [INFO] | | | | | +- org.codehaus.jettison:jettison:jar:1.1:compile [INFO] | | | | | | \- stax:stax-api:jar:1.0.1:compile [INFO] | | | | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile [INFO] | | | | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] | | | | | | \- javax.activation:activation:jar:1.1:compile [INFO] | | | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile [INFO] | | | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile [INFO] | | | | \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile [INFO] | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:comp ile [INFO] | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:c ompile [INFO] | +-
Re: ANSI Standard Supported by the Spark-SQL
From the archives of this user list, it seems that Spark SQL has yet to reach SQL-92 level. But a few things are still not clear. 1. This is from an old post dated Aug 09, 2014. 2. It clearly says that DDL and DML operations are not supported. Does that mean all reads (SELECT) are SQL-92 compliant? Please clarify. Regards, Ravi On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote: Hi All, I am new to Spark and trying to understand which SQL standard Spark supports. I googled around a lot but didn't get a clear answer. In some places I saw that Spark supports SQL-92, and in others that it is not fully compliant with SQL-92. I also noticed that using HiveContext I can execute almost all the queries supported by Hive, so does that mean Spark is equivalent to Hive in terms of SQL standards, given that HiveContext is used for querying? Please suggest. Regards, Ravi
Re: Spark with Spring
It will be good if you can explain the entire usecase like what kind of requests, what sort of processing etc. Thanks Best Regards On Mon, Mar 9, 2015 at 11:18 PM, Tarun Garg bigdat...@live.com wrote: Hi, I have a existing web base system which receives the request and process that. This framework uses Spring framework. Now i am planning to separate this business logic and out that in Spark Streaming. I am not sure using Spring framework in streaming is how much valuable. Any suggestion is welcome. Thanks Tarun
Re: Joining data using Latitude, Longitude
Are you using SparkSQL for the join? In that case I'm not quite sure you have many options for joining on the nearest co-ordinate. If you are using normal Spark code (by creating a key pair on lat,lon) you can apply logic such as trimming the lat,lon etc. If you need more precise computation then you are better off using the haversine formula. http://www.movable-type.co.uk/scripts/latlong.html
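For reference, a small Scala sketch of the haversine distance mentioned above (great-circle distance in kilometres between two lat/lon points given in degrees; names are illustrative):

  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val R = 6371.0 // mean Earth radius in km
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * R * math.asin(math.sqrt(a))
  }

You could use this inside a custom join or filter step once candidate pairs have been narrowed down by a cheaper bucketing scheme.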
Re: Spark History server default conf values
What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015): // Interval between each check for event log updates private val UPDATE_INTERVAL_MS = conf.getInt(spark.history.fs.updateInterval, conf.getInt(spark.history.updateInterval, 10)) * 1000 private val retainedApplications = conf.getInt(spark.history.retainedApplications, 50) On Tue, Mar 10, 2015 at 12:37 AM Srini Karri skarri@gmail.com wrote: Hi All, What are the default values for the following conf properities if we don't set in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.
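So if nothing is set, those properties fall back to 10 (seconds) and 50 retained applications. To change them, you would put something like this in conf/spark-defaults.conf on the history server host (the values here are only examples):

  spark.history.fs.updateInterval      10
  spark.history.retainedApplications   500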
Re: Spark Streaming testing strategies
I would expect a base trait for testing purposes in the Spark distribution. ManualClock should be exposed as well, along with some documentation on how to configure SBT to avoid problems with multiple Spark contexts. I'm going to create an improvement proposal on the Spark issue tracker about it. Right now I think a package is probably a good place for this to live since the internal Spark testing code is changing/evolving rapidly, but I think once we have the trait fleshed out a bit more we could see if there is enough interest to try and merge it in (just my personal thoughts). You are right, let's wait for community feedback. I'm quite new to Spark, but it is strange to me that good support for isolated testing is not a first-class citizen. In the Spark Programming Guide the chapter about unit testing is no more than three sentences :-( The Spring Framework taught me how important it is to be able to run tests from your IDE, and how important test execution performance is (there are also problems with parallelism under a single JVM). - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark History server default conf values
Thank you Charles and Meethu. On Tue, Mar 10, 2015 at 12:47 AM, Charles Feduke charles.fed...@gmail.com wrote: What I found from a quick search of the Spark source code (from my local snapshot on January 25, 2015): // Interval between each check for event log updates private val UPDATE_INTERVAL_MS = conf.getInt(spark.history.fs.updateInterval, conf.getInt(spark.history.updateInterval, 10)) * 1000 private val retainedApplications = conf.getInt(spark.history.retainedApplications, 50) On Tue, Mar 10, 2015 at 12:37 AM Srini Karri skarri@gmail.com wrote: Hi All, What are the default values for the following conf properities if we don't set in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.
Hadoop Map vs Spark stream Map
Hi, I am trying to understand the Hadoop map method compared to Spark's, and I noticed that Spark's map function only involves 3 types: 1) input value 2) output key 3) output value, however the Hadoop map has 4: 1) input key 2) input value 3) output key 4) output value. Is there any reason it was designed this way? Just trying to understand. Hadoop: public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter) Spark: // Count each word in each batch JavaPairDStream<String, Integer> pairs = words.mapToPair( new PairFunction<String, String, Integer>() { @Override public Tuple2<String, Integer> call(String s) throws Exception { return new Tuple2<String, Integer>(s, 1); } });
Re: Compilation error on JavaPairDStream
works now. I should have checked :) On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen so...@cloudera.com wrote: Ah, that's a typo in the example: use words.mapToPair I can make a little PR to fix that. On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am getting following error. When I look at the sources it seems to be a scala source, but not sure why it's complaining about it. The method map(FunctionString,R) in the type JavaDStreamString is not applicable for the arguments (new PairFunctionString,String,Integer(){}) And my code has been taken from the spark examples site: JavaPairDStreamString, Integer pairs = words.map( new PairFunctionString, String, Integer() { @Override public Tuple2String, Integer call(String s) throws Exception { return new Tuple2String, Integer(s, 1); } });
Re: Spark 1.3 SQL Type Parser Changes?
Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filled a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote: In Spark 1.2 I used to be able to do this: scala org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(structint:bigint) res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell HiveTypeHelper.toDataType(structint:bigint) org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``'' expected but `int' found structint:bigint ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 https://app.relateiq.com/r?c=chrome_gmailurl=https%3A%2F%2Fgist.github.com%2Fnitay%2F460b41ed5fd7608507f5t=AFwhZf262cJFT8YSR54ZotvY2aTmpm_zHTSKNSd4jeT-a6b8q-yMXQ-BqEX9-Ym54J1bkDFiFOXyRKsNxXoDGIh7bhqbBVKsGGq6YTJIfLZxs375XXPdS13KHsE_3Lffk4UIFkRFZ_7c This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide in using SQL queries or Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference between the two approaches, besides using different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
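To make the equivalence concrete, here is the same query written both ways (a sketch; it assumes a SchemaRDD named people with name and age columns, registered as a temp table, and the usual sqlContext implicits imported):

  people.registerTempTable("people")

  // SQL string query
  val adultsSql = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

  // language-integrated (DSL) query on the same SchemaRDD
  val adultsDsl = people.where('age >= 18).select('name, 'age)

Both produce a SchemaRDD and go through the same Catalyst optimizer, which is why the performance is the same.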
Re: Joining data using Latitude, Longitude
There are some techniques you can use if you geohash http://en.wikipedia.org/wiki/Geohash the lat-lngs. They will naturally be sorted by proximity (with some edge cases so watch out). If you go the join route, either by trimming the lat-lngs or geohashing them, you're essentially grouping nearby locations into buckets, but you have to consider the borders of the buckets since the nearest location may actually be in an adjacent bucket. Here's a paper that discusses an implementation: http://www.gdeepak.com/thesisme/Finding%20Nearest%20Location%20with%20open%20box%20query.pdf On Mar 9, 2015, at 11:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest co-ordinate. If you are using the normal Spark code (by creating key-pair on lat,lon) you can apply certain logic like trimming the lat,lon etc. If you want more specific computing then you are better off using haversine formula. http://www.movable-type.co.uk/scripts/latlong.html
RE: Spark SQL Stackoverflow error
import com.google.gson.{GsonBuilder, JsonParser}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/** Examine the collected tweets and train a model based on them. */
object ExamineAndTrain {
  val jsonParser = new JsonParser()
  val gson = new GsonBuilder().setPrettyPrinting().create()

  def main(args: Array[String]) {
    val outputModelDir = "C:\\outputmode111"
    val tweetInput = "C:\\test"
    val numClusters = 10
    val numIterations = 20
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
      .setMaster("local[4]").set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    val tweets = sc.textFile(tweetInput)
    val vectors = tweets.map(Utils.featurize).cache()
    vectors.count() // Calls an action on the RDD to populate the vectors cache.
    val model = KMeans.train(vectors, numClusters, numIterations)
    sc.makeRDD(model.clusterCenters, numClusters).saveAsObjectFile(outputModelDir)
    val some_tweets = tweets.take(2)
    println("Example tweets from the clusters")
    for (i <- 0 until numClusters) {
      println(s"\nCLUSTER $i:")
      some_tweets.foreach { t =>
        if (model.predict(Utils.featurize(t)) == i) {
          println(t)
        }
      }
    }
  }
}

From: lovelylavs [via Apache Spark User List] Sent: Sunday, March 08, 2015 2:34 AM To: Jishnu Menath Prathap (WT01 - BAS) Subject: Re: Spark SQL Stackoverflow error Thank you so much for your reply. If it is possible can you please provide me with the code? Thank you so much. Lavanya.
From: Jishnu Prathap [via Apache Spark User List] Sent: Sunday, March 1, 2015 3:03 AM To: Nadikuda, Lavanya Subject: RE: Spark SQL Stackoverflow error Hi, the issue was not fixed. I removed the SQL layer in between and directly created features from the file. Regards Jishnu Prathap
From: lovelylavs [via Apache Spark User List] Sent: Sunday, March 01, 2015 4:44 AM To: Jishnu Menath Prathap (WT01 - BAS) Subject: Re: Spark SQL Stackoverflow error Hi, how was this issue fixed?
Re: Workaround for spark 1.2.X roaringbitmap kryo problem?
Does anyone know how to get the HighlyCompressedMapStatus to compile? I will try turning off kryo in 1.2.0 and hope things don't break. I want to benefit from the MapOutputTracker fix in 1.2.0. On Tue, Mar 3, 2015 at 5:41 AM, Imran Rashid iras...@cloudera.com wrote: the scala syntax for arrays is Array[T], not T[], so you want to use something: kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) kryo.register(classOf[Array[Short]]) nonetheless, the spark should take care of this itself. I'll look into it later today. On Mon, Mar 2, 2015 at 2:55 PM, Arun Luthra arun.lut...@gmail.com wrote: I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra arun.lut...@gmail.com wrote: Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray]) kryo.register(classOf[org.roaringbitmap.ArrayContainer]) kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus]) kryo.register(classOf[org.roaringbitmap.RoaringArray$Element]) kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]]) kryo.register(classOf[short[]]) in build file: libraryDependencies += org.roaringbitmap % RoaringBitmap % 0.4.8 This fails to compile: ...:53: identifier expected but ']' found. [error] kryo.register(classOf[org.roaringbitmap.RoaringArray$Element[]]) also: :54: identifier expected but ']' found. [error] kryo.register(classOf[short[]]) also: :51: class HighlyCompressedMapStatus in package scheduler cannot be accessed in package org.apache.spark.scheduler [error] kryo.register(classOf[org.apache.spark.scheduler.HighlyCompressedMapStatus]) Suggestions? Arun
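One possible way to sidestep the compile errors above (an untested sketch, not something confirmed in the thread): look the problematic classes up reflectively instead of with classOf, since Class.forName is not affected by Scala's array syntax or by the private[spark] modifier on HighlyCompressedMapStatus at compile time:

  kryo.register(Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"))
  kryo.register(Class.forName("org.roaringbitmap.RoaringArray$Element"))
  kryo.register(Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;")) // array of Element
  kryo.register(classOf[Array[Short]])

Whether this works around SPARK-5949 in practice is something you would have to verify.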
Re: Spark 1.3 SQL Type Parser Changes?
Hi Nitay, Can you try using backticks to quote the column name? Like org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType( struct`int`:bigint)? Thanks, Yin On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust mich...@databricks.com wrote: Thanks for reporting. This was a result of a change to our DDL parser that resulted in types becoming reserved words. I've filled a JIRA and will investigate if this is something we can fix. https://issues.apache.org/jira/browse/SPARK-6250 On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe ni...@actioniq.co wrote: In Spark 1.2 I used to be able to do this: scala org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType(structint:bigint) res30: org.apache.spark.sql.catalyst.types.DataType = StructType(List(StructField(int,LongType,true))) That is, the name of a column can be a keyword like int. This is no longer the case in 1.3: data-pipeline-shell HiveTypeHelper.toDataType(structint:bigint) org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.8] failure: ``'' expected but `int' found structint:bigint ^ at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52) at org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:785) at org.apache.spark.sql.hive.HiveTypeHelper$.toDataType(HiveTypeHelper.scala:9) Note HiveTypeHelper is simply an object I load in to expose HiveMetastoreTypes since it was made private. See https://gist.github.com/nitay/460b41ed5fd7608507f5 https://app.relateiq.com/r?c=chrome_gmailurl=https%3A%2F%2Fgist.github.com%2Fnitay%2F460b41ed5fd7608507f5t=AFwhZf262cJFT8YSR54ZotvY2aTmpm_zHTSKNSd4jeT-a6b8q-yMXQ-BqEX9-Ym54J1bkDFiFOXyRKsNxXoDGIh7bhqbBVKsGGq6YTJIfLZxs375XXPdS13KHsE_3Lffk4UIFkRFZ_7c This is actually a pretty big problem for us as we have a bunch of legacy tables with column names like timestamp. They work fine in 1.2, but now everything throws in 1.3. Any thoughts? Thanks, - Nitay Founder CTO
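In other words (with the quotes and angle brackets that the archive stripped out restored), the suggestion is roughly:

  org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType("struct<`int`:bigint>")

or the equivalent call through the HiveTypeHelper wrapper mentioned above.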
SQL with Spark Streaming
Does Spark Streaming also support SQL? Something like how Esper does CEP.
RE: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Oh, sorry, my bad, currently Spark SQL doesn't provide the user interface for UDAF, but it can work seamlessly with Hive UDAF (via HiveContext). I am also working on the UDAF interface refactoring, after that we can provide the custom interface for extension. https://github.com/apache/spark/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function), not UDTF (user defined table function). I would appreciate it if you could point me to some starting point on UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao hao.ch...@intel.com wrote: Currently, Spark SQL doesn't provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, hopefully we will provide a Hive-independent UDTF soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 5:44 PM To: user@spark.apache.org Subject: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Hi, I need to develop a couple of UDAFs and use them in SparkSQL. While UDFs can be registered as a function in HiveContext, I could not find any documentation of how UDAFs can be registered in the HiveContext. So far what I have found is to make a JAR file out of the developed UDAF class, and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab
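For the Hive route, a UDAF that is already on the classpath can be registered from the HiveContext with plain HiveQL, roughly like this (a sketch; the jar path, class name and function name are made up for illustration):

  // the jar can also be shipped via --jars on spark-submit / spark-shell
  hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
  hiveContext.sql("CREATE TEMPORARY FUNCTION my_sum AS 'com.example.MySumUDAF'")
  hiveContext.sql("SELECT key, my_sum(value) FROM records GROUP BY key")

This still requires building the jar, but the registration itself is programmatic.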
RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
I am not so sure if Hive supports change the metastore after initialized, I guess not. Spark SQL totally rely on Hive Metastore in HiveContext, probably that's why it doesn't work as expected for Q1. BTW, in most of cases, people configure the metastore settings in hive-site.xml, and will not change that since then, is there any reason that you want to change that in runtime? For Q2, probably something wrong in configuration, seems the HDFS run into the pseudo/single node mode, can you double check that? Or can you run the DDL (like create a table) from the spark shell with HiveContext? From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 10, 2015 6:38 PM To: user; d...@spark.apache.org Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse? I'm using Spark 1.3.0 RC3 build with Hive support. In Spark Shell, I want to reuse the HiveContext instance to different warehouse locations. Below are the steps for my test (Assume I have loaded a file into table src). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w) scala sqlContext.sql(SELECT * from src).saveAsTable(table1) scala sqlContext.sql(SET hive.metastore.warehouse.dir=/test/w2) scala sqlContext.sql(SELECT * from src).saveAsTable(table2) == After these steps, the tables are stored in /test/w only. I expect table2 to be stored in /test/w2 folder. Another question is: if I set hive.metastore.warehouse.dir to a HDFS folder, I cannot use saveAsTable()? Is this by design? Exception stack trace is below: == 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74 java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///file:///\\ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:370) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:27) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $iwC$$iwC$$iwC$$iwC.init(console:31) at $iwC$$iwC$$iwC.init(console:33) at $iwC$$iwC.init(console:35) at $iwC.init(console:37) at init(console:39) Thank you very much!
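If the goal is simply a non-default warehouse location, the usual approach is to set it before the HiveContext is created, for example in conf/hive-site.xml on the driver's classpath (a sketch; the path is illustrative):

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://server:8020/test/w</value>
  </property>

Switching the warehouse per table at runtime is, as noted above, not something the Hive metastore is really designed for.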
Numbering RDD members Sequentially
I have a Hadoop InputFormat which reads records and produces a JavaPairRDD<String,String> locatedData where _1() is a formatted version of the file location - like 12690,, 24386 .27523 ... and _2() is the data to be processed. For historical reasons I want to convert _1() into an integer representing the record number, so keys become 0001, 0002 ... (Yes, I know this cannot be done in parallel.) The PairRDD may be too large to collect and work on in the driver, but small enough to handle on a single machine. I could use toLocalIterator to guarantee execution on one machine, but last time I tried this all kinds of jobs were launched to get the next element of the iterator and I was not convinced this approach was efficient. Any bright ideas?
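One option worth looking at (a sketch, in Scala for brevity; the same zipWithIndex/zipWithUniqueId methods exist on JavaRDD/JavaPairRDD): RDD.zipWithIndex numbers the elements in their current partition order without collecting anything to the driver, at the cost of one extra small job to compute per-partition offsets:

  // assumes locatedData: RDD[(String, String)] and that sorting by the
  // existing location key gives the record order you want
  val renumbered = locatedData
    .sortByKey()
    .zipWithIndex()                               // ((locationKey, value), index)
    .map { case ((_, value), index) => (f"$index%06d", value) }

If the numbers only need to be unique rather than consecutive, zipWithUniqueId avoids the extra job entirely.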
RE: SQL with Spark Streaming
Intel has a prototype for doing this, SaiSai and Jason are the authors. Probably you can ask them for some materials. From: Mohit Anchlia [mailto:mohitanch...@gmail.com] Sent: Wednesday, March 11, 2015 8:12 AM To: user@spark.apache.org Subject: SQL with Spark Streaming Does Spark Streaming also supports SQLs? Something like how Esper does CEP.
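Apart from that prototype, a common pattern is to run Spark SQL per micro-batch from inside foreachRDD; a rough sketch (Spark 1.2-style API; lines is assumed to be a DStream[String] of comma-separated records):

  case class Event(user: String, amount: Double)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD

  lines.foreachRDD { rdd =>
    val events = rdd.map { l => val p = l.split(","); Event(p(0), p(1).toDouble) }
    events.registerTempTable("events")
    sqlContext.sql("SELECT user, SUM(amount) FROM events GROUP BY user")
      .collect().foreach(println)
  }

Note this gives you per-batch SQL, not the window/pattern semantics of a CEP engine like Esper; windowed computations would still be expressed with the DStream API.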