Re: orc read issue in spark

2015-11-18 Thread Reynold Xin
What do you mean by starts delay scheduling? Are you saying it is no longer
doing local reads?

If that's the case you can increase the locality wait timeout (spark.locality.wait).
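
For illustration, a minimal sketch of raising the scheduler's locality wait, assuming the relevant setting is spark.locality.wait (the values shown are only examples, not recommendations):

import org.apache.spark.SparkConf

// Hedged sketch: spark.locality.wait controls how long the scheduler waits
// for a data-local slot before falling back to less-local reads; raising it
// delays that fallback. Values below are illustrative only.
val conf = new SparkConf()
  .set("spark.locality.wait", "10s")        // default is 3s
  .set("spark.locality.wait.node", "10s")   // optional per-level override

// The same setting can also be passed on the command line:
//   spark-submit --conf spark.locality.wait=10s ...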

On Wednesday, November 18, 2015, Renu Yadav  wrote:

> Hi ,
> I am using spark 1.4.1 and saving orc file using
> df.write.format("orc").save("outputlocation")
>
> output location size: 440GB
>
> and while reading df.read.format("orc").load("outputlocation").count
>
>
> It has 2618 partitions.
> The count operation runs fine up to 2500 partitions but delay scheduling
> kicks in after that, which results in slow performance.
>
> *If anyone has any idea on this, please do reply, as I need this very
> urgently.*
>
> Thanks in advance
>
>
> Regards,
> Renu Yadav
>
>
>


Re: ISDATE Function

2015-11-18 Thread Ruslan Dautkhanov
You could write your own UDF isdate().
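
A minimal sketch of such a UDF, assuming dates arrive as "yyyy-MM-dd" strings and that sqlContext is available (e.g. in spark-shell); adjust the pattern to whatever formats you expect:

import java.text.SimpleDateFormat

// Hand-rolled isdate(): returns true only if the string parses strictly
// against the expected pattern.
val isDate = (s: String) => {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  fmt.setLenient(false)
  try {
    s != null && { fmt.parse(s); true }
  } catch {
    case _: java.text.ParseException => false
  }
}

sqlContext.udf.register("isdate", isDate)
// Usage: sqlContext.sql("SELECT col, isdate(col) FROM tempTable")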



-- 
Ruslan Dautkhanov

On Tue, Nov 17, 2015 at 11:25 PM, Ravisankar Mani  wrote:

> Hi Ted Yu,
>
> Thanks for your response. Is there any other way to achieve this in a Spark query?
>
>
> Regards,
> Ravi
>
> On Tue, Nov 17, 2015 at 10:26 AM, Ted Yu  wrote:
>
>> ISDATE() is currently not supported.
>> Since it is SQL Server specific, I guess it wouldn't be added to Spark.
>>
>> On Mon, Nov 16, 2015 at 10:46 PM, Ravisankar Mani 
>> wrote:
>>
>>> Hi Everyone,
>>>
>>>
>>>  In MSSQL Server, the "ISDATE()" function is used to find out whether
>>> the current column values are dates or not. Is it possible to achieve
>>> the same check on column values in Spark?
>>>
>>>
>>> Regards,
>>> Ravi
>>>
>>
>>
>


Re: how can I evenly distribute my records across all partitions

2015-11-18 Thread prateek arora
Hi
Thanks for the help.
In my case ...
I want to perform an operation on 30 records per second using Spark Streaming.
The difference between the keys of the records is around 33-34 ms, and my RDD
that holds the 30 records already has 4 partitions.
Right now my algorithm takes around 400 ms to perform the operation on 1 record,
so I want to distribute my records evenly so that every executor performs the
operation on only one record and my 1-second batch completes without delay.
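
In case it helps, here is a minimal sketch of a custom partitioner in the spirit of the mod-based suggestion below, assuming the keys are Long timestamps roughly 33 ms apart (the class name and bucket width are illustrative):

import org.apache.spark.Partitioner

// Spread records with timestamp keys across numParts partitions so that
// (ideally) each record in a 1-second batch lands in its own partition.
class EvenSpreadPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val ts = key.asInstanceOf[Long]
    // bucket the timestamp by the ~33 ms gap, then wrap into [0, numParts)
    (((ts / 33) % numParts).toInt + numParts) % numParts
  }
}

// Usage, assuming an RDD[(Long, Array[Byte])] of (timestamp, JPEG bytes):
//   val spread = rdd.partitionBy(new EvenSpreadPartitioner(30))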


On Tue, Nov 17, 2015 at 7:50 PM, Sonal Goyal  wrote:

> Think about how you want to distribute your data and how your keys are
> spread currently. Do you want to compute something per day, per week etc.
> Based on that, return a partition number. You could use mod 30 or some such
> function to get the partitions.
> On Nov 18, 2015 5:17 AM, "prateek arora" 
> wrote:
>
>> Hi
>> I am trying to implement custom partitioner using this link
>> http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where
>> ( in link example key value is from 0 to (noOfElement - 1))
>>
>> but I am not able to understand how to implement the custom partitioner in
>> my case:
>>
>> my parent RDD has 4 partitions, the RDD key is a timestamp, and the value
>> is a JPEG byte array.
>>
>>
>> Regards
>> Prateek
>>
>>
>> On Tue, Nov 17, 2015 at 9:28 AM, Ted Yu  wrote:
>>
>>> Please take a look at the following for example:
>>>
>>> ./core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala
>>> ./core/src/main/scala/org/apache/spark/Partitioner.scala
>>>
>>> Cheers
>>>
>>> On Tue, Nov 17, 2015 at 9:24 AM, prateek arora <
>>> prateek.arora...@gmail.com> wrote:
>>>
 Hi
 Thanks
 I am new to Spark development, so could you provide some help with writing
 a custom partitioner to achieve this?
 If you have a link or an example of writing a custom partitioner, please
 share it with me.

 On Mon, Nov 16, 2015 at 6:13 PM, Sabarish Sasidharan <
 sabarish.sasidha...@manthan.com> wrote:

> You can write your own custom partitioner to achieve this
>
> Regards
> Sab
> On 17-Nov-2015 1:11 am, "prateek arora" 
> wrote:
>
>> Hi
>>
>> I have an RDD with 30 records (key/value pairs) and am running 30
>> executors. I want to repartition this RDD into 30 partitions so that every
>> partition gets one record and is assigned to one executor.
>>
>> When I use rdd.repartition(30) it repartitions my RDD into 30 partitions,
>> but some partitions get 2 records, some get 1 record, and some do not get
>> any record.
>>
>> Is there any way in Spark to evenly distribute my records across all
>> partitions?
>>
>> Regards
>> Prateek
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-can-evenly-distribute-my-records-in-all-partition-tp25394.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>

>>>
>>


Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread Ted Yu
Interesting.

I will be watching your PR.

On Wed, Nov 18, 2015 at 7:51 AM, 임정택  wrote:

> Ted,
>
> I suspect I hit the issue
> https://issues.apache.org/jira/browse/SPARK-11818
> Could you take a look at the issue and verify that it makes sense?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 2015-11-18 20:32 GMT+09:00 Ted Yu :
>
>> Here is related code:
>>
>>   private static void checkDefaultsVersion(Configuration conf) {
>>
>> if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE))
>> return;
>>
>> String defaultsVersion = conf.get("hbase.defaults.for.version");
>>
>> String thisVersion = VersionInfo.getVersion();
>>
>> if (!thisVersion.equals(defaultsVersion)) {
>>
>>   throw new RuntimeException(
>>
>> "hbase-default.xml file seems to be for an older version of
>> HBase (" +
>>
>> defaultsVersion + "), this version is " + thisVersion);
>>
>> null means that "hbase.defaults.for.version" was not set in the other
>> hbase-default.xml
>>
>> Can you retrieve the classpath of Spark task so that we can have more
>> clue ?
>>
>>
>> Cheers
>>
>> On Tue, Nov 17, 2015 at 10:06 PM, 임정택  wrote:
>>
>>> Ted,
>>>
>>> Thanks for the reply.
>>>
>>> My fat jar's only Spark-related dependency is spark-core, marked as
>>> "provided".
>>> It seems like Spark only adds hbase-common 0.98.7-hadoop2 in the
>>> spark-example module.
>>>
>>> And if there're two hbase-default.xml in the classpath, should one of
>>> them be loaded, instead of showing (null)?
>>>
>>> Best,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>
>>> 2015-11-18 13:50 GMT+09:00 Ted Yu :
>>>
 Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
 and another for 0.98.7-hadoop2 (used by Spark)

 You can specify hbase.defaults.for.version.skip as true in your
 hbase-site.xml

 Cheers

 On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:

> Hi all,
>
> I'm evaluating zeppelin to run driver which interacts with HBase.
> I use fat jar to include HBase dependencies, and see failures on
> executor level.
> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>
> I loaded fat jar via --jars option,
>
> > ./bin/spark-shell --jars hbase-included-assembled.jar
>
> and run driver code using provided SparkContext instance, and see
> failures from spark-shell console and executor logs.
>
> below is stack traces,
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 55 in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in 
> stage 0.0 (TID 281, ): java.lang.NoClassDefFoundError: 
> Could not initialize class 
> org.apache.hadoop.hbase.client.HConnectionManager
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
> at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> 

Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
TD, thank you for your reply.

I agree on data store requirement. I am using HBase as an underlying store.

So for every batch interval of say 10 seconds

- Calculate the time dimension ( minutes, hours, day, week, month and quarter ) 
along with other dimensions and metrics
- Update relevant base table at each batch interval for relevant metrics for a 
given set of dimensions.

The only caveat I see is that I'll have to update each of the different roll-up 
tables for each batch window.

Is this a valid approach for calculating time series aggregation?

Regards
SM

For minute-level aggregates I have set up a streaming window of, say, 10 seconds 
and am storing minute-level aggregates across multiple dimensions in HBase at 
every window interval. 
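
A minimal sketch of that per-batch roll-up, assuming events arrive as "dimension,epochMillis,metric" lines on a socket (the HBase writer is left as a placeholder; names and input source are illustrative only):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TimeseriesRollup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rollup")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed input: "dimension,epochMillis,metric" lines
    val events = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(dim, ts, v) = line.split(",")
      (dim, ts.toLong, v.toDouble)
    }

    // Aggregate each 10-second batch per (dimension, minute bucket) ...
    val perMinute = events
      .map { case (dim, ts, v) => ((dim, ts / 60000 * 60000), v) }
      .reduceByKey(_ + _)

    // ... and push the partial sums into the store; the higher-level
    // roll-ups (hour/day/week) would be incremented the same way.
    perMinute.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        part.foreach { case ((dim, minute), sum) =>
          println(s"$dim@$minute -> $sum")   // replace with an HBase increment
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}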

> On 18-Nov-2015, at 7:45 AM, Tathagata Das  wrote:
> 
> For this sort of long term aggregations you should use a dedicated data 
> storage systems. Like a database, or a key-value store. Spark Streaming would 
> just aggregate and push the necessary data to the data store. 
> 
> TD
> 
> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta  > wrote:
> Hi,
> 
> I am working on requirement of calculating real time metrics and building 
> prototype  on Spark streaming. I need to build aggregate at Seconds, Minutes, 
> Hours and Day level.
> 
> I am not sure whether I should calculate all these aggregates as  different 
> Windowed function on input DStream or shall I use updateStateByKey function 
> for the same. If I have to use updateStateByKey for these time series 
> aggregation, how can I remove keys from the state after different time lapsed?
> 
> Please suggest.
> 
> Regards
> SM
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread 임정택
Ted,

I suspect I hit the issue https://issues.apache.org/jira/browse/SPARK-11818
Could you take a look at the issue and verify that it makes sense?

Thanks,
Jungtaek Lim (HeartSaVioR)

2015-11-18 20:32 GMT+09:00 Ted Yu :

> Here is related code:
>
>   private static void checkDefaultsVersion(Configuration conf) {
>
> if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE))
> return;
>
> String defaultsVersion = conf.get("hbase.defaults.for.version");
>
> String thisVersion = VersionInfo.getVersion();
>
> if (!thisVersion.equals(defaultsVersion)) {
>
>   throw new RuntimeException(
>
> "hbase-default.xml file seems to be for an older version of HBase
> (" +
>
> defaultsVersion + "), this version is " + thisVersion);
>
> null means that "hbase.defaults.for.version" was not set in the other
> hbase-default.xml
>
> Can you retrieve the classpath of Spark task so that we can have more clue
> ?
>
>
> Cheers
>
> On Tue, Nov 17, 2015 at 10:06 PM, 임정택  wrote:
>
>> Ted,
>>
>> Thanks for the reply.
>>
>> My fat jar has dependency with spark related library to only spark-core
>> as "provided".
>> Seems like Spark only adds 0.98.7-hadoop2 of hbase-common in
>> spark-example module.
>>
>> And if there're two hbase-default.xml in the classpath, should one of
>> them be loaded, instead of showing (null)?
>>
>> Best,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>>
>> 2015-11-18 13:50 GMT+09:00 Ted Yu :
>>
>>> Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
>>> and another for 0.98.7-hadoop2 (used by Spark)
>>>
>>> You can specify hbase.defaults.for.version.skip as true in your
>>> hbase-site.xml
>>>
>>> Cheers
>>>
>>> On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:
>>>
 Hi all,

 I'm evaluating zeppelin to run driver which interacts with HBase.
 I use fat jar to include HBase dependencies, and see failures on
 executor level.
 I thought it is zeppelin's issue, but it fails on spark-shell, too.

 I loaded fat jar via --jars option,

 > ./bin/spark-shell --jars hbase-included-assembled.jar

 and run driver code using provided SparkContext instance, and see
 failures from spark-shell console and executor logs.

 below is stack traces,

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
 in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
 initialize class org.apache.hadoop.hbase.client.HConnectionManager
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
 at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
 at 
 org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
   

Unable to load native-hadoop library for your platform - already loaded in another classloader

2015-11-18 Thread Deenar Toraskar
Hi

I want to make sure we use short-circuit local reads for performance. I
have set LD_LIBRARY_PATH correctly and confirmed that the native libraries
match our platform (i.e. they are 64-bit and load successfully). When I
start Spark, I get the following message after increasing the logging level
for the relevant classes.

*15/11/18 17:47:23 DEBUG NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: Native Library
/usr/hdp/2.3.2.0-2950/hadoop/lib/native/libhadoop.so.1.0.0 already loaded
in another classloader*

Any idea what might be causing this and how to resolve it?

Regards
Deenar

[spark@edgenode1 spark-1.5.2-bin-hadoop2.6]$ bin/spark-shell
15/11/18 17:46:56 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...
15/11/18 17:46:56 DEBUG NativeCodeLoader: Loaded the native-hadoop library
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/18 17:47:00 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
15/11/18 17:47:00 WARN Utils: Service 'SparkUI' could not bind on port
4041. Attempting port 4042.
15/11/18 17:47:00 WARN MetricsSystem: Using default name DAGScheduler for
source because spark.app.id is not set.
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
Spark context available as sc.
15/11/18 17:47:23 DEBUG NativeCodeLoader: Trying to load the custom-built
native-hadoop library...
*15/11/18 17:47:23 DEBUG NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: Native Library
/usr/hdp/2.3.2.0-2950/hadoop/lib/native/libhadoop.so.1.0.0 already loaded
in another classloader*
15/11/18 17:47:23 DEBUG NativeCodeLoader:
java.library.path=/usr/hdp/2.3.2.0-2950/hadoop/lib/native/:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
15/11/18 17:47:23 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/11/18 17:47:24 WARN DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.


*Think Reactive Ltd*
deenar.toras...@thinkreactive.co.uk
07714140812


Unable to import SharedSparkContext

2015-11-18 Thread njoshi
Hi,

Doesn't *SharedSparkContext* come with spark-core? Do I need to include any
special package in the library dependencies for using SharedSparkContext? 

I am trying to write a test suite similar to the *LogisticRegressionSuite*
test in Spark ML. Unfortunately, I am unable to import any of the
following packages:

import org.apache.spark.SparkFunSuite
import org.apache.spark.ml.param.ParamsSuite
import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
import org.apache.spark.mllib.util.MLlibTestSparkContext
import org.apache.spark.mllib.util.TestingUtils._

Thanks in advance,
Nikhil



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-import-SharedSparkContext-tp25419.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



newbie simple app, small data set: Py4JJavaError java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-18 Thread Andy Davidson
Hi

I am working on a spark POC. I created a ec2 cluster on AWS using
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2

Below is a simple Python program. I am running it using an IPython notebook. The
notebook server is running on my Spark master. If I run my program more than
once using my large data set, I get the GC OutOfMemory error. I run it
each time by "re-running the notebook cell". I can run my smaller set a
couple of times without problems.

I launch using

pyspark --master $MASTER_URL --total-executor-cores 2


Any idea how I can debug this?

Kind regards

Andy

Using the master console I see
* there is only one app run (this is what I expect)
* There are 2 works each on a different slave (this is what I expect)
* Each worker is using 1 core (this is what I expect)
* Each worker's memory usage is 6154 (seems reasonable)

* Alive Workers: 3
* Cores in use: 6 Total, 2 Used
* Memory in use: 18.8 GB Total, 12.0 GB Used
* Applications: 1 Running, 5 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE

The data file I am working with is small

I collected this data using the Spark Streaming Twitter utilities. All I do is
capture tweets, convert them to JSON, and store them as strings in HDFS.

$ hadoop fs -count  hdfs:///largeSample hdfs:///smallSample

   13226   240098 3839228100

   1   228156   39689877


My python code. I am using python 3.4.

import json
import datetime

startTime = datetime.datetime.now()

#dataURL = "hdfs:///largeSample"
dataURL = "hdfs:///smallSample"
tweetStrings = sc.textFile(dataURL)

t2 = tweetStrings.take(2)
print (t2[1])
print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

---
Py4JJavaError Traceback (most recent call last)
 in ()
  8 tweetStrings = sc.textFile(dataURL)
  9 
---> 10 t2 = tweetStrings.take(2)
 11 print (t2[1])
 12 print("\n\nelapsed time:%s" % (datetime.datetime.now() - startTime))

/root/spark/python/pyspark/rdd.py in take(self, num)
   1267 """
   1268 items = []
-> 1269 totalParts = self.getNumPartitions()
   1270 partsScanned = 0
   1271 

/root/spark/python/pyspark/rdd.py in getNumPartitions(self)
354 2
355 """
--> 356 return self._jrdd.partitions().size()
357 
358 def filter(self, f):

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
__call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 34 def deco(*a, **kw):
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:
 38 s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o65.partitions.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.(String.java:207)
at java.lang.String.substring(String.java:1969)
at java.net.URI$Parser.substring(URI.java:2869)
at java.net.URI$Parser.parseHierarchical(URI.java:3106)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:145)
at org.apache.hadoop.fs.Path.(Path.java:71)
at org.apache.hadoop.fs.Path.(Path.java:50)
at 
org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.ja
va:215)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSy
stem.java:293)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSyste
m.java:352)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:862)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:887)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:185
)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at 

Re: Unable to import SharedSparkContext

2015-11-18 Thread Sourigna Phetsarath
Nikhil,

Please take a look at: https://github.com/holdenk/spark-testing-base
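
For illustration, a minimal ScalaTest sketch using that project's SharedSparkContext trait; the artifact coordinates and version below are assumptions, so check the project's README for the ones matching your Spark version:

// build.sbt (version string is an assumption):
//   libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.5.1_0.2.1" % "test"

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite with SharedSparkContext {
  test("counts words using the shared context") {
    // sc is provided and managed by the SharedSparkContext trait
    val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
    assert(counts("a") == 2L)
  }
}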

On Wed, Nov 18, 2015 at 2:12 PM, Marcelo Vanzin  wrote:

> On Wed, Nov 18, 2015 at 11:08 AM, njoshi  wrote:
> > Doesn't *SharedSparkContext* come with spark-core? Do I need to include
> any
> > special package in the library dependancies for using SharedSparkContext?
>
> That's a test class. It's not part of the Spark API.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 


*Gna Phetsarath*System Architect // AOL Platforms // Data Services //
Applied Research Chapter
770 Broadway, 5th Floor, New York, NY 10003
o: 212.402.4871 // m: 917.373.7363
vvmr: 8890237 aim: sphetsarath20 t: @sourigna

* *


DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Eran Medan
I understand that the following are equivalent

df.filter('account === "acct1")

sql("select * from tempTableName where account = 'acct1'")


But is Spark SQL "smart" enough to also push filter predicates down for the
initial load?

e.g.
sqlContext.read.jdbc(…).filter('account=== "acct1")

Is Spark "smart enough" to this for each partition?

   ‘select … where account= ‘acc1’ AND (partition where clause here)?

Or do I have to put it in each partition's where clause; otherwise, will it
load the entire set and only then filter it in memory?
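
For context, a sketch of the two options being compared, written spark-shell style against the Spark 1.4/1.5-era DataFrameReader API (the connection string, table name and partitioning column are made up):

import java.util.Properties
import sqlContext.implicits._   // assumes a spark-shell style sqlContext

// Option 1: load the table and filter afterwards, relying on pushdown
val df1 = sqlContext.read
  .jdbc("jdbc:postgresql://host/db", "accounts", new Properties())
  .filter('account === "acct1")

// Option 2: spell the predicate out in each partition's WHERE clause
val predicates = Array(
  "account = 'acct1' AND id % 4 = 0",
  "account = 'acct1' AND id % 4 = 1",
  "account = 'acct1' AND id % 4 = 2",
  "account = 'acct1' AND id % 4 = 3")
val df2 = sqlContext.read
  .jdbc("jdbc:postgresql://host/db", "accounts", predicates, new Properties())

// df1.explain(true) / df2.explain(true) should show whether the filter
// reaches the JDBC relation as a pushed-down predicate.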



Re: Unable to import SharedSparkContext

2015-11-18 Thread Sourigna Phetsarath
Plus this article:
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/

On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath <
gna.phetsar...@teamaol.com> wrote:

> Nikhil,
>
> Please take a look at: https://github.com/holdenk/spark-testing-base
>
> On Wed, Nov 18, 2015 at 2:12 PM, Marcelo Vanzin 
> wrote:
>
>> On Wed, Nov 18, 2015 at 11:08 AM, njoshi 
>> wrote:
>> > Doesn't *SharedSparkContext* come with spark-core? Do I need to include
>> any
>> > special package in the library dependancies for using
>> SharedSparkContext?
>>
>> That's a test class. It's not part of the Spark API.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
>
>
> *Gna Phetsarath*System Architect // AOL Platforms // Data Services //
> Applied Research Chapter
> 770 Broadway, 5th Floor, New York, NY 10003
> o: 212.402.4871 // m: 917.373.7363
> vvmr: 8890237 aim: sphetsarath20 t: @sourigna
>
> * *
>



-- 


*Gna Phetsarath*System Architect // AOL Platforms // Data Services //
Applied Research Chapter
770 Broadway, 5th Floor, New York, NY 10003
o: 212.402.4871 // m: 917.373.7363
vvmr: 8890237 aim: sphetsarath20 t: @sourigna

* *


GraphX stopped without finishing and with no ERRORs !

2015-11-18 Thread Khaled Ammar
Hi all,

I have a problem running some algorithms on GraphX. Occasionally, it
stops running without any errors. The task state is FINISHED, but the
executors' state is KILLED. However, I can see that one job is not finished
yet. It took too much time (minutes), while every other job/iteration
typically finished in a few seconds.

I am using Spark 1.5.2. I would appreciate any suggestions or recommendations
on how to fix this or investigate it further.

-- 
Thanks,
-Khaled


Apache Groovy and Spark

2015-11-18 Thread tog
Hi

I started playing with both Apache projects and quickly got the exception below.
Can anyone give me a hint about the problem so that I can dig
further?
It seems to be a problem with Spark loading some of the Groovy classes ...

Any idea?
Thanks
Guillaume


tog GroovySpark $ $GROOVY_HOME/bin/groovy
GroovySparkThroughGroovyShell.groovy

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage
0.0 (TID 1, localhost): java.lang.ClassNotFoundException:
Script1$_run_closure1

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)

at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)

at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)

at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)

at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at 

Re: Unable to import SharedSparkContext

2015-11-18 Thread Nikhil Joshi
Thanks Marcelo and Sourigna. I see that SharedSparkContext is part of
Spark, but it is included under the test package of spark-core. That caused
the trouble :(.

On Wed, Nov 18, 2015 at 11:26 AM, Sourigna Phetsarath <
gna.phetsar...@teamaol.com> wrote:

> Plus this article:
> http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
>
> On Wed, Nov 18, 2015 at 2:25 PM, Sourigna Phetsarath <
> gna.phetsar...@teamaol.com> wrote:
>
>> Nikhil,
>>
>> Please take a look at: https://github.com/holdenk/spark-testing-base
>>
>> On Wed, Nov 18, 2015 at 2:12 PM, Marcelo Vanzin 
>> wrote:
>>
>>> On Wed, Nov 18, 2015 at 11:08 AM, njoshi 
>>> wrote:
>>> > Doesn't *SharedSparkContext* come with spark-core? Do I need to
>>> include any
>>> > special package in the library dependancies for using
>>> SharedSparkContext?
>>>
>>> That's a test class. It's not part of the Spark API.
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>>
>> *Gna Phetsarath*System Architect // AOL Platforms // Data Services //
>> Applied Research Chapter
>> 770 Broadway, 5th Floor, New York, NY 10003
>> o: 212.402.4871 // m: 917.373.7363
>> vvmr: 8890237 aim: sphetsarath20 t: @sourigna
>>
>> * *
>>
>
>
>
> --
>
>
> *Gna Phetsarath*System Architect // AOL Platforms // Data Services //
> Applied Research Chapter
> 770 Broadway, 5th Floor, New York, NY 10003
> o: 212.402.4871 // m: 917.373.7363
> vvmr: 8890237 aim: sphetsarath20 t: @sourigna
>
> * *
>



-- 

*Nikhil Joshi*Princ Data Scientist
*Aol*PLATFORMS.
*395 Page Mill Rd, *Palo Alto
, CA
 94306-2024
vvmr: 8894737


Re: How to disable SparkUI programmatically?

2015-11-18 Thread Ted Yu
Refer to core/src/test/scala/org/apache/spark/ui/UISuite.scala , around
line 41:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
  .set("spark.ui.enabled", "true")
Cheers
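
For completeness, a minimal sketch of disabling the UI programmatically (the inverse of the test snippet above), applied before the context is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("no-ui-example")
  .set("spark.ui.enabled", "false")   // must be set before the SparkContext starts
val sc = new SparkContext(conf)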

On Wed, Nov 18, 2015 at 3:05 AM, Ted Yu  wrote:

> You can set spark.ui.enabled config parameter to false.
>
> Cheers
>
> On Nov 18, 2015, at 1:29 AM, Alex Luya  wrote:
>
> I noticed that the bug below has been fixed:
>
>  https://issues.apache.org/jira/browse/SPARK-2100
>
> but how do I do it (I mean disabling the SparkUI) programmatically?
>
> is it by sparkContext.setLocalProperty(?,?)?
>
> and I checked the link below, but can't figure out which property to set
>
> http://localhost/spark/docs/configuration.html#spark-ui
>
>


Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
Hi,

Thanks for the suggestion -- but those classpath config options only affect 
the driver and executor processes, not the standalone-mode daemons (master 
and slave). Incidentally, we already have the extra jars we need set there.

I went through the docs but couldn't find a place to set extra classpath for 
the daemons. 

M

> On Nov 18, 2015, at 1:19 AM, "memorypr...@gmail.com"  
> wrote:
> 
> Have you tried using 
> spark.driver.extraClassPath
> and 
> spark.executor.extraClassPath
> 
> ?
> 
> AFAICT these config options replace SPARK_CLASSPATH. Further info in the 
> docs. I've had good luck with these options, and for ease of use I just set 
> them in the spark defaults config.
> 
> https://spark.apache.org/docs/latest/configuration.html
> 
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos  wrote:
>> Hi,
>> 
>> We are running a Spark Standalone cluster on EMR (note: not using YARN) and 
>> are trying to use S3 w/ EmrFS as our event logging directory.
>> 
>> We are having difficulties with a ClassNotFoundException on EmrFileSystem 
>> when we navigate to the event log screen. This is to be expected as the 
>> EmrFs jars are not on the classpath.
>> 
>> But -- I have not been able to figure out a way to add additional classpath 
>> jars to the start-up of the Master daemon. SPARK_CLASSPATH has been 
>> deprecated, and looking around at spark-class, etc.. everything seems to be 
>> pretty locked down. 
>> 
>> Do I have to shove everything into the assembly jar?
>> 
>> Am I missing a simple way to add classpath to the daemons?
>> 
>> thanks,
>> Michal


Re: How to disable SparkUI programmatically?

2015-11-18 Thread Ted Yu
You can set spark.ui.enabled config parameter to false. 

Cheers

> On Nov 18, 2015, at 1:29 AM, Alex Luya  wrote:
> 
> I noticed that the bug below has been fixed:
> 
>  https://issues.apache.org/jira/browse/SPARK-2100
> but how do I do it (I mean disabling the SparkUI) programmatically?
> 
> is it by sparkContext.setLocalProperty(?,?)?
> 
> and I checked the link below, but can't figure out which property to set
> 
> http://localhost/spark/docs/configuration.html#spark-ui


Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hey,

I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound 
exception with shuffle.index files? It’s been cropping up with very large joins 
and aggregations, and causing all of our jobs to fail towards the end. The 
memory limit for the executors (we’re running on mesos) is touching 60GB+ with 
~10 cores per executor, which is way oversubscribed.

We’re running spark inside containers, and have configured 
“spark.executorEnv.SPARK_LOCAL_DIRS” to a special directory path in the 
container for performance/disk reasons, and since then the issue started to 
arise. I’m wondering if there’s a bug with the way spark looks for shuffle 
files, and one of the implementations isn’t obeying the path properly?

I don’t want to set "spark.local.dir” because that requires the driver also 
have this directory set up, which is not the case.

Has anyone seen this issue before?



15/11/18 11:32:27 ERROR storage.ShuffleBlockFetcherIterator: Failed to get 
block(s) from XXX:50777
java.lang.RuntimeException: java.io.FileNotFoundException: 
/mnt/mesos/sandbox/spark-tmp/blockmgr-7a5b85c8-7c5f-43c2-b9e2-2bc0ffdb902d/23/shuffle_2_94_0.index
 (No such file or directory)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.(FileInputStream.java:146)
   at 
org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:98)
   at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:300)
   at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
   at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
   at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
   at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
   at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
   at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
   at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
   at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
   at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
   at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
   at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
   at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
   at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
   at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
   at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
   at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
   at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
   at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
   at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
   at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
   at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
   at java.lang.Thread.run(Thread.java:745)

   at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162)
   at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
   at 

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Romi Kuntsman
Take the executor memory times spark.shuffle.memoryFraction,
and divide the data so that each partition is smaller than that.
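
As a hedged back-of-the-envelope illustration (the default fraction and sizes are assumed, not measured):

// executor memory from the thread above; spark.shuffle.memoryFraction default assumed
val executorMemoryGb = 60.0
val shuffleMemoryFraction = 0.2                                  // Spark 1.5 default
val maxPartitionGb = executorMemoryGb * shuffleMemoryFraction    // ~12 GB

// pick numPartitions so each partition stays well under that bound, e.g.
//   val repartitioned = rdd.repartition(200)   // a few hundred MB per partition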

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Wed, Nov 18, 2015 at 2:09 PM, Tom Arnfeld  wrote:

> Hi Romi,
>
> Thanks! Could you give me an indication of how much increase the
> partitions by? We’ll take a stab in the dark, the input data is around 5M
> records (though each record is fairly small). We’ve had trouble both with
> DataFrames and RDDs.
>
> Tom.
>
> On 18 Nov 2015, at 12:04, Romi Kuntsman  wrote:
>
> I had many issues with shuffles (but not this one exactly), and what
> eventually solved it was to repartition the input into more parts. Have you
> tried that?
>
> P.S. not sure if related, but there's a memory leak in the shuffle
> mechanism
> https://issues.apache.org/jira/browse/SPARK-11293
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Wed, Nov 18, 2015 at 2:00 PM, Tom Arnfeld  wrote:
>
>> Hey,
>>
>> I’m wondering if anyone has run into issues with Spark 1.5 and a
>> FileNotFound exception with shuffle.index files? It’s been cropping up with
>> very large joins and aggregations, and causing all of our jobs to fail
>> towards the end. The memory limit for the executors (we’re running on
>> mesos) is touching 60GB+ with ~10 cores per executor, which is way
>> oversubscribed.
>>
>> We’re running spark inside containers, and have configured
>> “spark.executorEnv.SPARK_LOCAL_DIRS” to a special directory path in the
>> container for performance/disk reasons, and since then the issue started to
>> arise. I’m wondering if there’s a bug with the way spark looks for shuffle
>> files, and one of the implementations isn’t obeying the path properly?
>>
>> I don’t want to set "spark.local.dir” because that requires the driver
>> also have this directory set up, which is not the case.
>>
>> Has anyone seen this issue before?
>>
>> 
>>
>> 15/11/18 11:32:27 ERROR storage.ShuffleBlockFetcherIterator: Failed to
>> get block(s) from XXX:50777
>> java.lang.RuntimeException: java.io.FileNotFoundException:
>> /mnt/mesos/sandbox/spark-tmp/blockmgr-7a5b85c8-7c5f-43c2-b9e2-2bc0ffdb902d/23/shuffle_2_94_0.index
>> (No such file or directory)
>>at java.io.FileInputStream.open(Native Method)
>>at java.io.FileInputStream.(FileInputStream.java:146)
>>at
>> org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:98)
>>at
>> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:300)
>>at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>>at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>>at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>at
>> org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
>>at
>> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
>>at
>> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
>>at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>>at
>> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>>at
>> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>at
>> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>at
>> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>>at
>> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>>at
>> 

Re: zeppelin (or spark-shell) with HBase fails on executor level

2015-11-18 Thread Ted Yu
Here is related code:

  private static void checkDefaultsVersion(Configuration conf) {
    if (conf.getBoolean("hbase.defaults.for.version.skip", Boolean.FALSE)) return;
    String defaultsVersion = conf.get("hbase.defaults.for.version");
    String thisVersion = VersionInfo.getVersion();
    if (!thisVersion.equals(defaultsVersion)) {
      throw new RuntimeException(
        "hbase-default.xml file seems to be for an older version of HBase (" +
        defaultsVersion + "), this version is " + thisVersion);
    }
  }

null means that "hbase.defaults.for.version" was not set in the other
hbase-default.xml

Can you retrieve the classpath of the Spark task so that we can have more clues?


Cheers

On Tue, Nov 17, 2015 at 10:06 PM, 임정택  wrote:

> Ted,
>
> Thanks for the reply.
>
> My fat jar has dependency with spark related library to only spark-core as
> "provided".
> Seems like Spark only adds 0.98.7-hadoop2 of hbase-common in spark-example
> module.
>
> And if there're two hbase-default.xml in the classpath, should one of them
> be loaded, instead of showing (null)?
>
> Best,
> Jungtaek Lim (HeartSaVioR)
>
>
>
> 2015-11-18 13:50 GMT+09:00 Ted Yu :
>
>> Looks like there're two hbase-default.xml in the classpath: one for 0.98.6
>> and another for 0.98.7-hadoop2 (used by Spark)
>>
>> You can specify hbase.defaults.for.version.skip as true in your
>> hbase-site.xml
>>
>> Cheers
>>
>> On Tue, Nov 17, 2015 at 1:01 AM, 임정택  wrote:
>>
>>> Hi all,
>>>
>>> I'm evaluating zeppelin to run driver which interacts with HBase.
>>> I use fat jar to include HBase dependencies, and see failures on
>>> executor level.
>>> I thought it is zeppelin's issue, but it fails on spark-shell, too.
>>>
>>> I loaded fat jar via --jars option,
>>>
>>> > ./bin/spark-shell --jars hbase-included-assembled.jar
>>>
>>> and run driver code using provided SparkContext instance, and see
>>> failures from spark-shell console and executor logs.
>>>
>>> below is stack traces,
>>>
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 
>>> in stage 0.0 failed 4 times, most recent failure: Lost task 55.3 in stage 
>>> 0.0 (TID 281, ): java.lang.NoClassDefFoundError: Could not 
>>> initialize class org.apache.hadoop.hbase.client.HConnectionManager
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:197)
>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>>> at 
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>> at 
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:128)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
>>> at 
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
>>> at scala.Option.foreach(Option.scala:236)
>>> at 
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
>>> at 
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
>>> at 

Re: thought experiment: use spark ML to real time prediction

2015-11-18 Thread Nick Pentreath
One such "lightweight PMML in JSON" is here -
https://github.com/bigmlcom/json-pml. At least for the schema definitions.
But nothing available in terms of evaluation/scoring. Perhaps this is
something that can form a basis for such a new undertaking.

I agree that distributed models are only really applicable in the case of
massive scale factor models - and then anyway for latency purposes one
needs to use LSH or something similar to achieve sufficiently real-time
performance. These days one can easily spin up a single very powerful
server to handle even very large models.

On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai  wrote:

> I was thinking about to work on better version of PMML, JMML in JSON, but
> as you said, this requires a dedicated team to define the standard which
> will be a huge work.  However, option b) and c) still don't address the
> distributed models issue. In fact, most of the models in production have to
> be small enough to return the result to users within reasonable latency, so
> I doubt how usefulness of the distributed models in real production
> use-case. For R and Python, we can build a wrapper on-top of the
> lightweight "spark-ml-common" project.
>
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath 
> wrote:
>
>> I think the issue with pulling in all of spark-core is often with
>> dependencies (and versions) conflicting with the web framework (or Akka in
>> many cases). Plus it really is quite heavy if you just want a fairly
>> lightweight model-serving app. For example we've built a fairly simple but
>> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
>> really need is the web framework and Breeze (or an alternative linear
>> algebra lib).
>>
>> I definitely hear the pain-point that PMML might not be able to handle
>> some types of transformations or models that exist in Spark. However,
>> here's an example from scikit-learn -> PMML that may be instructive (
>> https://github.com/scikit-learn/scikit-learn/issues/1596 and
>> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list
>> of estimators and transformers are supported (including e.g. scaling and
>> encoding, and PCA).
>>
>> I definitely think the current model I/O and "export" or "deploy to
>> production" situation needs to be improved substantially. However, you are
>> left with the following options:
>>
>> (a) build out a lightweight "spark-ml-common" project that brings in the
>> dependencies needed for production scoring / transformation in independent
>> apps. However, here you only support Scala/Java - what about R and Python?
>> Also, what about the distributed models? Perhaps "local" wrappers can be
>> created, though this may not work for very large factor or LDA models. See
>> also H20 example http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>>
>> (b) build out Spark's PMML support, and add missing stuff to PMML where
>> possible. The benefit here is an existing standard with various tools for
>> scoring (via REST server, Java app, Pig, Hive, various language support).
>>
>> (c) build out a more comprehensive I/O, serialization and scoring
>> framework. Here you face the issue of supporting various predictors and
>> transformers generically, across platforms and versioning. i.e. you're
>> re-creating a new standard like PMML
>>
>> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
>> specific", or even too "Scala / Java" specific. But it is still potentially
>> very useful to Spark users to build this out and have a somewhat standard
>> production serving framework and/or library (there are obviously existing
>> options like PredictionIO etc).
>>
>> Option (b) is really building out the existing PMML support within Spark,
>> so a lot of the initial work has already been done. I know some folks had
>> (or have) licensing issues with some components of JPMML (e.g. the
>> evaluator and REST server). But perhaps the solution here is to build an
>> Apache2-licensed evaluator framework.
>>
>> Option (c) is obviously interesting - "let's build a better PMML (that
>> uses JSON or whatever instead of XML!)". But it also seems like a huge
>> amount of reinventing the wheel, and like any new standard would take time
>> to garner wide support (if at all).
>>
>> It would be really useful to start to understand what the main missing
>> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
>> contribute improvements or additions to PMML.
>>
>>
>>
>> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> That may not be an issue if the app using the models runs by itself (not
>>> bundled into an existing app), which may actually be the right way to
>>> design it considering separation of concerns.
>>>
>>> 

Re: Shuffle FileNotFound Exception

2015-11-18 Thread Tom Arnfeld
Hi Romi,

Thanks! Could you give me an indication of how much to increase the partitions by? 
We'll take a stab in the dark; the input data is around 5M records (though each 
record is fairly small). We've had trouble both with DataFrames and RDDs.

Tom.

> On 18 Nov 2015, at 12:04, Romi Kuntsman  wrote:
> 
> I had many issues with shuffles (but not this one exactly), and what 
> eventually solved it was to repartition the input into more parts. Have you 
> tried that?
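A rough sketch of that suggestion (the names inputRdd and sqlContext and the
partition counts are illustrative placeholders, not tuned values):

    val repartitioned = inputRdd.repartition(2000)              // RDD path: more, smaller shuffle blocks
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")  // DataFrame path: partition count used by joins/aggregations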
> 
> P.S. not sure if related, but there's a memory leak in the shuffle mechanism
> https://issues.apache.org/jira/browse/SPARK-11293 
> 
> 
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com 
> 
> On Wed, Nov 18, 2015 at 2:00 PM, Tom Arnfeld  > wrote:
> Hey,
> 
> I’m wondering if anyone has run into issues with Spark 1.5 and a FileNotFound 
> exception with shuffle.index files? It’s been cropping up with very large 
> joins and aggregations, and causing all of our jobs to fail towards the end. 
> The memory limit for the executors (we’re running on mesos) is touching 60GB+ 
> with ~10 cores per executor, which is way oversubscribed.
> 
> We’re running spark inside containers, and have configured 
> “spark.executorEnv.SPARK_LOCAL_DIRS” to a special directory path in the 
> container for performance/disk reasons, and since then the issue started to 
> arise. I’m wondering if there’s a bug with the way spark looks for shuffle 
> files, and one of the implementations isn’t obeying the path properly?
> 
> I don’t want to set "spark.local.dir” because that requires the driver also 
> have this directory set up, which is not the case.
> 
> Has anyone seen this issue before?
> 
> 
> 
> 15/11/18 11:32:27 ERROR storage.ShuffleBlockFetcherIterator: Failed to get 
> block(s) from XXX:50777
> java.lang.RuntimeException: java.io.FileNotFoundException: 
> /mnt/mesos/sandbox/spark-tmp/blockmgr-7a5b85c8-7c5f-43c2-b9e2-2bc0ffdb902d/23/shuffle_2_94_0.index
>  (No such file or directory)
>at java.io.FileInputStream.open(Native Method)
>at java.io.FileInputStream.<init>(FileInputStream.java:146)
>at 
> org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:98)
>at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:300)
>at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>at 
> org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
>at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:114)
>at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87)
>at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101)
>at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>at 
> 

How to disable SparkUI programmatically?

2015-11-18 Thread Alex Luya
I noticed that the below bug has been fixed:

 https://issues.apache.org/jira/browse/SPARK-2100

but how do I do it (I mean disabling the SparkUI) programmatically?

is it by sparkContext.setLocalProperty(?,?)?

and I checked the below link, but couldn't figure out which property to set:

http://localhost/spark/docs/configuration.html#spark-ui
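A minimal sketch, assuming the property you are after is spark.ui.enabled: it has
to be set on the SparkConf before the SparkContext is created (setLocalProperty is
for per-thread job properties, not for this).

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("no-ui-example")                 // hypothetical app name
      .set("spark.ui.enabled", "false")            // disables the web UI for this application
    val sc = new SparkContext(conf)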


(send this email to subscribe)

2015-11-18 Thread Alex Luya



Re: (send this email to subscribe)

2015-11-18 Thread Nick Pentreath
To subscribe to the list, you need to send a mail to
user-subscr...@spark.apache.org

(see http://spark.apache.org/community.html for details and a subscribe
link).

On Wed, Nov 18, 2015 at 11:23 AM, Alex Luya 
wrote:

>
>


subscribe

2015-11-18 Thread Alex Luya
subscribe


Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Fengdong Yu
The simplest way is to remove all "provided" scopes in your pom.

Then run 'sbt assembly' to build your final package, and get rid of '--jars', 
because the assembly already includes all dependencies.






> On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:
> 
> So weird. Is there anything wrong with the way I made the pom file (I 
> labelled them as provided)?
>  
> Is there missing jar I forget to add in “--jar”?
>  
> See the trace below:
>  
>  
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> breeze/storage/DefaultArrayValue
> at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
> 15/11/18 17:15:15 INFO util.Utils: Shutdown hook called
>  
>  
> From: Ted Yu [mailto:yuzhih...@gmail.com] 
> Sent: Wednesday, 18 November 2015 4:01 PM
> To: Jack Yang
> Cc: user@spark.apache.org
> Subject: Re: spark with breeze error of NoClassDefFoundError
>  
> Looking in local maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue :
>  
> jar tvf 
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
> grep !$
> jar tvf 
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
> grep DefaultArrayValue
>369 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$mcZ$sp$class.class
>309 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$mcJ$sp.class
>   2233 Wed Mar 19 11:18:32 PDT 2014 
> breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class
>  
> Can you show the complete stack trace ?
>  
> FYI
>  
> On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang  > wrote:
> Hi all,
> I am using spark 1.4.0, and building my codes using maven.
> So in one of my scala, I used:
>  
> import breeze.linalg._
> val v1 = new breeze.linalg.SparseVector(commonVector.indices, 
> commonVector.values, commonVector.size)   
> val v2 = new breeze.linalg.SparseVector(commonVector2.indices, 
> commonVector2.values, commonVector2.size)
> println (v1.dot(v2) / (norm(v1) * norm(v2)) )
>  
>  
>  
> in my pom.xml file, I used:
>
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze-math_2.10</artifactId>
>   <version>0.4</version>
>   <scope>provided</scope>
> </dependency>
>
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze_2.10</artifactId>
>   <version>0.11.2</version>
>   <scope>provided</scope>
> </dependency>
>  
>  
> When submit, I included breeze jars (breeze_2.10-0.11.2.jar 
> breeze-math_2.10-0.4.jar breeze-natives_2.10-0.11.2.jar 
> breeze-process_2.10-0.3.jar) using “--jar” arguments, although I doubt it is 
> necessary to do that.
>  
> however, the error is
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> breeze/storage/DefaultArrayValue
>  
> Any thoughts?
>  
>  
>  
> Best regards,
> Jack



Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread 金国栋
Have you tried to change the scope to `compile`?

2015-11-18 14:15 GMT+08:00 Jack Yang :

> So weird. Is there anything wrong with the way I made the pom file (I
> labelled them as *provided*)?
>
>
>
> Is there missing jar I forget to add in “--jar”?
>
>
>
> See the trace below:
>
>
>
>
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/storage/DefaultArrayValue
>
> at
> smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.lang.ClassNotFoundException:
> breeze.storage.DefaultArrayValue
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> ... 10 more
>
> 15/11/18 17:15:15 INFO util.Utils: Shutdown hook called
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Wednesday, 18 November 2015 4:01 PM
> *To:* Jack Yang
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark with breeze error of NoClassDefFoundError
>
>
>
> Looking in local maven repo, breeze_2.10-0.7.jar
> contains DefaultArrayValue :
>
>
>
> jar tvf
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar
> | grep !$
>
> jar tvf
> /Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar
> | grep DefaultArrayValue
>
>369 Wed Mar 19 11:18:32 PDT 2014
> breeze/storage/DefaultArrayValue$mcZ$sp$class.class
>
>309 Wed Mar 19 11:18:32 PDT 2014
> breeze/storage/DefaultArrayValue$mcJ$sp.class
>
>   2233 Wed Mar 19 11:18:32 PDT 2014
> breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class
>
>
>
> Can you show the complete stack trace ?
>
>
>
> FYI
>
>
>
> On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang  wrote:
>
> Hi all,
>
> I am using spark 1.4.0, and building my codes using maven.
>
> So in one of my scala, I used:
>
>
>
> import breeze.linalg._
>
> val v1 = new breeze.linalg.SparseVector(commonVector.indices,
> commonVector.values, commonVector.size)
>
> val v2 = new breeze.linalg.SparseVector(commonVector2.indices,
> commonVector2.values, commonVector2.size)
>
> println (v1.dot(v2) / (norm(v1) * norm(v2)) )
>
>
>
>
>
>
>
> in my pom.xml file, I used:
>
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze-math_2.10</artifactId>
>   <version>0.4</version>
>   <scope>provided</scope>
> </dependency>
>
> <dependency>
>   <groupId>org.scalanlp</groupId>
>   <artifactId>breeze_2.10</artifactId>
>   <version>0.11.2</version>
>   <scope>provided</scope>
> </dependency>
>
>
>
>
>
> When submit, I included breeze jars (breeze_2.10-0.11.2.jar
> breeze-math_2.10-0.4.jar breeze-natives_2.10-0.11.2.jar
> breeze-process_2.10-0.3.jar) using “--jar” arguments, although I doubt it
> is necessary to do that.
>
>
>
> however, the error is
>
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> breeze/storage/DefaultArrayValue
>
>
>
> Any thoughts?
>
>
>
>
>
>
>
> Best regards,
>
> Jack
>
>
>
>
>


Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread swetha kasireddy
Looks like I can use mapPartitions but can it be done using
forEachPartition?

On Tue, Nov 17, 2015 at 11:51 PM, swetha  wrote:

> Hi,
>
> How to return an RDD of key/value pairs from an RDD that has
> foreachPartition applied. I have my code something like the following. It
> looks like an RDD that has foreachPartition can have only the return type
> as
> Unit. How do I apply foreachPartition and do a save and at the same time
> return a pair RDD.
>
>  def saveDataPointsBatchNew(records: RDD[(String, (Long,
> java.util.LinkedHashMap[java.lang.Long,
> java.lang.Float],java.util.LinkedHashMap[java.lang.Long, java.lang.Float],
> java.util.HashSet[java.lang.String] , Boolean))])= {
> records.foreachPartition({ partitionOfRecords =>
>   val dataLoader = new DataLoaderImpl();
>   var metricList = new java.util.ArrayList[String]();
>   var storageTimeStamp = 0l
>
>   if (partitionOfRecords != null) {
> partitionOfRecords.foreach(record => {
>
> if (record._2._1 == 0l) {
> entrySet = record._2._3.entrySet()
> itr = entrySet.iterator();
> while (itr.hasNext()) {
> val entry = itr.next();
> storageTimeStamp = entry.getKey.toLong
> val dayCounts = entry.getValue
> metricsDayCounts += record._1 ->(storageTimeStamp,
> dayCounts.toFloat)
> }
> }
>}
> }
> )
>   }
>
>   //Code to insert the last successful batch/streaming timestamp  ends
>   dataLoader.saveDataPoints(metricList);
>   metricList = null
>
> })
>   }
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-return-a-pair-RDD-from-an-RDD-that-has-foreachPartition-applied-tp25411.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread swetha kasireddy
It works fine after some changes.

-Thanks,
Swetha

On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das  wrote:

> Can you verify that the cluster is running the correct version of Spark.
> 1.5.2.
>
> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Sorry, compile scope makes it work locally. But the cluster
>> still seems to have issues with provided scope. Basically it
>> does not seem to process any records, no data is shown in any of the tabs
>> of the Streaming UI except the Streaming tab. Executors, Storage, Stages
>> etc show empty RDDs.
>>
>> On Tue, Nov 17, 2015 at 7:19 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi TD,
>>>
>>> Basically, I see two issues. With provided scope the job does
>>> not start locally. It does start in the cluster, but it seems no data is
>>> getting processed.
>>>
>>> Thanks,
>>> Swetha
>>>
>>> On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram 
>>> wrote:
>>>
 If you are running a local context, could it be that you should use:



 provided



 ?



 Thanks,

 Tim



 *From:* swetha kasireddy [mailto:swethakasire...@gmail.com]
 *Sent:* Wednesday, 18 November 2015 2:01 PM
 *To:* Tathagata Das
 *Cc:* user
 *Subject:* Re: Streaming Job gives error after changing to version
 1.5.2



 This error I see locally.



 On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das 
 wrote:

 Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?



 On Tue, Nov 17, 2015 at 5:34 PM, swetha 
 wrote:



 Hi,

 I see  java.lang.NoClassDefFoundError after changing the Streaming job
 version to 1.5.2. Any idea as to why this is happening? Following are my
 dependencies and the error that I get.

   
 org.apache.spark
 spark-core_2.10
 ${sparkVersion}
 provided
 


 
 org.apache.spark
 spark-streaming_2.10
 ${sparkVersion}
 provided
 


 
 org.apache.spark
 spark-sql_2.10
 ${sparkVersion}
 provided
 


 
 org.apache.spark
 spark-hive_2.10
 ${sparkVersion}
 provided
 



 
 org.apache.spark
 spark-streaming-kafka_2.10
 ${sparkVersion}
 


 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/spark/streaming/StreamingContext
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
 at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
 at java.lang.Class.getMethod0(Class.java:3010)
 at java.lang.Class.getMethod(Class.java:1776)
 at
 com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.streaming.StreamingContext
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 _

 The information transmitted in this message and its attachments (if
 any) is intended
 only for the person or entity to which it is addressed.
 The message may contain confidential and/or privileged material. Any
 review,
 retransmission, dissemination or other use of, or taking of any action
 in reliance
 upon this information, by persons or entities other than the intended
 recipient is
 prohibited.

 If you have received this in error, 

orc read issue n spark

2015-11-18 Thread Renu Yadav
Hi ,
I am using spark 1.4.1 and saving orc file using
df.write.format("orc").save("outputlocation")

outputlocation size 440GB

and while reading df.read.format("orc").load("outputlocation").count


it has 2618 partitions .
the count operation runs fine until 2500 but starts delay scheduling after
that which results in slow performance.

*If anyone has any idea on this, please do reply as I need this very urgently.*

Thanks in advance


Regards,
Renu Yadav


Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Zhan Zhang
When you have the following query, the 'account === "acct1" predicate will be pushed down to generate a 
new query with "where account = 'acct1'".
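A quick way to check this in practice (a sketch; the connection string, table name
and properties are placeholders, not from this thread):

    import java.util.Properties

    val url = "jdbc:postgresql://host:5432/db"     // placeholder connection string
    val df = sqlContext.read.jdbc(url, "accounts", new Properties())
    df.filter(df("account") === "acct1").explain(true)
    // the physical plan should show the predicate pushed into the JDBC scan
    // (e.g. as a WHERE clause / PushedFilters entry) rather than a Spark-side Filter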

Thanks.

Zhan Zhang

On Nov 18, 2015, at 11:36 AM, Eran Medan 
> wrote:

I understand that the following are equivalent

df.filter('account === "acct1")

sql("select * from tempTableName where account = 'acct1'")


But is Spark SQL "smart" to also push filter predicates down for the initial 
load?

e.g.
sqlContext.read.jdbc(…).filter('account=== "acct1")

Is Spark "smart enough" to this for each partition?

   ‘select … where account= ‘acc1’ AND (partition where clause here)?

Or do I have to put it on each partition where clause otherwise it will load 
the entire set and only then filter it in memory?




getting different results from same line of code repeated

2015-11-18 Thread Walrus theCat
Hi,

I'm launching a Spark cluster with the spark-ec2 script and playing around
in spark-shell. I'm running the same line of code over and over again, and
getting different results, and sometimes exceptions.  Towards the end,
after I cache the first RDD, it gives me the correct result multiple times
in a row before throwing an exception.  How can I get correct behavior out
of these operations on these RDDs?

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[116] at
sortBy at :36

scala> targets.first
res26: (String, Int) = (\bguns?\b,1253)

scala> val targets = data map {_.REGEX} groupBy{identity} map {
Function.tupled(_->_.size)} sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at
sortBy at :36

scala> targets.first
res27: (String, Int) = (nika,7)


scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[134] at
sortBy at :36

scala> targets.first
res28: (String, Int) = (\bcalientes?\b,6)

scala> targets.sortBy(_._2,false).first
java.lang.UnsupportedOperationException: empty collection

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[283] at
sortBy at :36

scala> targets.first
res46: (String, Int) = (\bhurting\ yous?\b,8)

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[292] at
sortBy at :36

scala> targets.first
java.lang.UnsupportedOperationException: empty collection

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[301] at
sortBy at :36

scala> targets.first
res48: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[310] at
sortBy at :36

scala> targets.first
res49: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[319] at
sortBy at :36

scala> targets.first
res50: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[328] at
sortBy at :36

scala> targets.first
java.lang.UnsupportedOperationException: empty collection


Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Feng,
Does airflow allow remote submissions of spark jobs via spark-submit?

On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu 
wrote:

> Hi,
>
> we use ‘Airflow'  as our job workflow scheduler.
>
>
>
>
> On Nov 19, 2015, at 9:47 AM, Vikram Kone  wrote:
>
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with
> command job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it
> doesn't actually kill the application on spark master and it continues to
> run even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs
> from azkaban portal?
>
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
> wrote:
>
>> Hi Vikram,
>>
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
>> local mode deployment and it is fairly easy to set up. It is pretty easy to
>> use and has a nice scheduling and logging interface, as well as SLAs (like
>> kill job and notify if it doesn't complete in 3 hours or whatever).
>>
>> However Spark support is not present directly - we run everything with
>> shell scripts and spark-submit. There is a plugin interface where one could
>> create a Spark plugin, but I found it very cumbersome when I did
>> investigate and didn't have the time to work through it to develop that.
>>
>> It has some quirks and while there is actually a REST API for adding jos
>> and dynamically scheduling jobs, it is not documented anywhere so you kinda
>> have to figure it out for yourself. But in terms of ease of use I found it
>> way better than Oozie. I haven't tried Chronos, and it seemed quite
>> involved to set up. Haven't tried Luigi either.
>>
>> Spark job server is good but as you say lacks some stuff like scheduling
>> and DAG type workflows (independent of spark-defined job flows).
>>
>>
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>>
>>> Check also falcon in combination with oozie
>>>
>>> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>>>
 Looks like Oozie can satisfy most of your requirements.



 On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
 wrote:

> Hi,
> I'm looking for open source workflow tools/engines that allow us to
> schedule spark jobs on a datastax cassandra cluster. Since there are 
> tonnes
> of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
> wanted to check with people here to see what they are using today.
>
> Some of the requirements of the workflow engine that I'm looking for
> are
>
> 1. First class support for submitting Spark jobs on Cassandra. Not
> some wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production
> scale.
> 3. Should be dead easy to write job dependencices using XML or web
> interface . Ex; job A depends on Job B and Job C, so run Job A after B and
> C are finished. Don't need to write full blown java applications to 
> specify
> job parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given
> time every hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on
> daily basis.
>
> I have looked at Ooyala's spark job server which seems to be hated
> towards making spark jobs run faster by sharing contexts between the jobs
> but isn't a full blown workflow engine per se. A combination of spark job
> server and workflow engine would be ideal
>
> Thanks for the inputs
>


>>
>
>


Re: Calculating Timeseries Aggregation

2015-11-18 Thread Sandip Mehta
Thank you TD for your time and help.

SM
> On 19-Nov-2015, at 6:58 AM, Tathagata Das  wrote:
> 
> There are different ways to do the rollups. Either update rollups from the 
> streaming application, or you can generate roll ups in a later process - say 
> periodic Spark job every hour. Or you could just generate rollups on demand, 
> when it is queried.
> The whole thing depends on your downstream requirements - if you always want to 
> have up-to-date rollups show up in the dashboard (even day-level stuff), then 
> the first approach is better. Otherwise, the second and third approaches are more 
> efficient.
> 
> TD
> 
> 
> On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta  > wrote:
> TD thank you for your reply.
> 
> I agree on data store requirement. I am using HBase as an underlying store.
> 
> So for every batch interval of say 10 seconds
> 
> - Calculate the time dimension ( minutes, hours, day, week, month and quarter 
> ) along with other dimensions and metrics
> - Update relevant base table at each batch interval for relevant metrics for 
> a given set of dimensions.
> 
> Only caveat I see is I’ll have to update each of the different roll up table 
> for each batch window.
> 
> Is this a valid approach for calculating time series aggregation?
> 
> Regards
> SM
> 
> For minutes level aggregates I have set up a streaming window say 10 seconds 
> and storing minutes level aggregates across multiple dimension in HBase at 
> every window interval. 
> 
>> On 18-Nov-2015, at 7:45 AM, Tathagata Das > > wrote:
>> 
>> For this sort of long term aggregations you should use a dedicated data 
>> storage systems. Like a database, or a key-value store. Spark Streaming 
>> would just aggregate and push the necessary data to the data store. 
>> 
>> TD
>> 
>> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta > > wrote:
>> Hi,
>> 
>> I am working on requirement of calculating real time metrics and building 
>> prototype  on Spark streaming. I need to build aggregate at Seconds, 
>> Minutes, Hours and Day level.
>> 
>> I am not sure whether I should calculate all these aggregates as  different 
>> Windowed function on input DStream or shall I use updateStateByKey function 
>> for the same. If I have to use updateStateByKey for these time series 
>> aggregation, how can I remove keys from the state after different time 
>> lapsed?
>> 
>> Please suggest.
>> 
>> Regards
>> SM
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> 
>> 
>> 
> 
> 



Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Hi,

we use ‘Airflow'  as our job workflow scheduler.




> On Nov 19, 2015, at 9:47 AM, Vikram Kone  wrote:
> 
> Hi Nick,
> Quick question about spark-submit command executed from azkaban with command 
> job type.
> I see that when I press kill in azkaban portal on a spark-submit job, it 
> doesn't actually kill the application on spark master and it continues to run 
> even though azkaban thinks that it's killed.
> How do you get around this? Is there a way to kill the spark-submit jobs from 
> azkaban portal?
> 
> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath  > wrote:
> Hi Vikram,
> 
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use 
> local mode deployment and it is fairly easy to set up. It is pretty easy to 
> use and has a nice scheduling and logging interface, as well as SLAs (like 
> kill job and notify if it doesn't complete in 3 hours or whatever). 
> 
> However Spark support is not present directly - we run everything with shell 
> scripts and spark-submit. There is a plugin interface where one could create 
> a Spark plugin, but I found it very cumbersome when I did investigate and 
> didn't have the time to work through it to develop that.
> 
> It has some quirks and while there is actually a REST API for adding jos and 
> dynamically scheduling jobs, it is not documented anywhere so you kinda have 
> to figure it out for yourself. But in terms of ease of use I found it way 
> better than Oozie. I haven't tried Chronos, and it seemed quite involved to 
> set up. Haven't tried Luigi either.
> 
> Spark job server is good but as you say lacks some stuff like scheduling and 
> DAG type workflows (independent of spark-defined job flows).
> 
> 
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  > wrote:
> Check also falcon in combination with oozie
> 
> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
> Looks like Oozie can satisfy most of your requirements. 
> 
> 
> 
> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone  > wrote:
> Hi,
> I'm looking for open source workflow tools/engines that allow us to schedule 
> spark jobs on a datastax cassandra cluster. Since there are tonnes of 
> alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I wanted to 
> check with people here to see what they are using today.
> 
> Some of the requirements of the workflow engine that I'm looking for are
> 
> 1. First class support for submitting Spark jobs on Cassandra. Not some 
> wrapper Java code to submit tasks.
> 2. Active open source community support and well tested at production scale.
> 3. Should be dead easy to write job dependencices using XML or web interface 
> . Ex; job A depends on Job B and Job C, so run Job A after B and C are 
> finished. Don't need to write full blown java applications to specify job 
> parameters and dependencies. Should be very simple to use.
> 4. Time based  recurrent scheduling. Run the spark jobs at a given time every 
> hour or day or week or month.
> 5. Job monitoring, alerting on failures and email notifications on daily 
> basis.
> 
> I have looked at Ooyala's spark job server which seems to be hated towards 
> making spark jobs run faster by sharing contexts between the jobs but isn't a 
> full blown workflow engine per se. A combination of spark job server and 
> workflow engine would be ideal 
> 
> Thanks for the inputs
> 
> 
> 



Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
The following works against a hive table from spark sql

hc.sql("select id,r from (select id, name, rank()  over (order by name) as
r from tt2) v where v.r >= 1 and v.r <= 12")

But when using  a standard sql context against a temporary table the
following occurs:


Exception in thread "main" java.lang.RuntimeException: [3.25]
  failure: ``)'' expected but `(' found

rank() over (order by name) as r
^
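For what it is worth, a sketch consistent with the behaviour above, assuming the
1.x window-function support only lives in the HiveContext parser (the table and
object names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)                     // sc: an existing SparkContext
    hc.table("some_hive_table").registerTempTable("tt2")
    hc.sql("select id, r from (select id, name, rank() over (order by name) as r from tt2) v " +
           "where v.r >= 1 and v.r <= 12")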


RE: how to group timestamp data and filter on it

2015-11-18 Thread Tim Barthram
Hi LCassa,

Try:

Map to pair, then reduce by key.

The spark documentation is a pretty good reference for this & there are plenty 
of word count examples on the internet.
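A minimal Scala sketch of that recipe applied to the question above (the original
used the Java API; apart from the field and key names taken from the question,
everything here is illustrative):

    import org.apache.spark.streaming.dstream.DStream

    case class Data(timestamp: Long, map: Map[String, String])

    def groupAndFilter(stream: DStream[Data]): DStream[(Long, Iterable[Map[String, String]])] = {
      stream
        .map(d => (d.timestamp, d.map))      // map to pair: the timestamp becomes the key
        .groupByKey()                        // group all maps that share a timestamp
        .mapValues(_.filter(_.get("key1").exists(_ == "value1")))  // keep maps where key1=value1
    }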

Warm regards,
TimB


From: Cassa L [mailto:lcas...@gmail.com]
Sent: Thursday, 19 November 2015 11:27 AM
To: user
Subject: how to group timestamp data and filter on it

Hi,
I have a data stream (JavaDStream) in following format-
timestamp=second1,  map(key1=value1, key2=value2)
timestamp=second2,map(key1=value3, key2=value4)
timestamp=second2, map(key1=value1, key2=value5)

I want to group data by 'timestamp' first and then filter each RDD for 
Key1=value1 or key1=value3 etc.
Each of above row represent POJO in RDD like:
public class Data{
long timestamp;
Map<String, String> map;
}
How do I do this in Spark? I am trying to figure out if I need to use map or 
flatMap etc?
Thanks,
LCassa


_

The information transmitted in this message and its attachments (if any) is 
intended 
only for the person or entity to which it is addressed.
The message may contain confidential and/or privileged material. Any review, 
retransmission, dissemination or other use of, or taking of any action in 
reliance 
upon this information, by persons or entities other than the intended recipient 
is 
prohibited.

If you have received this in error, please contact the sender and delete this 
e-mail 
and associated material from any computer.

The intended recipient of this e-mail may only use, reproduce, disclose or 
distribute 
the information contained in this e-mail and any attached files, with the 
permission 
of the sender.

This message has been scanned for viruses.
_


Re: Spark LogisticRegression returns scaled coefficients

2015-11-18 Thread robert_dodier
njoshi wrote
> I am testing the LogisticRegression performance on a synthetically
> generated data. 

Hmm, seems like a good idea. Can you give the code for generating the
training data?

best,

Robert Dodier



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LogisticRegression-returns-scaled-coefficients-tp25405p25421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread Tathagata Das
If possible, could you give us the root cause and solution for future
readers of this thread.

On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy  wrote:

> It works fine after some changes.
>
> -Thanks,
> Swetha
>
> On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das 
> wrote:
>
>> Can you verify that the cluster is running the correct version of Spark.
>> 1.5.2.
>>
>> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Sorry, compile scope makes it work locally. But the cluster
>>> still seems to have issues with provided scope. Basically it
>>> does not seem to process any records, no data is shown in any of the tabs
>>> of the Streaming UI except the Streaming tab. Executors, Storage, Stages
>>> etc show empty RDDs.
>>>
>>> On Tue, Nov 17, 2015 at 7:19 PM, swetha kasireddy <
>>> swethakasire...@gmail.com> wrote:
>>>
 Hi TD,

Basically, I see two issues. With provided scope the job does
not start locally. It does start in the cluster, but it seems no data is
 getting processed.

 Thanks,
 Swetha

 On Tue, Nov 17, 2015 at 7:04 PM, Tim Barthram 
 wrote:

> If you are running a local context, could it be that you should use:
>
>
>
> provided
>
>
>
> ?
>
>
>
> Thanks,
>
> Tim
>
>
>
> *From:* swetha kasireddy [mailto:swethakasire...@gmail.com]
> *Sent:* Wednesday, 18 November 2015 2:01 PM
> *To:* Tathagata Das
> *Cc:* user
> *Subject:* Re: Streaming Job gives error after changing to version
> 1.5.2
>
>
>
> This error I see locally.
>
>
>
> On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das 
> wrote:
>
> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>
>
>
> On Tue, Nov 17, 2015 at 5:34 PM, swetha 
> wrote:
>
>
>
> Hi,
>
> I see  java.lang.NoClassDefFoundError after changing the Streaming job
> version to 1.5.2. Any idea as to why this is happening? Following are
> my
> dependencies and the error that I get.
>
>   
> org.apache.spark
> spark-core_2.10
> ${sparkVersion}
> provided
> 
>
>
> 
> org.apache.spark
> spark-streaming_2.10
> ${sparkVersion}
> provided
> 
>
>
> 
> org.apache.spark
> spark-sql_2.10
> ${sparkVersion}
> provided
> 
>
>
> 
> org.apache.spark
> spark-hive_2.10
> ${sparkVersion}
> provided
> 
>
>
>
> 
> org.apache.spark
> spark-streaming-kafka_2.10
> ${sparkVersion}
> 
>
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/streaming/StreamingContext
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2693)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
> at java.lang.Class.getMethod0(Class.java:3010)
> at java.lang.Class.getMethod(Class.java:1776)
> at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:125)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.streaming.StreamingContext
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Job-gives-error-after-changing-to-version-1-5-2-tp25406.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>
> _
>
> The information transmitted in this message and its attachments (if
> any) is intended
> only for the person or 

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Yes, you can submit jobs remotely.



> On Nov 19, 2015, at 10:10 AM, Vikram Kone  wrote:
> 
> Hi Feng,
> Does airflow allow remote submissions of spark jobs via spark-submit?
> 
> On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu  > wrote:
> Hi,
> 
> we use ‘Airflow'  as our job workflow scheduler.
> 
> 
> 
> 
>> On Nov 19, 2015, at 9:47 AM, Vikram Kone > > wrote:
>> 
>> Hi Nick,
>> Quick question about spark-submit command executed from azkaban with command 
>> job type.
>> I see that when I press kill in azkaban portal on a spark-submit job, it 
>> doesn't actually kill the application on spark master and it continues to 
>> run even though azkaban thinks that it's killed.
>> How do you get around this? Is there a way to kill the spark-submit jobs 
>> from azkaban portal?
>> 
>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath > > wrote:
>> Hi Vikram,
>> 
>> We use Azkaban (2.5.0) in our production workflow scheduling. We just use 
>> local mode deployment and it is fairly easy to set up. It is pretty easy to 
>> use and has a nice scheduling and logging interface, as well as SLAs (like 
>> kill job and notify if it doesn't complete in 3 hours or whatever). 
>> 
>> However Spark support is not present directly - we run everything with shell 
>> scripts and spark-submit. There is a plugin interface where one could create 
>> a Spark plugin, but I found it very cumbersome when I did investigate and 
>> didn't have the time to work through it to develop that.
>> 
>> It has some quirks and while there is actually a REST API for adding jos and 
>> dynamically scheduling jobs, it is not documented anywhere so you kinda have 
>> to figure it out for yourself. But in terms of ease of use I found it way 
>> better than Oozie. I haven't tried Chronos, and it seemed quite involved to 
>> set up. Haven't tried Luigi either.
>> 
>> Spark job server is good but as you say lacks some stuff like scheduling and 
>> DAG type workflows (independent of spark-defined job flows).
>> 
>> 
>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke > > wrote:
>> Check also falcon in combination with oozie
>> 
>> On Fri, Aug 7, 2015 at 17:51, Hien Luu wrote:
>> Looks like Oozie can satisfy most of your requirements. 
>> 
>> 
>> 
>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone > > wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to schedule 
>> spark jobs on a datastax cassandra cluster. Since there are tonnes of 
>> alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I wanted to 
>> check with people here to see what they are using today.
>> 
>> Some of the requirements of the workflow engine that I'm looking for are
>> 
>> 1. First class support for submitting Spark jobs on Cassandra. Not some 
>> wrapper Java code to submit tasks.
>> 2. Active open source community support and well tested at production scale.
>> 3. Should be dead easy to write job dependencices using XML or web interface 
>> . Ex; job A depends on Job B and Job C, so run Job A after B and C are 
>> finished. Don't need to write full blown java applications to specify job 
>> parameters and dependencies. Should be very simple to use.
>> 4. Time based  recurrent scheduling. Run the spark jobs at a given time 
>> every hour or day or week or month.
>> 5. Job monitoring, alerting on failures and email notifications on daily 
>> basis.
>> 
>> I have looked at Ooyala's spark job server which seems to be hated towards 
>> making spark jobs run faster by sharing contexts between the jobs but isn't 
>> a full blown workflow engine per se. A combination of spark job server and 
>> workflow engine would be ideal 
>> 
>> Thanks for the inputs
>> 
>> 
>> 
> 
> 



DataFrame.insertIntoJDBC throws AnalysisException -- cannot save

2015-11-18 Thread jonpowell
For this simple example, we are importing 4 lines of 3 columns of a CSV file:

Administrator,FiveHundredAddresses1,92121
Ann,FiveHundredAddresses2,92109
Bobby,FiveHundredAddresses3,92101
Charles,FiveHundredAddresses4,92111

We are running spark-1.5.1-bin-hadoop2.6 with master and one slave, and the
JDBC thrift server and beeline client. They seem to all interconnect and are
able to communicate. From what I can understand, Hive is included in this
release in the datanucleus jars. I have configured directories to hold the
Hive files, but have no conf/hive-config.xml.

The users table has been pre-created in the beeline client using

  CREATE TABLE users(first_name STRING, last_name STRING, zip_code STRING);
  show tables;// it's there

For the scala REPL session on the master:

  val connectionUrl = "jdbc:hive2://x.y.z.t:1/users?user=blah="
  val userCsvFile = sc.textFile("/home/jpowell/Downloads/Users4.csv")
  case class User(first_name:String, last_name:String, work_zip:String)
  val users = userCsvFile.map(_.split(",")).map(l => User(l(0), l(1), l(2)))
  val usersDf = sqlContext.createDataFrame(users)
  usersDf.count()  // 4
  usersDf.schema  // res92: org.apache.spark.sql.types.StructType =
StructType(StructField(first_name,StringType,true),
StructField(last_name,StringType,true),
StructField(work_zip,StringType,true))
  usersDf.insertIntoJDBC(connectionUrl,"users",true)

OR

  usersDf.createJDBCTable(connectionUrl, "users", true)// w/o beeline
creation

*throws* 

warning: there were 1 deprecation warning(s); re-run with -deprecation for
details
*java.sql.SQLException: org.apache.spark.sql.AnalysisException: cannot
recognize input near 'TEXT' ',' 'last_name' in column type; line 1 pos 31
*   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at 
org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:406)
at
org.apache.hive.jdbc.HivePreparedStatement.executeUpdate(HivePreparedStatement.java:119)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
at org.apache.spark.sql.DataFrame.insertIntoJDBC(DataFrame.scala:1629)


Any ideas where I'm going wrong? Can this version actually write JDBC files
from a DataFrame?

Thanks for any help!

Jon



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-insertIntoJDBC-throws-AnalysisException-cannot-save-tp25422.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Calculating Timeseries Aggregation

2015-11-18 Thread Tathagata Das
There are different ways to do the rollups. Either update rollups from the
streaming application, or you can generate roll ups in a later process -
say periodic Spark job every hour. Or you could just generate rollups on
demand, when it is queried.
The whole thing depends on your downstream requirements - if you always want to
have up-to-date rollups show up in the dashboard (even day-level stuff),
then the first approach is better. Otherwise, the second and third approaches
are more efficient.

TD
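A rough sketch of the first approach (updating rollups from the streaming app).
Metric, the bucket granularities and saveRollup are illustrative placeholders,
not a prescribed implementation:

    import org.apache.spark.streaming.dstream.DStream

    case class Metric(ts: Long, dims: Map[String, String], value: Double)

    // placeholder for the real store write, e.g. an HBase increment on a per-granularity table
    def saveRollup(granularity: String, bucket: Long, dims: Map[String, String], total: Double): Unit = ()

    def updateRollups(metrics: DStream[Metric]): Unit = {
      metrics.foreachRDD { rdd =>
        val perBucket = rdd.flatMap { m =>
          // emit one row per time granularity for the same event
          Seq(("minute", m.ts / 60000L * 60000L),
              ("hour",   m.ts / 3600000L * 3600000L),
              ("day",    m.ts / 86400000L * 86400000L))
            .map { case (granularity, bucket) => ((granularity, bucket, m.dims), m.value) }
        }.reduceByKey(_ + _)                             // partial aggregate for this batch
        perBucket.foreachPartition { partition =>
          partition.foreach { case ((granularity, bucket, dims), total) =>
            saveRollup(granularity, bucket, dims, total)
          }
        }
      }
    }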


On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta 
wrote:

> TD thank you for your reply.
>
> I agree on data store requirement. I am using HBase as an underlying store.
>
> So for every batch interval of say 10 seconds
>
> - Calculate the time dimension ( minutes, hours, day, week, month and
> quarter ) along with other dimensions and metrics
> - Update relevant base table at each batch interval for relevant metrics
> for a given set of dimensions.
>
> Only caveat I see is I’ll have to update each of the different roll up
> table for each batch window.
>
> Is this a valid approach for calculating time series aggregation?
>
> Regards
> SM
>
> For minutes level aggregates I have set up a streaming window say 10
> seconds and storing minutes level aggregates across multiple dimension in
> HBase at every window interval.
>
> On 18-Nov-2015, at 7:45 AM, Tathagata Das  wrote:
>
> For this sort of long term aggregations you should use a dedicated data
> storage systems. Like a database, or a key-value store. Spark Streaming
> would just aggregate and push the necessary data to the data store.
>
> TD
>
> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta 
> wrote:
>
>> Hi,
>>
>> I am working on requirement of calculating real time metrics and building
>> prototype  on Spark streaming. I need to build aggregate at Seconds,
>> Minutes, Hours and Day level.
>>
>> I am not sure whether I should calculate all these aggregates as
>> different Windowed function on input DStream or shall I use
>> updateStateByKey function for the same. If I have to use updateStateByKey
>> for these time series aggregation, how can I remove keys from the state
>> after different time lapsed?
>>
>> Please suggest.
>>
>> Regards
>> SM
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Spark job workflow engine recommendations

2015-11-18 Thread Vikram Kone
Hi Nick,
Quick question about spark-submit command executed from azkaban with
command job type.
I see that when I press kill in azkaban portal on a spark-submit job, it
doesn't actually kill the application on spark master and it continues to
run even though azkaban thinks that it's killed.
How do you get around this? Is there a way to kill the spark-submit jobs
from azkaban portal?

On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath 
wrote:

> Hi Vikram,
>
> We use Azkaban (2.5.0) in our production workflow scheduling. We just use
> local mode deployment and it is fairly easy to set up. It is pretty easy to
> use and has a nice scheduling and logging interface, as well as SLAs (like
> kill job and notify if it doesn't complete in 3 hours or whatever).
>
> However Spark support is not present directly - we run everything with
> shell scripts and spark-submit. There is a plugin interface where one could
> create a Spark plugin, but I found it very cumbersome when I did
> investigate and didn't have the time to work through it to develop that.
>
> It has some quirks and while there is actually a REST API for adding jos
> and dynamically scheduling jobs, it is not documented anywhere so you kinda
> have to figure it out for yourself. But in terms of ease of use I found it
> way better than Oozie. I haven't tried Chronos, and it seemed quite
> involved to set up. Haven't tried Luigi either.
>
> Spark job server is good but as you say lacks some stuff like scheduling
> and DAG type workflows (independent of spark-defined job flows).
>
>
> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke  wrote:
>
>> Check also falcon in combination with oozie
>>
>> On Fri, Aug 7, 2015 at 17:51, Hien Luu  wrote:
>>
>>> Looks like Oozie can satisfy most of your requirements.
>>>
>>>
>>>
>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone 
>>> wrote:
>>>
 Hi,
 I'm looking for open source workflow tools/engines that allow us to
 schedule spark jobs on a datastax cassandra cluster. Since there are tonnes
 of alternatives out there like Ozzie, Azkaban, Luigi , Chronos etc, I
 wanted to check with people here to see what they are using today.

 Some of the requirements of the workflow engine that I'm looking for are

 1. First class support for submitting Spark jobs on Cassandra. Not some
 wrapper Java code to submit tasks.
 2. Active open source community support and well tested at production
 scale.
 3. Should be dead easy to write job dependencices using XML or web
 interface . Ex; job A depends on Job B and Job C, so run Job A after B and
 C are finished. Don't need to write full blown java applications to specify
 job parameters and dependencies. Should be very simple to use.
 4. Time based  recurrent scheduling. Run the spark jobs at a given time
 every hour or day or week or month.
 5. Job monitoring, alerting on failures and email notifications on
 daily basis.

 I have looked at Ooyala's spark job server which seems to be hated
 towards making spark jobs run faster by sharing contexts between the jobs
 but isn't a full blown workflow engine per se. A combination of spark job
 server and workflow engine would be ideal

 Thanks for the inputs

>>>
>>>
>


Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread Sathish Kumaran Vairavelu
I think you can use mapPartitions to return the pair RDD, followed by
foreachPartition for saving it.
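A minimal sketch of that split (the types and the save call are illustrative
placeholders, not the original code): do the pair-producing work in mapPartitions,
cache the result, and keep the Unit-returning save in foreachPartition.

    import org.apache.spark.rdd.RDD

    def saveAndReturnPairs(records: RDD[(String, Long)]): RDD[(String, Float)] = {
      val pairs: RDD[(String, Float)] = records.mapPartitions { partition =>
        partition.map { case (key, ts) => (key, ts.toFloat) }   // per-partition transform returning pairs
      }
      pairs.cache()                        // avoid recomputing for both the save and the returned RDD
      pairs.foreachPartition { partition =>
        partition.foreach { case (key, value) => () }           // replace () with the actual store write
      }
      pairs
    }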

On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy 
wrote:

> Looks like I can use mapPartitions but can it be done using
> forEachPartition?
>
> On Tue, Nov 17, 2015 at 11:51 PM, swetha 
> wrote:
>
>> Hi,
>>
>> How to return an RDD of key/value pairs from an RDD that has
>> foreachPartition applied. I have my code something like the following. It
>> looks like an RDD that has foreachPartition can have only the return type
>> as
>> Unit. How do I apply foreachPartition and do a save and at the same time
>> return a pair RDD.
>>
>>  def saveDataPointsBatchNew(records: RDD[(String, (Long,
>> java.util.LinkedHashMap[java.lang.Long,
>> java.lang.Float],java.util.LinkedHashMap[java.lang.Long, java.lang.Float],
>> java.util.HashSet[java.lang.String] , Boolean))])= {
>> records.foreachPartition({ partitionOfRecords =>
>>   val dataLoader = new DataLoaderImpl();
>>   var metricList = new java.util.ArrayList[String]();
>>   var storageTimeStamp = 0l
>>
>>   if (partitionOfRecords != null) {
>> partitionOfRecords.foreach(record => {
>>
>> if (record._2._1 == 0l) {
>> entrySet = record._2._3.entrySet()
>> itr = entrySet.iterator();
>> while (itr.hasNext()) {
>> val entry = itr.next();
>> storageTimeStamp = entry.getKey.toLong
>> val dayCounts = entry.getValue
>> metricsDayCounts += record._1 ->(storageTimeStamp,
>> dayCounts.toFloat)
>> }
>> }
>>}
>> }
>> )
>>   }
>>
>>   //Code to insert the last successful batch/streaming timestamp  ends
>>   dataLoader.saveDataPoints(metricList);
>>   metricList = null
>>
>> })
>>   }
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-return-a-pair-RDD-from-an-RDD-that-has-foreachPartition-applied-tp25411.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


unsubscribe

2015-11-18 Thread VJ Anand
-- 
*VJ Anand*
*Founder *
*Sankia*
vjan...@sankia.com
925-640-1340
www.sankia.com

*Confidentiality Notice*: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and privileged information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply e-mail and destroy all copies
of the original message


Re: Apache Groovy and Spark

2015-11-18 Thread Steve Loughran

Looks like Groovy scripts don't serialize over the wire properly.

Back in 2011 I hooked up groovy to mapreduce, so that you could do mappers and 
reducers there; "grumpy"

https://github.com/steveloughran/grumpy

slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy

What I ended up doing (see slide 13) was to send the raw script around as text and 
compile it into a Script instance at the far end. Compilation took some time, 
but the real barrier is that Groovy is not at all fast.
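A minimal sketch of that trick in Scala (sc is an existing SparkContext and the
Groovy snippet is a made-up example): only the script source String crosses the
wire, and each executor compiles it locally with GroovyShell.

    import groovy.lang.{Binding, GroovyShell}

    val scriptText = "x * 2"
    val doubled = sc.parallelize(1 to 4, 2).mapPartitions { iter =>
      val script = new GroovyShell().parse(scriptText)   // compile once per partition, on the executor
      iter.map { x =>
        val binding = new Binding()
        binding.setVariable("x", Int.box(x))
        script.setBinding(binding)
        script.run()                                     // the Groovy result, as a boxed object
      }
    }
    doubled.collect()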

It used to be about 10x slower; maybe now, with static compilation and the Java 7 
invoke-dynamic JARs, things are better. I'm still unsure I'd use it in 
production, and, given Spark's focus on Scala and Python, I'd pick one of those 
two.


On 18 Nov 2015, at 20:35, tog 
> wrote:

Hi

I started playing with both Apache projects and quickly got the exception below. 
Is anyone able to give some hints on the problem so that I can dig further?
It seems to be a problem for Spark to load some of the groovy classes ...

Any idea?
Thanks
Guillaume


tog GroovySpark $ $GROOVY_HOME/bin/groovy GroovySparkThroughGroovyShell.groovy

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 
1, localhost): java.lang.ClassNotFoundException: Script1$_run_closure1

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)

at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)

at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)

at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)

at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:

at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)

at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)

at 

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
We solved this by adding to spark-class script. At the bottom before the exec 
statement we intercepted the command that was constructed and injected our 
additional class path :

for ((i=0; i<${#CMD[@]}; i++)); do
  if [[ ${CMD[$i]} == *"$SPARK_ASSEMBLY_JAR"* ]]; then
    CMD[$i]="${CMD[$i]}:/usr/lib/hadoop/*.jar:/usr/share/aws/aws-java-sdk/aws-java-sdk-cloudwatch-1.10.4.jar:/usr/share/aws/emr/emrfs/lib/*"
  fi
done

exec "${CMD[@]}"

M


> On Nov 18, 2015, at 1:19 AM, "memorypr...@gmail.com"  
> wrote:
> 
> Have you tried using 
> spark.driver.extraClassPath
> and 
> spark.executor.extraClassPath
> 
> ?
> 
> AFAICT these config options replace SPARK_CLASSPATH. Further info in the 
> docs. I've had good luck with these options, and for ease of use I just set 
> them in the spark defaults config.
> 
> https://spark.apache.org/docs/latest/configuration.html
> 
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos  wrote:
>> Hi,
>> 
>> We are running a Spark Standalone cluster on EMR (note: not using YARN) and 
>> are trying to use S3 w/ EmrFS as our event logging directory.
>> 
>> We are having difficulties with a ClassNotFoundException on EmrFileSystem 
>> when we navigate to the event log screen. This is to be expected as the 
>> EmrFs jars are not on the classpath.
>> 
>> But -- I have not been able to figure out a way to add additional classpath 
>> jars to the start-up of the Master daemon. SPARK_CLASSPATH has been 
>> deprecated, and looking around at spark-class, etc.. everything seems to be 
>> pretty locked down. 
>> 
>> Do I have to shove everything into the assembly jar?
>> 
>> Am I missing a simple way to add classpath to the daemons?
>> 
>> thanks,
>> Michal


RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
If I tried to change “provided” to “compile”.. then the error changed to :

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing 
class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/11/19 10:28:29 INFO util.Utils: Shutdown hook called

Meanwhile, I would prefer to use Maven to compile the jar file rather than sbt, 
although sbt is indeed another option.

Best regards,
Jack



From: Fengdong Yu [mailto:fengdo...@everstring.com]
Sent: Wednesday, 18 November 2015 7:30 PM
To: Jack Yang
Cc: Ted Yu; user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError

The simplest way is to remove all “provided” scopes in your pom.

Then run ‘sbt assembly’ to build your final package, and get rid of ‘--jars’ 
because the assembly already includes all dependencies.






On Nov 18, 2015, at 2:15 PM, Jack Yang 
> wrote:

So weird. Is there anything wrong with the way I made the pom file (I labelled 
them as provided)?

Is there a missing jar I forgot to add in “--jars”?

See the trace below:



Exception in thread "main" java.lang.NoClassDefFoundError: 
breeze/storage/DefaultArrayValue
at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more
15/11/18 17:15:15 INFO util.Utils: Shutdown hook called


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, 18 November 2015 4:01 PM
To: Jack Yang
Cc: user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError

Looking in local maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue :

jar tvf 
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
grep !$
jar tvf 
/Users/tyu/.m2/repository//org/scalanlp/breeze_2.10/0.7/breeze_2.10-0.7.jar | 
grep DefaultArrayValue
   369 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$mcZ$sp$class.class
   309 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$mcJ$sp.class
  2233 Wed Mar 19 11:18:32 PDT 2014 
breeze/storage/DefaultArrayValue$DoubleDefaultArrayValue$.class

Can you show the complete stack trace ?

FYI

On Tue, Nov 17, 2015 at 8:33 PM, Jack Yang 
> wrote:
Hi all,
I am 

how to group timestamp data and filter on it

2015-11-18 Thread Cassa L
Hi,
I have a data stream (JavaDStream) in following format-
timestamp=second1,  map(key1=value1, key2=value2)
timestamp=second2,map(key1=value3, key2=value4)
timestamp=second2, map(key1=value1, key2=value5)


I want to group the data by 'timestamp' first and then filter each RDD for
key1=value1 or key1=value3, etc.

Each of above row represent POJO in RDD like:
public class Data{
long timestamp;
Map map;
}

How do I do this in Spark? I am trying to figure out whether I need to use map or
flatMap, etc.

Thanks,
LCassa
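
A minimal sketch of one way to shape this (written in Scala; the Java API mirrors it, and the
field types of the POJO are assumed here): key each record by its timestamp, group, then
filter inside each group.

import org.apache.spark.streaming.dstream.DStream

case class Data(timestamp: Long, map: Map[String, String])   // assumed field types

def groupAndFilter(stream: DStream[Data]): DStream[(Long, Iterable[Data])] =
  stream
    .map(d => (d.timestamp, d))      // key by timestamp
    .groupByKey()                    // all records that share a second
    .mapValues(_.filter(d =>
      d.map.get("key1").exists(v => v == "value1" || v == "value3")))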


How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-18 Thread swetha
Hi,

We have a lot of temp files that get created due to shuffles caused by
group by. How do we clear the files that get created by the intermediate
operations in the group by?


Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear-the-temp-files-that-gets-created-by-shuffle-in-Spark-Streaming-tp25425.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Checked out 1.6.0-SNAPSHOT 60 minutes ago

2015-11-18 19:19 GMT-08:00 Jack Yang :

> Which version of spark are you using?
>
>
>
> *From:* Stephen Boesch [mailto:java...@gmail.com]
> *Sent:* Thursday, 19 November 2015 2:12 PM
> *To:* user
> *Subject:* Do windowing functions require hive support?
>
>
>
>
>
> The following works against a hive table from spark sql
>
>
>
> hc.sql("select id,r from (select id, name, rank()  over (order by name) as
> r from tt2) v where v.r >= 1 and v.r <= 12")
>
>
>
> But when using  a standard sql context against a temporary table the
> following occurs:
>
>
>
>
>
> Exception in thread "main" java.lang.RuntimeException: [3.25]
>
>   failure: ``)'' expected but `(' found
>
>
>
> rank() over (order by name) as r
>
> ^
>
>


Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that
this might not be as bad as it seems; I may be incorrect in my
understanding though.

First, the "additional step" you're referring to is not likely to be adding
any overhead; the "extra map" is really just materializing the data once
(as opposed to zero times), which is what you want (assuming your access
pattern couldn't be reformulated in the way Sandy described, i.e. where all
the objects in a partition don't need to be in memory at the same time).

Secondly, even if this was an "extra" map step, it wouldn't add any extra
stages to a given pipeline, being a "narrow" dependency, so it would likely
be low-cost anyway.
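
For concreteness, here is a minimal Scala sketch of the copy-before-cache pattern that the
Hadoop note asks for (the path and the Text/IntWritable key and value classes are assumptions
for illustration): the map materializes an immutable copy of each reused Writable, after which
caching, sorting or aggregating is safe.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileCopy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seqfile-copy").setMaster("local[2]"))

    val raw = sc.sequenceFile("/tmp/input.seq", classOf[Text], classOf[IntWritable])

    // Copy out of the reused Writable instances before caching.
    val copied = raw.map { case (k, v) => (k.toString, v.get) }.cache()

    println(copied.count())
    sc.stop()
  }
}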

Let me know if any of the above seems incorrect, thanks!

On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza  wrote:

> Hi Jeff,
>
> Many access patterns simply take the result of hadoopFile and use it to
> create some other object, and thus have no need for each input record to
> refer to a different object.  In those cases, the current API is more
> performant than an alternative that would create an object for each record,
> because it avoids the unnecessary overhead of creating Java objects.  As
> you've pointed out, this is at the expense of making the code more verbose
> when caching.
>
> -Sandy
>
> On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi 
> wrote:
>
>> So we tried reading a sequencefile in Spark and realized that all our
>> records have ended up becoming the same.
>> THen one of us found this:
>>
>> Note: Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> Is there anyone that can shed some light on this bizzare behavior and the
>> decisions behind it?
>> And I also would like to know if anyone's able to read a binary file and
>> not to incur the additional map() as suggested by the above? What format
>> did you use?
>>
>> thanks
>> Jeff
>>
>
>


Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Zhiliang Zhu
Dear Ted, I just looked at the link you provided; it is great!
For my understanding, I could also directly use other parts of Breeze (besides the Spark 
MLlib linalg package) in a Spark (Scala or Java) program after importing the Breeze 
package, is that right?
Thanks a lot in advance again!
Zhiliang


 On Thursday, November 19, 2015 1:46 PM, Ted Yu  wrote:
   

 Have you looked at https://github.com/scalanlp/breeze/wiki
Cheers
On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu  wrote:


Dear Jack,
As is known, Breeze is numerical calculation package wrote by scala , spark 
mllib also use it as underlying package for algebra usage.Here I am also 
preparing to use Breeze for nonlinear equation optimization, however, it seemed 
that I could not find the exact doc or API for Breeze except spark linalg 
package...
Could you help some to provide me the official doc or API website for Breeze 
?Thank you in advance!
Zhiliang 
 


 On Thursday, November 19, 2015 7:32 AM, Jack Yang  wrote:
   

If I tried to change “provided” to “compile”.. then the error changed to :
Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing 
class     at java.lang.ClassLoader.defineClass1(Native Method)     at 
java.lang.ClassLoader.defineClass(ClassLoader.java:800)     at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)     
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)     at 
java.net.URLClassLoader.access$100(URLClassLoader.java:71)     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:361)     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:355)     at 
java.security.AccessController.doPrivileged(Native Method)     at 
java.net.URLClassLoader.findClass(URLClassLoader.java:354)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:425)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:358)     
atsmartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)     at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)   
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)     at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
     at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)     
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)     
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)     at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 15/11/19 10:28:29 
INFO util.Utils: Shutdown hook called    Meanwhile, I will prefer to use maven 
to compile the jar file rather than sbt, although it is indeed another option.  
  Best regards, Jack          From: Fengdong Yu 
[mailto:fengdo...@everstring.com]
Sent: Wednesday, 18 November 2015 7:30 PM
To: Jack Yang
Cc: Ted Yu; user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError    The simplest 
way is remove all “provided” in your pom.    then ‘sbt assembly” to build your 
final package. then get rid of ‘—jars’ because assembly already includes all 
dependencies.                   
On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:    So weird. Is 
there anything wrong with the way I made the pom file (I labelled them as 
provided)?   Is there missing jar I forget to add in “--jar”?   
See the trace 

Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
But to focus the attention properly: I had already tried out 1.5.2.

2015-11-18 19:46 GMT-08:00 Stephen Boesch :

> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>
> 2015-11-18 19:19 GMT-08:00 Jack Yang :
>
>> Which version of spark are you using?
>>
>>
>>
>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>> *Sent:* Thursday, 19 November 2015 2:12 PM
>> *To:* user
>> *Subject:* Do windowing functions require hive support?
>>
>>
>>
>>
>>
>> The following works against a hive table from spark sql
>>
>>
>>
>> hc.sql("select id,r from (select id, name, rank()  over (order by name)
>> as r from tt2) v where v.r >= 1 and v.r <= 12")
>>
>>
>>
>> But when using  a standard sql context against a temporary table the
>> following occurs:
>>
>>
>>
>>
>>
>> Exception in thread "main" java.lang.RuntimeException: [3.25]
>>
>>   failure: ``)'' expected but `(' found
>>
>>
>>
>> rank() over (order by name) as r
>>
>> ^
>>
>>
>


Re: Do windowing functions require hive support?

2015-11-18 Thread Michael Armbrust
Yes they do.

On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch  wrote:

> But to focus the attention properly: I had already tried out 1.5.2.
>
> 2015-11-18 19:46 GMT-08:00 Stephen Boesch :
>
>> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>>
>> 2015-11-18 19:19 GMT-08:00 Jack Yang :
>>
>>> Which version of spark are you using?
>>>
>>>
>>>
>>> *From:* Stephen Boesch [mailto:java...@gmail.com]
>>> *Sent:* Thursday, 19 November 2015 2:12 PM
>>> *To:* user
>>> *Subject:* Do windowing functions require hive support?
>>>
>>>
>>>
>>>
>>>
>>> The following works against a hive table from spark sql
>>>
>>>
>>>
>>> hc.sql("select id,r from (select id, name, rank()  over (order by name)
>>> as r from tt2) v where v.r >= 1 and v.r <= 12")
>>>
>>>
>>>
>>> But when using  a standard sql context against a temporary table the
>>> following occurs:
>>>
>>>
>>>
>>>
>>>
>>> Exception in thread "main" java.lang.RuntimeException: [3.25]
>>>
>>>   failure: ``)'' expected but `(' found
>>>
>>>
>>>
>>> rank() over (order by name) as r
>>>
>>> ^
>>>
>>>
>>
>


RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
SQLContext only implements a subset of the SQL functionality, and window functions are 
not included.
In HiveContext it is fine though.
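
For reference, a minimal sketch of the same ranking query run through a HiveContext (this
assumes spark-hive is on the classpath and that the "tt2" table from the thread is already
registered); in 1.5.x the DataFrame window API has the same HiveContext requirement.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object WindowWithHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("window-demo").setMaster("local[2]"))
    val hc = new HiveContext(sc)

    hc.sql(
      """select id, r from
        |  (select id, name, rank() over (order by name) as r from tt2) v
        |where v.r >= 1 and v.r <= 12""".stripMargin).show()

    sc.stop()
  }
}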

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, 19 November 2015 3:01 PM
To: Michael Armbrust
Cc: Jack Yang; user
Subject: Re: Do windowing functions require hive support?

Why is the same query (and actually i tried several variations) working against 
a hivecontext and not against the sql context?

2015-11-18 19:57 GMT-08:00 Michael Armbrust 
>:
Yes they do.

On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch 
> wrote:
But to focus the attention properly: I had already tried out 1.5.2.

2015-11-18 19:46 GMT-08:00 Stephen Boesch 
>:
Checked out 1.6.0-SNAPSHOT 60 minutes ago

2015-11-18 19:19 GMT-08:00 Jack Yang >:
Which version of spark are you using?

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, 19 November 2015 2:12 PM
To: user
Subject: Do windowing functions require hive support?


The following works against a hive table from spark sql

hc.sql("select id,r from (select id, name, rank()  over (order by name) as r 
from tt2) v where v.r >= 1 and v.r <= 12")

But when using  a standard sql context against a temporary table the following 
occurs:



Exception in thread "main" java.lang.RuntimeException: [3.25]

  failure: ``)'' expected but `(' found



rank() over (order by name) as r

^






Re: Apache Groovy and Spark

2015-11-18 Thread Nick Pentreath
Given there is no existing Groovy integration out there, I'd tend to agree to 
use Scala if possible - the basics of functional-style Groovy are fairly similar 
to Scala.



—
Sent from Mailbox

On Wed, Nov 18, 2015 at 11:52 PM, Steve Loughran 
wrote:

> Looks like groovy scripts dont' serialize over the wire properly.
> Back in 2011 I hooked up groovy to mapreduce, so that you could do mappers 
> and reducers there; "grumpy"
> https://github.com/steveloughran/grumpy
> slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy
> What I ended up doing (see slide 13) was send the raw script around as text 
> and compile it in to a Script instance at the far end. Compilation took some 
> time, but the real barrier is that groovy is not at all fast.
> It used to be 10x slow, maybe now with static compiles and the java7 
> invoke-dynamic JARs things are better. I'm still unsure I'd use it in 
> production, and, given spark's focus on Scala and Python, I'd pick one of 
> those two
> On 18 Nov 2015, at 20:35, tog 
> > wrote:
> Hi
> I start playing with both Apache projects and quickly got that exception. 
> Anyone being able to give some hint on the problem so that I can dig further.
> It seems to be a problem for Spark to load some of the groovy classes ...
> Any idea?
> Thanks
> Guillaume
> tog GroovySpark $ $GROOVY_HOME/bin/groovy GroovySparkThroughGroovyShell.groovy
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 
> (TID 1, localhost): java.lang.ClassNotFoundException: Script1$_run_closure1
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> 

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Zhiliang Zhu
Dear Jack,
As is known, Breeze is a numerical computation package written in Scala, and Spark 
MLlib also uses it as the underlying package for linear algebra. Here I am also 
preparing to use Breeze for nonlinear equation optimization; however, it seems 
that I could not find the exact doc or API for Breeze apart from the Spark linalg 
package...
Could you help by pointing me to the official doc or API website for Breeze? 
Thank you in advance!
Zhiliang 
 


 On Thursday, November 19, 2015 7:32 AM, Jack Yang  wrote:
   

  If I tried to change 
“provided” to “compile”.. then the error changed to :    Exception in thread 
"main" java.lang.IncompatibleClassChangeError: Implementing class     at 
java.lang.ClassLoader.defineClass1(Native Method)     at 
java.lang.ClassLoader.defineClass(ClassLoader.java:800)     at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)     
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)     at 
java.net.URLClassLoader.access$100(URLClassLoader.java:71)     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:361)     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:355)     at 
java.security.AccessController.doPrivileged(Native Method)     at 
java.net.URLClassLoader.findClass(URLClassLoader.java:354)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:425)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:358)     
atsmartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)     at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)   
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)     at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
     at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)     
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)     
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)     at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 15/11/19 10:28:29 
INFO util.Utils: Shutdown hook called    Meanwhile, I will prefer to use maven 
to compile the jar file rather than sbt, although it is indeed another option.  
  Best regards, Jack          From: Fengdong Yu 
[mailto:fengdo...@everstring.com]
Sent: Wednesday, 18 November 2015 7:30 PM
To: Jack Yang
Cc: Ted Yu; user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError    The simplest 
way is remove all “provided” in your pom.    then ‘sbt assembly” to build your 
final package. then get rid of ‘—jars’ because assembly already includes all 
dependencies.                   
On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:    So weird. Is 
there anything wrong with the way I made the pom file (I labelled them as 
provided)?   Is there missing jar I forget to add in “--jar”?   
See the trace below:       Exception in thread "main" 
java.lang.NoClassDefFoundError: breeze/storage/DefaultArrayValue     at 
smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)     at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)   
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)     at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
     at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)     
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)     
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)     at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: 
java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:366)     at 
java.net.URLClassLoader$1.run(URLClassLoader.java:355)     at 
java.security.AccessController.doPrivileged(Native Method)     at 
java.net.URLClassLoader.findClass(URLClassLoader.java:354)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:425)     at 
java.lang.ClassLoader.loadClass(ClassLoader.java:358)     ... 10 more 
15/11/18 17:15:15 INFO util.Utils: Shutdown hook called     From: Ted Yu 
[mailto:yuzhih...@gmail.com] 
Sent: Wednesday, 18 November 2015 4:01 PM
To: Jack Yang
Cc: user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError   Looking in local 
maven repo, breeze_2.10-0.7.jar contains DefaultArrayValue :   jar tvf 

Re: SequenceFile and object reuse

2015-11-18 Thread Sandy Ryza
Hi Jeff,

Many access patterns simply take the result of hadoopFile and use it to
create some other object, and thus have no need for each input record to
refer to a different object.  In those cases, the current API is more
performant than an alternative that would create an object for each record,
because it avoids the unnecessary overhead of creating Java objects.  As
you've pointed out, this is at the expense of making the code more verbose
when caching.

-Sandy

On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi 
wrote:

> So we tried reading a sequencefile in Spark and realized that all our
> records have ended up becoming the same.
> THen one of us found this:
>
> Note: Because Hadoop's RecordReader class re-uses the same Writable object
> for each record, directly caching the returned RDD or directly passing it
> to an aggregation or shuffle operation will create many references to the
> same object. If you plan to directly cache, sort, or aggregate Hadoop
> writable objects, you should first copy them using a map function.
>
> Is there anyone that can shed some light on this bizzare behavior and the
> decisions behind it?
> And I also would like to know if anyone's able to read a binary file and
> not to incur the additional map() as suggested by the above? What format
> did you use?
>
> thanks
> Jeff
>


Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Ted Yu
Have you looked at
https://github.com/scalanlp/breeze/wiki

Cheers
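
As a quick illustration (the breeze artifact comes in transitively with spark-mllib, or can
be depended on directly), Breeze can be used on its own from Scala code:

import breeze.linalg.{DenseMatrix, DenseVector}

object BreezeQuickStart {
  def main(args: Array[String]): Unit = {
    val v = DenseVector(1.0, 2.0, 3.0)
    val m = DenseMatrix((1.0, 0.0, 0.0),
                        (0.0, 2.0, 0.0),
                        (0.0, 0.0, 3.0))

    println(m * v)    // matrix-vector product
    println(v dot v)  // dot product
  }
}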

> On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu  wrote:
> 
> Dear Jack,
> 
> As is known, Breeze is numerical calculation package wrote by scala , spark 
> mllib also use it as underlying package for algebra usage.
> Here I am also preparing to use Breeze for nonlinear equation optimization, 
> however, it seemed that I could not find the exact doc or API for Breeze 
> except spark linalg package...
> 
> Could you help some to provide me the official doc or API website for Breeze ?
> Thank you in advance!
> 
> Zhiliang 
> 
> 
> 
> 
> On Thursday, November 19, 2015 7:32 AM, Jack Yang  wrote:
> 
> 
> If I tried to change “provided” to “compile”.. then the error changed to :
>  
> Exception in thread "main" java.lang.IncompatibleClassChangeError: 
> Implementing class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/11/19 10:28:29 INFO util.Utils: Shutdown hook called
>  
> Meanwhile, I will prefer to use maven to compile the jar file rather than 
> sbt, although it is indeed another option.
>  
> Best regards,
> Jack
>  
>  
>  
> From: Fengdong Yu [mailto:fengdo...@everstring.com] 
> Sent: Wednesday, 18 November 2015 7:30 PM
> To: Jack Yang
> Cc: Ted Yu; user@spark.apache.org
> Subject: Re: spark with breeze error of NoClassDefFoundError
>  
> The simplest way is remove all “provided” in your pom.
>  
> then ‘sbt assembly” to build your final package. then get rid of ‘—jars’ 
> because assembly already includes all dependencies.
>  
>  
>  
>  
>  
>  
> On Nov 18, 2015, at 2:15 PM, Jack Yang  wrote:
>  
> So weird. Is there anything wrong with the way I made the pom file (I 
> labelled them as provided)?
>  
> Is there missing jar I forget to add in “--jar”?
>  
> See the trace below:
>  
>  
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> breeze/storage/DefaultArrayValue
> at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: breeze.storage.DefaultArrayValue
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 10 more
> 15/11/18 17:15:15 INFO util.Utils: Shutdown hook called
>  
>  
> From: Ted Yu 

RE: Do windowing functions require hive support?

2015-11-18 Thread Jack Yang
Which version of spark are you using?

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, 19 November 2015 2:12 PM
To: user
Subject: Do windowing functions require hive support?


The following works against a hive table from spark sql

hc.sql("select id,r from (select id, name, rank()  over (order by name) as r 
from tt2) v where v.r >= 1 and v.r <= 12")

But when using  a standard sql context against a temporary table the following 
occurs:



Exception in thread "main" java.lang.RuntimeException: [3.25]

  failure: ``)'' expected but `(' found



rank() over (order by name) as r

^


Re: Do windowing functions require hive support?

2015-11-18 Thread Stephen Boesch
Why is the same query (and actually i tried several variations) working
against a hivecontext and not against the sql context?

2015-11-18 19:57 GMT-08:00 Michael Armbrust :

> Yes they do.
>
> On Wed, Nov 18, 2015 at 7:49 PM, Stephen Boesch  wrote:
>
>> But to focus the attention properly: I had already tried out 1.5.2.
>>
>> 2015-11-18 19:46 GMT-08:00 Stephen Boesch :
>>
>>> Checked out 1.6.0-SNAPSHOT 60 minutes ago
>>>
>>> 2015-11-18 19:19 GMT-08:00 Jack Yang :
>>>
 Which version of spark are you using?



 *From:* Stephen Boesch [mailto:java...@gmail.com]
 *Sent:* Thursday, 19 November 2015 2:12 PM
 *To:* user
 *Subject:* Do windowing functions require hive support?





 The following works against a hive table from spark sql



 hc.sql("select id,r from (select id, name, rank()  over (order by name)
 as r from tt2) v where v.r >= 1 and v.r <= 12")



 But when using  a standard sql context against a temporary table the
 following occurs:





 Exception in thread "main" java.lang.RuntimeException: [3.25]

   failure: ``)'' expected but `(' found



 rank() over (order by name) as r

 ^


>>>
>>
>


RE: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Jack Yang
Back to my question. If I use “provided”, the jar file will 
expect some libraries to be provided by the system.
However, “compile” is the default scope, which means 
the third-party library will be included inside the jar file after compiling.
So when I use “provided”, the error is that it cannot find the 
class, but with “compile” the error is IncompatibleClassChangeError.

Ok, so can someone tell me which version of breeze and breeze-math are used in 
spark 1.4?

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com]
Sent: Thursday, 19 November 2015 5:10 PM
To: Ted Yu
Cc: Jack Yang; Fengdong Yu; user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError

Dear Ted,
I just looked at the link you provided, it is great!

For my understanding, I could also directly use other Breeze part (except spark 
mllib package linalg ) in spark (scala or java ) program after importing Breeze 
package,
it is right?

Thanks a lot in advance again!
Zhiliang



On Thursday, November 19, 2015 1:46 PM, Ted Yu 
> wrote:

Have you looked at
https://github.com/scalanlp/breeze/wiki

Cheers

On Nov 18, 2015, at 9:34 PM, Zhiliang Zhu 
> wrote:
Dear Jack,

As is known, Breeze is numerical calculation package wrote by scala , spark 
mllib also use it as underlying package for algebra usage.
Here I am also preparing to use Breeze for nonlinear equation optimization, 
however, it seemed that I could not find the exact doc or API for Breeze except 
spark linalg package...

Could you help some to provide me the official doc or API website for Breeze ?
Thank you in advance!

Zhiliang



On Thursday, November 19, 2015 7:32 AM, Jack Yang 
> wrote:

If I tried to change “provided” to “compile”.. then the error changed to :

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing 
class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/11/19 10:28:29 INFO util.Utils: Shutdown hook called

Meanwhile, I will prefer to use maven to compile the jar file rather than sbt, 
although it is indeed another option.

Best regards,
Jack



From: Fengdong Yu [mailto:fengdo...@everstring.com]
Sent: Wednesday, 18 November 2015 7:30 PM
To: Jack Yang
Cc: Ted Yu; user@spark.apache.org
Subject: Re: spark with breeze error of NoClassDefFoundError

The simplest way is remove all “provided” in your pom.

then ‘sbt assembly” to build your final package. then get rid of ‘—jars’ 
because assembly already includes all dependencies.






On Nov 18, 2015, at 2:15 PM, Jack Yang 
> wrote:

So weird. Is there anything wrong with the way I made the pom file (I labelled 
them as provided)?

Is there missing jar I forget to add in “--jar”?

See the trace below:



Exception in thread "main" java.lang.NoClassDefFoundError: 
breeze/storage/DefaultArrayValue
at smartapp.smart.sparkwithscala.textMingApp.main(textMingApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

Spark Monitoring to get Spark GCs and records processed

2015-11-18 Thread rakesh rakshit
Hi all,

I want to monitor Spark to get the following:

1. All the GC stats for Spark JVMs
2. Records successfully processed in a batch
3. Records failed in a batch
4. Getting historical data for batches, jobs, stages, tasks, etc.

Please let me know how I can get this information in Spark.

Regards,
Rakesh
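
A sketch covering part of this (Spark 1.x APIs; the batch interval, timeout and the
throwaway queue input are only there so the example runs): GC logging is switched on through
the executor/driver JVM options, and per-batch record counts come from a StreamingListener.
For historical batch/job/stage/task data, enabling the event log (spark.eventLog.enabled plus
a spark.eventLog.dir) and running the history server is the usual route.

import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object StreamingMonitoring {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("monitoring-demo")
      .setMaster("local[2]")
      // In cluster mode these flags make every executor JVM write GC logs;
      // in local mode there is only the driver JVM, so they are a no-op here.
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Per-batch record counts and delays via the listener API.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"batch ${info.batchTime}: records=${info.numRecords} " +
          s"processingDelayMs=${info.processingDelay.getOrElse(-1L)}")
      }
    })

    // A throwaway input just so the example runs end to end.
    val queue = mutable.Queue(sc.parallelize(1 to 100))
    ssc.queueStream(queue).count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(30 * 1000)
    ssc.stop()
  }
}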


Re: Apache Groovy and Spark

2015-11-18 Thread tog
Hi Steve

Since you are familiar with Groovy, I will go a bit deeper into the details. My
(simple) Groovy scripts are working fine with Apache Spark - a closure
(when dehydrated) will serialize nicely.
My issue comes when I want to use GroovyShell to run my scripts (my
ultimate goal is to integrate with Apache Zeppelin so I would need to use
GroovyShell to run the scripts) and this is where I got the previous
exception.

Sure, you may question the use of Groovy while Scala/Python are nicely
supported. For me it is more a way to support the wider Java community ...
and after all, Scala/Groovy/Java all run on the JVM! Beyond that
point, I would be interested to know from the Spark community whether there is
any plan to integrate more closely with Java, especially with a Java REPL landing
in Java 9.

Cheers
Guillaume

On 18 November 2015 at 21:51, Steve Loughran  wrote:

>
> Looks like groovy scripts dont' serialize over the wire properly.
>
> Back in 2011 I hooked up groovy to mapreduce, so that you could do mappers
> and reducers there; "grumpy"
>
> https://github.com/steveloughran/grumpy
>
> slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy
>
> What I ended up doing (see slide 13) was send the raw script around as
> text and compile it in to a Script instance at the far end. Compilation
> took some time, but the real barrier is that groovy is not at all fast.
>
> It used to be 10x slow, maybe now with static compiles and the java7
> invoke-dynamic JARs things are better. I'm still unsure I'd use it in
> production, and, given spark's focus on Scala and Python, I'd pick one of
> those two
>
>
> On 18 Nov 2015, at 20:35, tog  wrote:
>
> Hi
>
> I start playing with both Apache projects and quickly got that exception.
> Anyone being able to give some hint on the problem so that I can dig
> further.
> It seems to be a problem for Spark to load some of the groovy classes ...
>
> Any idea?
> Thanks
> Guillaume
>
>
> tog GroovySpark $ $GROOVY_HOME/bin/groovy
> GroovySparkThroughGroovyShell.groovy
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> 0.0 (TID 1, localhost): java.lang.ClassNotFoundException:
> Script1$_run_closure1
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>
> at
> 

Spark twitter streaming in Java

2015-11-18 Thread Soni spark
Dear Friends,

I am struggling with Spark Twitter streaming: I am not getting any data.
Please correct the code below if you find any mistakes.

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.twitter.*;
import twitter4j.GeoLocation;
import twitter4j.Status;
import java.util.Arrays;
import scala.Tuple2;

public class SparkTwitterStreaming {

  public static void main(String[] args) {

    final String consumerKey = "XXX";
    final String consumerSecret = "XX";
    final String accessToken = "XX";
    final String accessTokenSecret = "XXX";
    SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkTwitterStreaming");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(6));
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey);
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret);
    System.setProperty("twitter4j.oauth.accessToken", accessToken);
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret);
    String[] filters = new String[] {"Narendra Modi"};
    JavaReceiverInputDStream<Status> twitterStream = TwitterUtils.createStream(jssc, filters);

    // Without filter: output the text of all tweets
    JavaDStream<String> statuses = twitterStream.map(
        new Function<Status, String>() {
          public String call(Status status) { return status.getText(); }
        }
    );
    statuses.print();
    statuses.dstream().saveAsTextFiles("/home/apache/tweets", "txt");
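
    // Note (added): the streaming computation only runs once the context is
    // started and awaited; without these two calls no data will ever be processed.
    jssc.start();
    jssc.awaitTermination();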

  }

}


Spark JDBCRDD query

2015-11-18 Thread sunil m
Hello Spark experts!
I am new to Spark and I have the following query...

What I am trying to do: run a Spark 1.5.1 job with local[*] on a 4-core CPU.
It pings an Oracle database and fetches 5000 records per partition into a JdbcRDD;
I increase the number of partitions by 1 for every 5000 records I fetch.
I have taken care that all partitions get the same count of records.

What I expected to happen ideally: all tasks start at the same time T0,
ping the Oracle database in parallel, store the values in the JdbcRDD, and finish
in parallel at T1.

What I observed: there was one task for every partition, and the tasks on the Web UI
were staggered; some were spawned or scheduled well after the first task was
scheduled.

Is there a configuration to change how many tasks can run simultaneously on
an executor core? Or, in other words, is it possible for one core to get more
than one task running simultaneously on that core?

Thanks...

Warm regards,
Sunil M.
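
A minimal JdbcRDD sketch to illustrate the partition/task relationship being asked about
(the connection string, table and column names are hypothetical, and the Oracle JDBC driver
must be on the classpath): each partition becomes exactly one task, and with local[4] at most
four of those tasks hold a core at any moment. spark.task.cpus changes how many cores a
single task reserves; it does not let one core run several tasks at the same time.

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object OracleFetch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-fetch").setMaster("local[4]"))

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/SVC", "user", "pass"),
      "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",  // the two ? bind the bounds
      1L,        // lowerBound
      100000L,   // upperBound
      20,        // numPartitions -> 20 tasks, at most 4 running at once on local[4]
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )
    println(rows.count())
    sc.stop()
  }
}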


Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-18 Thread Ted Yu
Have you seen SPARK-5836 ?
Note TD's comment at the end.

Cheers

On Wed, Nov 18, 2015 at 7:28 PM, swetha  wrote:

> Hi,
>
> We have a lot of temp files that gets created due to shuffles caused by
> group by. How to clear the files that gets created due to intermediate
> operations in group by?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear-the-temp-files-that-gets-created-by-shuffle-in-Spark-Streaming-tp25425.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: getting different results from same line of code repeated

2015-11-18 Thread Dean Wampler
Methods like first() and take(n) can't guarantee to return the same result
in a distributed context, because Spark uses an algorithm to grab data from
one or more partitions that involves running a distributed job over the
cluster, with tasks on the nodes where the chosen partitions are located.
You can look at the logic in the Spark code base, RDD.scala (first method
calls the take method) and SparkContext.scala (runJob method, which take
calls).

However, the exceptions definitely look like bugs to me. There must be some
empty partitions.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Nov 18, 2015 at 4:52 PM, Walrus theCat 
wrote:

> Hi,
>
> I'm launching a Spark cluster with the spark-ec2 script and playing around
> in spark-shell. I'm running the same line of code over and over again, and
> getting different results, and sometimes exceptions.  Towards the end,
> after I cache the first RDD, it gives me the correct result multiple times
> in a row before throwing an exception.  How can I get correct behavior out
> of these operations on these RDDs?
>
> scala> val targets =
> data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[116]
> at sortBy at :36
>
> scala> targets.first
> res26: (String, Int) = (\bguns?\b,1253)
>
> scala> val targets = data map {_.REGEX} groupBy{identity} map {
> Function.tupled(_->_.size)} sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125]
> at sortBy at :36
>
> scala> targets.first
> res27: (String, Int) = (nika,7)
>
>
> scala> val targets =
> data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[134]
> at sortBy at :36
>
> scala> targets.first
> res28: (String, Int) = (\bcalientes?\b,6)
>
> scala> targets.sortBy(_._2,false).first
> java.lang.UnsupportedOperationException: empty collection
>
> scala> val targets =
> data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[283]
> at sortBy at :36
>
> scala> targets.first
> res46: (String, Int) = (\bhurting\ yous?\b,8)
>
> scala> val targets =
> data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[292]
> at sortBy at :36
>
> scala> targets.first
> java.lang.UnsupportedOperationException: empty collection
>
> scala> val targets =
> data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[301]
> at sortBy at :36
>
> scala> targets.first
> res48: (String, Int) = (\bguns?\b,1253)
>
> scala> val targets =
> data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[310]
> at sortBy at :36
>
> scala> targets.first
> res49: (String, Int) = (\bguns?\b,1253)
>
> scala> val targets =
> data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[319]
> at sortBy at :36
>
> scala> targets.first
> res50: (String, Int) = (\bguns?\b,1253)
>
> scala> val targets =
> data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
> targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[328]
> at sortBy at :36
>
> scala> targets.first
> java.lang.UnsupportedOperationException: empty collection
>
>
>
>
>


Re: Incorrect results with reduceByKey

2015-11-18 Thread tovbinm
Deep copying the data solved the issue:
data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id,
List(t)) }).reduceByKey(_ ++ _)

(noted here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003)


Thanks Igor Berman, for pointing that out.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Incorrect-results-with-reduceByKey-tp25410p25420.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org