[PySpark 2.1.0] - SparkContext not properly initialized by SparkConf

2017-01-25 Thread Sidney Feiner
Hey, I'm pasting a question I asked on Stack Overflow without getting any
answers there :(
I hope somebody here knows the answer. Thanks in advance!
Link to post
I'm migrating from Spark 1.6 to 2.1.0 and I've run into a problem migrating my 
PySpark application.
I'm dynamically setting up my SparkConf object based on configurations in a 
file and when I was on Spark 1.6, the app would run with the correct configs. 
But now, when I open the Spark UI, I can see that NONE of those configs are 
loaded into the SparkContext. Here's my code:
from pyspark import SparkConf, SparkContext

# Keep only the 'spark.*' entries from the parsed config file
spark_conf = SparkConf().setAll(
    filter(lambda x: x[0].startswith('spark.'), conf_dict.items())
)
sc = SparkContext(conf=spark_conf)
I've also added a print before initializing the SparkContext to make sure the 
SparkConf has all the relevant configs:
[print("{0}: {1}".format(key, value)) for (key, value) in spark_conf.getAll()]
And this outputs all the configs I need:
* spark.app.name: MyApp
* spark.akka.threads: 4
* spark.driver.memory: 2G
* spark.streaming.receiver.maxRate: 25
* spark.streaming.backpressure.enabled: true
* spark.executor.logs.rolling.maxRetainedFiles: 7
* spark.executor.memory: 3G
* spark.cores.max: 24
* spark.executor.cores: 4
* spark.streaming.blockInterval: 350ms
* spark.memory.storageFraction: 0.2
* spark.memory.useLegacyMode: false
* spark.memory.fraction: 0.8
* spark.executor.logs.rolling.time.interval: daily
I submit my job with the following:
/usr/local/spark/bin/spark-submit --conf spark.driver.host=i-${HOSTNAME} 
--master spark://i-${HOSTNAME}:7077 /path/to/main/file.py /path/to/config/file
Does anybody know why my SparkContext doesn't get initialized with my SparkConf?
Thanks :)


Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp


Ingesting Large csv File to relational database

2017-01-25 Thread Eric Dain
Hi,

I need to write a nightly job that ingests large CSV files (~15 GB each) and
adds/updates/deletes the changed rows in a relational database.

If a row is identical to what is in the database, I don't want to re-write the
row to the database. Also, if the same item comes from multiple sources (files),
I need logic to decide whether the new source is preferred or whether the
current row in the database should be kept unchanged.

Obviously, I don't want to query the database for each item to check whether the
item has changed. I prefer to maintain the state inside Spark.

Is there a preferred and performant way to do that using Apache Spark?
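
For illustration, one minimal way such a comparison is often sketched in Spark
SQL (hedged: the key column "id", the paths, and the JDBC settings below are
illustrative assumptions, not details from this thread):

// Load the nightly CSV and a snapshot of the current table (assumed to share a schema).
val incoming = spark.read.option("header", "true").csv("/path/to/nightly/*.csv")
val current = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")   // assumed connection details
  .option("dbtable", "items")
  .load()

// Rows whose full content is not already in the database: insert or update candidates.
val changed = incoming.except(current)

// Split candidates by whether the key already exists: new keys are inserts, the rest updates.
val inserts = changed.join(current.select("id"), Seq("id"), "left_anti")
val updates = changed.join(current.select("id"), Seq("id"), "left_semi")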

Best,
Eric


do I need to run spark standalone master with supervisord?

2017-01-25 Thread kant kodali
Do I need to run the Spark standalone master with a process supervisor such as
supervisord or systemd? Does the Spark standalone master abort itself if
ZooKeeper tells it that it is no longer the master?

Thanks!


Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-25 Thread Takeshi Yamamuro
Hi,

I'm not that familiar with MySQL, but did MySQL receive the same query in the
two cases?
Have you checked the MySQL logs?

// maropu


On Thu, Jan 26, 2017 at 12:42 PM, Xuan Dzung Doan <
doanxuand...@yahoo.com.invalid> wrote:

> Hi,
>
> Spark version 2.1.0
> MySQL community server version 5.7.17
> MySQL Connector Java 5.1.40
>
> I need to save a dataframe to a MySQL table. In spark shell, the following
> statement succeeds:
>
> scala> df.write.mode(SaveMode.Append).format("jdbc").option("url",
> "jdbc:mysql://127.0.0.1:3306/mydb").option("dbtable",
> "person").option("user", "username").option("password", "password").save()
>
> I write an app that basically does the same thing, issuing the same
> statement saving the same dataframe to the same MySQL table. I run it using
> spark-submit, but it fails, reporting some error in the SQL syntax. Here's
> the detailed stack trace:
>
> 17/01/25 16:06:02 INFO DAGScheduler: Job 2 failed: save at
> DataIngestionJob.scala:119, took 0.159574 s
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent
> failure: Lost task 0.0 in stage 2.0 (TID 3, localhost, executor driver):
> java.sql.BatchUpdateException: You have an error in your SQL syntax; check
> the manual that corresponds to your MySQL server version for the right
> syntax to use near '"user","age","state") VALUES ('user3',44,'CT')' at line
> 1
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
> at com.mysql.jdbc.Util.getInstance(Util.java:408)
> at com.mysql.jdbc.SQLError.createBatchUpdateException(
> SQLError.java:1162)
> at com.mysql.jdbc.PreparedStatement.executeBatchSerially(
> PreparedStatement.java:1773)
> at com.mysql.jdbc.PreparedStatement.executeBatchInternal(
> PreparedStatement.java:1257)
> at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.
> java:958)
> at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$.savePartition(JdbcUtils.scala:597)
> at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
> at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$
> anonfun$apply$29.apply(RDD.scala:925)
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$
> anonfun$apply$29.apply(RDD.scala:925)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(
> SparkContext.scala:1944)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(
> SparkContext.scala:1944)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:282)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You
> have an error in your SQL syntax; check the manual that corresponds to your
> MySQL server version for the right syntax to use near
> '"user","age","state") VALUES ('user3',44,'CT')' at line 1
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
> at com.mysql.jdbc.Util.getInstance(Util.java:408)
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:943)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3970)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3906)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2524)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2677)
> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
> at com.mysql.jdbc.PreparedStatement.executeInternal(
> PreparedStatement.java:1861)
> at com.mysql.jdbc.PreparedStatement.executeUpdateInternal(
> PreparedStatement.java:2073)
> at 

Re: Issue returning Map from UDAF

2017-01-25 Thread Takeshi Yamamuro
Hi,

Quickly looking over the attached code, I found that you passed the wrong
dataType for your aggregator output at line 70.
You need to return a `MapType` instead of a `StructType`;
the stack trace you posted explicitly points to this type mismatch.
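
For illustration, a minimal sketch of the relevant override (hedged: the
key/value types are assumptions based on the expected output quoted below):

import org.apache.spark.sql.types._

// Declare the UDAF output as a map of hour bucket -> count rather than a
// struct, so Catalyst can convert the returned java.util.Map.
// (Java equivalent: DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType).)
override def dataType: DataType = MapType(StringType, IntegerType)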

// maropu


On Thu, Jan 26, 2017 at 12:07 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> I have a dataset with tuple of ID and Timestamp. I want to do a group by
> on ID and then create a map with frequency per hour for the ID.
>
> Input:
> 1| 20160106061005
> 1| 20160106061515
> 1| 20160106064010
> 1| 20160106050402
> 1| 20160106040101
> 2| 20160106040101
> 3| 20160106051451
>
> Expected Output:
> 1|{2016010604:1, 2016010605:1, 2016010606:3}
> 2|{2016010604:1}
> 3|{2016010605:1}
>
> As I could not find a function in org.apache.spark.sql.functions library
> that can do this aggregation I wrote a UDAF but when I execute it, it
> throws below exception.
>
> I am using Dataset API from Spark 2.0 and am using Java library. Also
> attached is the code with the test data.
>
> scala.MatchError: {2016010606=1} (of class
> scala.collection.convert.Wrappers$MapWrapper)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.
> toCatalystImpl(CatalystTypeConverters.scala:256)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.
> toCatalystImpl(CatalystTypeConverters.scala:251)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$
> CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$
> createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
> at org.apache.spark.sql.execution.aggregate.ScalaUDAF.eval(udaf.scala:440)
> at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$
> generateResultProjection$1.apply(AggregationIterator.scala:228)
> at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$
> generateResultProjection$1.apply(AggregationIterator.scala:220)
> at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.
> next(SortBasedAggregationIterator.scala:152)
> at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.
> next(SortBasedAggregationIterator.scala:29)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 4.apply(SparkPlan.scala:247)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
> 4.apply(SparkPlan.scala:240)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:784)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$
> 1$$anonfun$apply$24.apply(RDD.scala:784)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/01/25 18:20:58 INFO Executor: Executor is trying to kill task 29.0 in
> stage 7.0 (TID 398)
> 17/01/25 18:20:58 INFO DAGScheduler: ResultStage 7 (show at
> EdgeAggregator.java:29) failed in 0.699 s
> 17/01/25 18:20:58 INFO DAGScheduler: Job 3 failed: show at
> EdgeAggregator.java:29, took 0.712912 s
> In merge hr: 2016010606
> 17/01/25 18:20:58 WARN TaskSetManager: Lost task 29.0 in stage 7.0 (TID
> 398, localhost): scala.MatchError: {2016010606=1} (of class
> scala.collection.convert.Wrappers$MapWrapper)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.
> toCatalystImpl(CatalystTypeConverters.scala:256)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.
> toCatalystImpl(CatalystTypeConverters.scala:251)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$
> CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
> at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$
> createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
> at org.apache.spark.sql.execution.aggregate.ScalaUDAF.eval(udaf.scala:440)
> at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$
> generateResultProjection$1.apply(AggregationIterator.scala:228)
> at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$
> generateResultProjection$1.apply(AggregationIterator.scala:220)
> at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.
> next(SortBasedAggregationIterator.scala:152)
> at 

Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-25 Thread Xuan Dzung Doan
Hi,

Spark version 2.1.0
MySQL community server version 5.7.17
MySQL Connector Java 5.1.40

I need to save a dataframe to a MySQL table. In spark shell, the following 
statement succeeds:

scala> df.write.mode(SaveMode.Append).format("jdbc").option("url", 
"jdbc:mysql://127.0.0.1:3306/mydb").option("dbtable", "person").option("user", 
"username").option("password", "password").save()

I write an app that basically does the same thing, issuing the same statement 
saving the same dataframe to the same MySQL table. I run it using spark-submit, 
but it fails, reporting some error in the SQL syntax. Here's the detailed stack 
trace:

17/01/25 16:06:02 INFO DAGScheduler: Job 2 failed: save at 
DataIngestionJob.scala:119, took 0.159574 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 2.0 (TID 3, localhost, executor driver): 
java.sql.BatchUpdateException: You have an error in your SQL syntax; check the 
manual that corresponds to your MySQL server version for the right syntax to 
use near '"user","age","state") VALUES ('user3',44,'CT')' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.Util.getInstance(Util.java:408)
at 
com.mysql.jdbc.SQLError.createBatchUpdateException(SQLError.java:1162)
at 
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1773)
at 
com.mysql.jdbc.PreparedStatement.executeBatchInternal(PreparedStatement.java:1257)
at com.mysql.jdbc.StatementImpl.executeBatch(StatementImpl.java:958)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have 
an error in your SQL syntax; check the manual that corresponds to your MySQL 
server version for the right syntax to use near '"user","age","state") VALUES 
('user3',44,'CT')' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.Util.getInstance(Util.java:408)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:943)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3970)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3906)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2524)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2677)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
at 
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at 
com.mysql.jdbc.PreparedStatement.executeUpdateInternal(PreparedStatement.java:2073)
at 
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1751)
... 15 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 

Re: Catalyst Expression(s) - Cleanup

2017-01-25 Thread Takeshi Yamamuro
Hi,

If you mean clean-up on the executors, how about using
TaskContext#addTaskCompletionListener?
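
For illustration, a minimal sketch of that approach (hedged: the resource and
its helper are hypothetical placeholders, not part of any Spark API):

import org.apache.spark.TaskContext

// Register a per-task hook so the resource is released once the task that
// evaluated this Expression finishes (on the executor).
lazy val resource = openExpensiveResource()   // hypothetical helper
TaskContext.get().addTaskCompletionListener { (ctx: TaskContext) =>
  resource.close()
}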

// maropu

On Thu, Jan 26, 2017 at 3:13 AM, Bowden, Chris  wrote:

> Is there currently any way to receive a signal when an Expression will no
> longer receive any rows so internal resources can be cleaned up?
>
> I have seen Generators are allowed to terminate() but my Expression(s) do
> not need to emit 0..N rows.
>



-- 
---
Takeshi Yamamuro


Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-25 Thread Raghavendra Pandey
When you start a slave, you pass the master's address as a parameter. That
slave will then contact the master and register itself.

On Jan 25, 2017 4:12 AM, "kant kodali"  wrote:

> Hi,
>
> How do I dynamically add nodes to spark standalone cluster and be able to
> discover them? Does Zookeeper do service discovery? What is the standard
> tool for these things?
>
> Thanks,
> kant
>


Issue returning Map from UDAF

2017-01-25 Thread Ankur Srivastava
Hi,

I have a dataset with tuple of ID and Timestamp. I want to do a group by on
ID and then create a map with frequency per hour for the ID.

Input:
1| 20160106061005
1| 20160106061515
1| 20160106064010
1| 20160106050402
1| 20160106040101
2| 20160106040101
3| 20160106051451

Expected Output:
1|{2016010604:1, 2016010605:1, 2016010606:3}
2|{2016010604:1}
3|{2016010605:1}

As I could not find a function in org.apache.spark.sql.functions library
that can do this aggregation I wrote a UDAF but when I execute it, it
throws below exception.

I am using Dataset API from Spark 2.0 and am using Java library. Also
attached is the code with the test data.

scala.MatchError: {2016010606=1} (of class
scala.collection.convert.Wrappers$MapWrapper)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at org.apache.spark.sql.execution.aggregate.ScalaUDAF.eval(udaf.scala:440)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:228)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:220)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:152)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/01/25 18:20:58 INFO Executor: Executor is trying to kill task 29.0 in
stage 7.0 (TID 398)
17/01/25 18:20:58 INFO DAGScheduler: ResultStage 7 (show at
EdgeAggregator.java:29) failed in 0.699 s
17/01/25 18:20:58 INFO DAGScheduler: Job 3 failed: show at
EdgeAggregator.java:29, took 0.712912 s
In merge hr: 2016010606
17/01/25 18:20:58 WARN TaskSetManager: Lost task 29.0 in stage 7.0 (TID
398, localhost): scala.MatchError: {2016010606=1} (of class
scala.collection.convert.Wrappers$MapWrapper)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at org.apache.spark.sql.execution.aggregate.ScalaUDAF.eval(udaf.scala:440)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:228)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:220)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:152)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at 

Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Takeshi Yamamuro
AFAIK spark has no public APIs to clean up those RDDs.

On Wed, Jan 25, 2017 at 11:30 PM, Andrew Milkowski 
wrote:

> Hi Takeshi thanks for the answer, looks like spark would free up old RDD's
> however using admin UI we see ie
>
>  Block ID, it corresponds with each receiver and a timestamp.
> For example, block input-0-1485275695898 is from receiver 0 and it was
> created at 1485275695898 (1/24/2017, 11:34:55 AM GMT-5:00).
> That corresponds with the start time.
>
> that block even after running whole day is still not being released! RDD's
> in our scenario are Strings coming from kinesis stream
>
> is there a way to explicitly purge RDD after last step in M/R process once
> and for all ?
>
> thanks much!
>
> On Fri, Jan 20, 2017 at 2:35 AM, Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> AFAIK, the blocks of minibatch RDDs are checked every job finished, and
>> older blocks automatically removed (See:
>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L463).
>>
>> You can control this behaviour by StreamingContext#remember to some
>> extent.
>>
>> // maropu
>>
>>
>> On Fri, Jan 20, 2017 at 3:17 AM, Andrew Milkowski 
>> wrote:
>>
>>> hello
>>>
>>> using spark 2.0.2  and while running sample streaming app with kinesis
>>> noticed (in admin ui Storage tab)  "Stream Blocks" for each worker keeps
>>> climbing up
>>>
>>> then also (on same ui page) in Blocks section I see blocks such as below
>>>
>>> input-0-1484753367056
>>>
>>> that are marked as Memory Serialized
>>>
>>> that do not seem to be "released"
>>>
>>> above eventually consumes executor memories leading to out of memory
>>> exception on some
>>>
>>> is there a way to "release" these blocks free them up , app is sample
>>> m/r
>>>
>>> I attempted rdd.unpersist(false) in the code but that did not lead to
>>> memory free up
>>>
>>> thanks much in advance!
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
---
Takeshi Yamamuro


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Does the storage handler provide bulk load capability ?

Cheers

> On Jan 25, 2017, at 3:39 AM, Amrit Jangid  wrote:
> 
> Hi chetan,
> 
> If you just need HBase data in Hive, you can use a Hive EXTERNAL TABLE with
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.
> 
> Try this and see if it solves your problem:
> 
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
> 
> Regards
> Amrit
> 
> .
> 
>> On Wed, Jan 25, 2017 at 5:02 PM, Chetan Khatri  
>> wrote:
>> Hello Spark Community Folks,
>> 
>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk Load 
>> from Hbase to Hive.
>> 
>> I have seen couple of good example at HBase Github Repo: 
>> https://github.com/apache/hbase/tree/master/hbase-spark
>> 
>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done ? 
>> Or which version of HBase has more stability with HBaseContext ?
>> 
>> Thanks.
> 
> 
>  


Question about DStreamCheckpointData

2017-01-25 Thread Nikhil Goyal
Hi,

I am using DStreamCheckpointData and it seems that Spark checkpoints data
even if the RDD processing fails. It seems to checkpoint at the moment it
creates the RDD rather than waiting for its completion. Does anybody know how
to make it wait until completion?

Thanks
Nikhil


SparkML 1.6 GBTRegressor crashes with high maxIter hyper-parameter?

2017-01-25 Thread Aris
When I train a GBTRegressor model from a DataFrame in the latest
1.6.4-SNAPSHOT with a high value for the hyper-parameter maxIter, say
500, I get a java.lang.StackOverflowError; GBTRegressor does work with
maxIter set to about 100.

Does this make sense? Are there any known solutions? This is running
within executors with 18 GB of RAM each; I'm not sure how to debug this
short of somehow getting more JVM heap.
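
For context, a hedged sketch of the setup being described; the checkpoint
directory and interval are illustrative assumptions, and periodic checkpointing
is only a commonly suggested mitigation for deep-lineage StackOverflowErrors,
not a confirmed fix for this case:

import org.apache.spark.ml.regression.GBTRegressor

// Checkpointing every few iterations keeps the lineage from growing with maxIter.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed path

val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(500)
  .setCheckpointInterval(10)
val model = gbt.fit(trainingDF)                 // trainingDF: the user's DataFrame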

Thanks!


Re: where is mapWithState executed?

2017-01-25 Thread shyla deshpande
After more reading, I know the state is distributed across the cluster. But if
I need to look up a map in the update function, I need to broadcast it.
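
For illustration, a minimal sketch of that pattern (hedged: ssc, wordCounts and
the lookup contents are assumed names, not from this thread):

import org.apache.spark.streaming.{State, StateSpec}

// Broadcast the lookup map once from the driver; the mapWithState update
// function then reads it on the executors via .value.
val lookup = ssc.sparkContext.broadcast(Map("word1" -> "category1"))

val updateFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val category = lookup.value.getOrElse(word, "unknown")   // executor-side lookup
  val total = state.getOption.getOrElse(0) + one.getOrElse(0)
  state.update(total)
  (word, category, total)
}

val stateStream = wordCounts.mapWithState(StateSpec.function(updateFunc))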

Just want to make sure I am on the right path.

Appreciate your help. Thanks

On Wed, Jan 25, 2017 at 2:33 PM, shyla deshpande 
wrote:

> Is it executed on the driver or executor.  If I need to lookup a map in
> the updatefunction, I need to broadcast it,  if mapWithState executed runs
> on executor.
>
> Thanks
>


where is mapWithState executed?

2017-01-25 Thread shyla deshpande
Is it executed on the driver or on the executors? If I need to look up a map in
the update function, I need to broadcast it if mapWithState runs on
the executors.

Thanks


can we plz open up encoder on dataset

2017-01-25 Thread Koert Kuipers
I often run into problems like this:

I need to write a Dataset[T] => Dataset[T], and inside it I need to switch to
DataFrame for a particular operation.

But if I do:
dataset.toDF.map(...).as[T] I get the error:
Unable to find encoder for type stored in a Dataset.

I know it has an encoder, because I started with Dataset[T].

So I would like to be able to do:
dataset.toDF.map(...).as[T](dataset.encoder)
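
For reference, a minimal sketch of the usual workaround today (hedged: the
helper name is illustrative) is to carry an implicit Encoder[T] on the method
so that .as[T] can resolve it, since the Dataset's own encoder isn't publicly
accessible:

import org.apache.spark.sql.{Dataset, Encoder}

// The implicit Encoder[T] lets .as[T] re-attach the object view after the
// DataFrame-only transformation.
def viaDataFrame[T](ds: Dataset[T])(implicit enc: Encoder[T]): Dataset[T] = {
  val df = ds.toDF()
  // ... DataFrame-only operations go here ...
  df.as[T]
}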


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Processing the same data more than once can happen only when the app
recovers after a failure or during an upgrade. So how do I apply your 2nd
solution only for 1-2 hours after a restart?

On Wed, Jan 25, 2017 at 12:51 PM, shyla deshpande 
wrote:

> Thanks Burak. I do want accuracy, that is why I want to make it
> idempotent.
> I will try out your 2nd solution.
>
> On Wed, Jan 25, 2017 at 12:27 PM, Burak Yavuz  wrote:
>
>> Yes you may. Depends on if you want exact values or if you're okay with
>> approximations. With Big Data, generally you would be okay with
>> approximations. Try both out, see what scales/works with your dataset.
>> Maybe you may handle the second implementation.
>>
>> On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Thanks Burak. But with BloomFilter, won't I be getting a false poisitve?
>>>
>>> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz  wrote:
>>>
 I noticed that 1 wouldn't be a problem, because you'll save the
 BloomFilter in the state.

 For 2, you would keep a Map of UUID's to the timestamp of when you saw
 them. If the UUID exists in the map, then you wouldn't increase the count.
 If the timestamp of a UUID expires, you would remove it from the map. The
 reason we remove from the map is to keep a bounded amount of space. It'll
 probably take a lot more space than the BloomFilter though depending on
 your data volume.

 On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> In the previous email you gave me 2 solutions
> 1. Bloom filter --> problem in repopulating the bloom filter on
> restarts
> 2. keeping the state of the unique ids
>
> Please elaborate on 2.
>
>
>
> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz 
> wrote:
>
>> I don't have any sample code, but on a high level:
>>
>> My state would be: (Long, BloomFilter[UUID])
>> In the update function, my value will be the UUID of the record,
>> since the word itself is the key.
>> I'll ask my BloomFilter if I've seen this UUID before. If not
>> increase count, also add to Filter.
>>
>> Does that make sense?
>>
>>
>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Hi Burak,
>>> Thanks for the response. Can you please elaborate on your idea of
>>> storing the state of the unique ids.
>>> Do you have any sample code or links I can refer to.
>>> Thanks
>>>
>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz 
>>> wrote:
>>>
 Off the top of my head... (Each may have it's own issues)

 If upstream you add a uniqueId to all your records, then you may
 use a BloomFilter to approximate if you've seen a row before.
 The problem I can see with that approach is how to repopulate the
 bloom filter on restarts.

 If you are certain that you're not going to reprocess some data
 after a certain time, i.e. there is no way I'm going to get the same 
 data
 in 2 hours, it may only happen in the last 2 hours, then you may also 
 keep
 the state of uniqueId's as well, and then age them out after a certain 
 time.


 Best,
 Burak

 On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> Please share your thoughts.
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>>
>>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores lot of aggregations using
>>> mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it
>>> idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 In a Wordcount application which  stores the count of all the
 words input so far using mapWithState.  How do I make sure my 
 counts are
 not messed up if I happen to read a line more than once?

 Appreciate your response.

 Thanks

>>>
>>>
>>
>

>>>
>>
>

>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Thanks Burak. I do want accuracy, that is why I want to make it idempotent.
I will try out your 2nd solution.

On Wed, Jan 25, 2017 at 12:27 PM, Burak Yavuz  wrote:

> Yes you may. Depends on if you want exact values or if you're okay with
> approximations. With Big Data, generally you would be okay with
> approximations. Try both out, see what scales/works with your dataset.
> Maybe you may handle the second implementation.
>
> On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Thanks Burak. But with BloomFilter, won't I be getting a false poisitve?
>>
>> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz  wrote:
>>
>>> I noticed that 1 wouldn't be a problem, because you'll save the
>>> BloomFilter in the state.
>>>
>>> For 2, you would keep a Map of UUID's to the timestamp of when you saw
>>> them. If the UUID exists in the map, then you wouldn't increase the count.
>>> If the timestamp of a UUID expires, you would remove it from the map. The
>>> reason we remove from the map is to keep a bounded amount of space. It'll
>>> probably take a lot more space than the BloomFilter though depending on
>>> your data volume.
>>>
>>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 In the previous email you gave me 2 solutions
 1. Bloom filter --> problem in repopulating the bloom filter on
 restarts
 2. keeping the state of the unique ids

 Please elaborate on 2.



 On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz  wrote:

> I don't have any sample code, but on a high level:
>
> My state would be: (Long, BloomFilter[UUID])
> In the update function, my value will be the UUID of the record, since
> the word itself is the key.
> I'll ask my BloomFilter if I've seen this UUID before. If not increase
> count, also add to Filter.
>
> Does that make sense?
>
>
> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Hi Burak,
>> Thanks for the response. Can you please elaborate on your idea of
>> storing the state of the unique ids.
>> Do you have any sample code or links I can refer to.
>> Thanks
>>
>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz 
>> wrote:
>>
>>> Off the top of my head... (Each may have it's own issues)
>>>
>>> If upstream you add a uniqueId to all your records, then you may use
>>> a BloomFilter to approximate if you've seen a row before.
>>> The problem I can see with that approach is how to repopulate the
>>> bloom filter on restarts.
>>>
>>> If you are certain that you're not going to reprocess some data
>>> after a certain time, i.e. there is no way I'm going to get the same 
>>> data
>>> in 2 hours, it may only happen in the last 2 hours, then you may also 
>>> keep
>>> the state of uniqueId's as well, and then age them out after a certain 
>>> time.
>>>
>>>
>>> Best,
>>> Burak
>>>
>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 Please share your thoughts.

 On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

>
>
> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> My streaming application stores lot of aggregations using
>> mapWithState.
>>
>> I want to know what are all the possible ways I can make it
>> idempotent.
>>
>> Please share your views.
>>
>> Thanks
>>
>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> In a Wordcount application which  stores the count of all the
>>> words input so far using mapWithState.  How do I make sure my 
>>> counts are
>>> not messed up if I happen to read a line more than once?
>>>
>>> Appreciate your response.
>>>
>>> Thanks
>>>
>>
>>
>

>>>
>>
>

>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Yes you may. Depends on if you want exact values or if you're okay with
approximations. With Big Data, generally you would be okay with
approximations. Try both out, see what scales/works with your dataset.
Maybe you may handle the second implementation.

On Wed, Jan 25, 2017 at 12:23 PM, shyla deshpande 
wrote:

> Thanks Burak. But with BloomFilter, won't I be getting a false poisitve?
>
> On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz  wrote:
>
>> I noticed that 1 wouldn't be a problem, because you'll save the
>> BloomFilter in the state.
>>
>> For 2, you would keep a Map of UUID's to the timestamp of when you saw
>> them. If the UUID exists in the map, then you wouldn't increase the count.
>> If the timestamp of a UUID expires, you would remove it from the map. The
>> reason we remove from the map is to keep a bounded amount of space. It'll
>> probably take a lot more space than the BloomFilter though depending on
>> your data volume.
>>
>> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> In the previous email you gave me 2 solutions
>>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>>> 2. keeping the state of the unique ids
>>>
>>> Please elaborate on 2.
>>>
>>>
>>>
>>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz  wrote:
>>>
 I don't have any sample code, but on a high level:

 My state would be: (Long, BloomFilter[UUID])
 In the update function, my value will be the UUID of the record, since
 the word itself is the key.
 I'll ask my BloomFilter if I've seen this UUID before. If not increase
 count, also add to Filter.

 Does that make sense?


 On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> Hi Burak,
> Thanks for the response. Can you please elaborate on your idea of
> storing the state of the unique ids.
> Do you have any sample code or links I can refer to.
> Thanks
>
> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:
>
>> Off the top of my head... (Each may have it's own issues)
>>
>> If upstream you add a uniqueId to all your records, then you may use
>> a BloomFilter to approximate if you've seen a row before.
>> The problem I can see with that approach is how to repopulate the
>> bloom filter on restarts.
>>
>> If you are certain that you're not going to reprocess some data after
>> a certain time, i.e. there is no way I'm going to get the same data in 2
>> hours, it may only happen in the last 2 hours, then you may also keep the
>> state of uniqueId's as well, and then age them out after a certain time.
>>
>>
>> Best,
>> Burak
>>
>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Please share your thoughts.
>>>
>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>


 On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> My streaming application stores lot of aggregations using
> mapWithState.
>
> I want to know what are all the possible ways I can make it
> idempotent.
>
> Please share your views.
>
> Thanks
>
> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> In a Wordcount application which  stores the count of all the
>> words input so far using mapWithState.  How do I make sure my counts 
>> are
>> not messed up if I happen to read a line more than once?
>>
>> Appreciate your response.
>>
>> Thanks
>>
>
>

>>>
>>
>

>>>
>>
>


Drop Partition Fails

2017-01-25 Thread Subacini Balakrishnan
Hi,

When we execute a drop partition command on a Hive external table from
spark-shell, we get the error below. The same command works fine from the Hive
shell.

It is a table with just two records

Spark Version : 1.5.2

scala> hiveCtx.sql("select * from
spark_2_test").collect().foreach(println);
[1210,xcv,2016-10-10]
[1210,xcv,2016-10-11]

Show create table:

CREATE EXTERNAL TABLE `spark_2_test`(
`name` string,
`dept` string)
PARTITIONED BY (
`server_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs:///spark/sp'
TBLPROPERTIES (
'STATS_GENERATED_VIA_STATS_TASK'='true',
'transient_lastDdlTime'='1485202737')


scala> hiveCtx.sql("ALTER TABLE spark_2_test DROP IF EXISTS PARTITION
(server_date ='2016-10-10')")



Thanks in advance,
Subacini

17/01/23 22:09:04 ERROR Driver: FAILED: SemanticException [Error 10006]:
Partition not found (server_date = 2016-10-10)
org.apache.hadoop.hive.ql.parse.SemanticException: Partition not found
(server_date = 2016-10-10)
at
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.addTableDropPartsOutputs(DDLSemanticAnalyzer.java:3178)
at
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableDropParts(DDLSemanticAnalyzer.java:2694)
at
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:278)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1122)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1170)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:451)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:440)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:278)
at
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:233)
at
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:270)
at
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:440)
at
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:430)
at
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:561)
at
org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line105.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $line105.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $line105.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $line105.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $line105.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $line105.$read$$iwC$$iwC$$iwC.<init>(<console>:37)
at $line105.$read$$iwC$$iwC.<init>(<console>:39)
at $line105.$read$$iwC.<init>(<console>:41)
at $line105.$read.<init>(<console>:43)
at $line105.$read$.<init>(<console>:47)
at $line105.$read$.<clinit>()
at $line105.$eval$.<init>(<console>:7)
at $line105.$eval$.<clinit>()
at $line105.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Thanks Burak. But with a BloomFilter, won't I be getting false positives?

On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz  wrote:

> I noticed that 1 wouldn't be a problem, because you'll save the
> BloomFilter in the state.
>
> For 2, you would keep a Map of UUID's to the timestamp of when you saw
> them. If the UUID exists in the map, then you wouldn't increase the count.
> If the timestamp of a UUID expires, you would remove it from the map. The
> reason we remove from the map is to keep a bounded amount of space. It'll
> probably take a lot more space than the BloomFilter though depending on
> your data volume.
>
> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> In the previous email you gave me 2 solutions
>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>> 2. keeping the state of the unique ids
>>
>> Please elaborate on 2.
>>
>>
>>
>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz  wrote:
>>
>>> I don't have any sample code, but on a high level:
>>>
>>> My state would be: (Long, BloomFilter[UUID])
>>> In the update function, my value will be the UUID of the record, since
>>> the word itself is the key.
>>> I'll ask my BloomFilter if I've seen this UUID before. If not increase
>>> count, also add to Filter.
>>>
>>> Does that make sense?
>>>
>>>
>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 Hi Burak,
 Thanks for the response. Can you please elaborate on your idea of
 storing the state of the unique ids.
 Do you have any sample code or links I can refer to.
 Thanks

 On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:

> Off the top of my head... (Each may have it's own issues)
>
> If upstream you add a uniqueId to all your records, then you may use a
> BloomFilter to approximate if you've seen a row before.
> The problem I can see with that approach is how to repopulate the
> bloom filter on restarts.
>
> If you are certain that you're not going to reprocess some data after
> a certain time, i.e. there is no way I'm going to get the same data in 2
> hours, it may only happen in the last 2 hours, then you may also keep the
> state of uniqueId's as well, and then age them out after a certain time.
>
>
> Best,
> Burak
>
> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> Please share your thoughts.
>>
>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>>
>>>
>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 My streaming application stores lot of aggregations using
 mapWithState.

 I want to know what are all the possible ways I can make it
 idempotent.

 Please share your views.

 Thanks

 On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> In a Wordcount application which  stores the count of all the
> words input so far using mapWithState.  How do I make sure my counts 
> are
> not messed up if I happen to read a line more than once?
>
> Appreciate your response.
>
> Thanks
>


>>>
>>
>

>>>
>>
>


[ML - Beginner - How To] - GaussianMixtureModel and GaussianMixtureModel$

2017-01-25 Thread Saulo Ricci
Hi,

I'm studying the Java implementation code of the ml library, and I'd like
to know why there are two implementations of GaussianMixtureModel - #1
GaussianMixtureModel and #2 GaussianMixtureModel$.

I appreciate the answers.

Thank you,
Saulo


Re: spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
Oh sorry, it's actually in the documentation. I should just
set spark.worker.cleanup.enabled = true.
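
For reference, a hedged sketch of how that is typically set for standalone
workers via SPARK_WORKER_OPTS in conf/spark-env.sh (the interval and TTL values
below are illustrative assumptions):

# conf/spark-env.sh on each worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"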

On Wed, Jan 25, 2017 at 11:30 AM, kant kodali  wrote:

> I have bunch of .index and .data files like that fills up my disk. I am
> not sure what the fix is? I am running spark 2.0.2 in stand alone mode
>
> Thanks!
>
>
>
>


Decompressing Spark Eventlogs with snappy (*.snappy)

2017-01-25 Thread satishl
Our spark job eventlogs are stored in compressed .snappy format. 
What do I need to do to decompress these files programmatically?






spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
I have a bunch of .index and .data files that fill up my disk. I am not
sure what the fix is. I am running Spark 2.0.2 in standalone mode.

Thanks!


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I noticed that 1 wouldn't be a problem, because you'll save the BloomFilter
in the state.

For 2, you would keep a Map of UUID's to the timestamp of when you saw
them. If the UUID exists in the map, then you wouldn't increase the count.
If the timestamp of a UUID expires, you would remove it from the map. The
reason we remove from the map is to keep a bounded amount of space. It'll
probably take a lot more space than the BloomFilter though depending on
your data volume.

On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande 
wrote:

> In the previous email you gave me 2 solutions
> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
> 2. keeping the state of the unique ids
>
> Please elaborate on 2.
>
>
>
> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz  wrote:
>
>> I don't have any sample code, but on a high level:
>>
>> My state would be: (Long, BloomFilter[UUID])
>> In the update function, my value will be the UUID of the record, since
>> the word itself is the key.
>> I'll ask my BloomFilter if I've seen this UUID before. If not increase
>> count, also add to Filter.
>>
>> Does that make sense?
>>
>>
>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Hi Burak,
>>> Thanks for the response. Can you please elaborate on your idea of
>>> storing the state of the unique ids.
>>> Do you have any sample code or links I can refer to.
>>> Thanks
>>>
>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:
>>>
 Off the top of my head... (Each may have it's own issues)

 If upstream you add a uniqueId to all your records, then you may use a
 BloomFilter to approximate if you've seen a row before.
 The problem I can see with that approach is how to repopulate the bloom
 filter on restarts.

 If you are certain that you're not going to reprocess some data after a
 certain time, i.e. there is no way I'm going to get the same data in 2
 hours, it may only happen in the last 2 hours, then you may also keep the
 state of uniqueId's as well, and then age them out after a certain time.


 Best,
 Burak

 On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> Please share your thoughts.
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>>
>>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores lot of aggregations using
>>> mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it
>>> idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 In a Wordcount application which  stores the count of all the words
 input so far using mapWithState.  How do I make sure my counts are not
 messed up if I happen to read a line more than once?

 Appreciate your response.

 Thanks

>>>
>>>
>>
>

>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
In the previous email you gave me 2 solutions
1. Bloom filter --> problem in repopulating the bloom filter on restarts
2. keeping the state of the unique ids

Please elaborate on 2.



On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz  wrote:

> I don't have any sample code, but on a high level:
>
> My state would be: (Long, BloomFilter[UUID])
> In the update function, my value will be the UUID of the record, since the
> word itself is the key.
> I'll ask my BloomFilter if I've seen this UUID before. If not increase
> count, also add to Filter.
>
> Does that make sense?
>
>
> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande  > wrote:
>
>> Hi Burak,
>> Thanks for the response. Can you please elaborate on your idea of storing
>> the state of the unique ids.
>> Do you have any sample code or links I can refer to.
>> Thanks
>>
>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:
>>
>>> Off the top of my head... (Each may have it's own issues)
>>>
>>> If upstream you add a uniqueId to all your records, then you may use a
>>> BloomFilter to approximate if you've seen a row before.
>>> The problem I can see with that approach is how to repopulate the bloom
>>> filter on restarts.
>>>
>>> If you are certain that you're not going to reprocess some data after a
>>> certain time, i.e. there is no way I'm going to get the same data in 2
>>> hours, it may only happen in the last 2 hours, then you may also keep the
>>> state of uniqueId's as well, and then age them out after a certain time.
>>>
>>>
>>> Best,
>>> Burak
>>>
>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 Please share your thoughts.

 On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

>
>
> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> My streaming application stores lot of aggregations using
>> mapWithState.
>>
>> I want to know what are all the possible ways I can make it
>> idempotent.
>>
>> Please share your views.
>>
>> Thanks
>>
>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> In a Wordcount application which  stores the count of all the words
>>> input so far using mapWithState.  How do I make sure my counts are not
>>> messed up if I happen to read a line more than once?
>>>
>>> Appreciate your response.
>>>
>>> Thanks
>>>
>>
>>
>

>>>
>>
>


Spark Summit East in Boston ‒ 20% off Code

2017-01-25 Thread Scott walent
There's less than two weeks to go until Spark Summit East 2017, happening
February 7-9 at the Hynes Convention Center in downtown Boston. It will be
the largest Spark Summit conference ever held on the East Coast, and we
hope to see you there. Sign up at https://spark-summit.org/east-2017
and use promo code "SPARK17" to save 20% on a two-day pass.

The program will explore the future of Apache Spark and the latest
developments in data science, artificial intelligence, machine learning and
more, with 110+ community talks in seven different tracks. There will also
be a dozen keynote presentations featuring leading experts from the Broad
Institute, Databricks, Forrester, IBM, Intel, UC Berkeley's new RISE Lab
and other organizations.

You're also invited to participate in the various networking activities
associated with the conference, like the pre-conference Meetup with the
Boston Apache Users Group and the Women in Big Data luncheon.

View the full schedule and register to attend at
https://spark-summit.org/east-2017. We look forward to seeing you there.


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I don't have any sample code, but on a high level:

My state would be: (Long, BloomFilter[UUID])
In the update function, my value will be the UUID of the record, since the
word itself is the key.
I'll ask my BloomFilter if I've seen this UUID before. If not increase
count, also add to Filter.

Does that make sense?
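
For illustration, a minimal sketch of that state layout using Spark's built-in
sketch library (hedged: the sizing, false-positive rate and UUID-as-String
choice are assumptions):

import org.apache.spark.util.sketch.BloomFilter

// State per word: (count, BloomFilter of UUIDs already counted).
def updateState(uuid: String, state: Option[(Long, BloomFilter)]): (Long, BloomFilter) = {
  val (count, seen) = state.getOrElse((0L, BloomFilter.create(1000000L, 0.01)))
  if (seen.mightContain(uuid)) {
    (count, seen)            // probably seen before: don't count it again
  } else {
    seen.put(uuid)           // remember the UUID and count the record once
    (count + 1L, seen)
  }
}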


On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande 
wrote:

> Hi Burak,
> Thanks for the response. Can you please elaborate on your idea of storing
> the state of the unique ids.
> Do you have any sample code or links I can refer to.
> Thanks
>
> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:
>
>> Off the top of my head... (Each may have it's own issues)
>>
>> If upstream you add a uniqueId to all your records, then you may use a
>> BloomFilter to approximate if you've seen a row before.
>> The problem I can see with that approach is how to repopulate the bloom
>> filter on restarts.
>>
>> If you are certain that you're not going to reprocess some data after a
>> certain time, i.e. there is no way I'm going to get the same data in 2
>> hours, it may only happen in the last 2 hours, then you may also keep the
>> state of uniqueId's as well, and then age them out after a certain time.
>>
>>
>> Best,
>> Burak
>>
>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Please share your thoughts.
>>>
>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>


 On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> My streaming application stores lot of aggregations using
> mapWithState.
>
> I want to know what are all the possible ways I can make it
> idempotent.
>
> Please share your views.
>
> Thanks
>
> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> In a Wordcount application which  stores the count of all the words
>> input so far using mapWithState.  How do I make sure my counts are not
>> messed up if I happen to read a line more than once?
>>
>> Appreciate your response.
>>
>> Thanks
>>
>
>

>>>
>>
>


Re: printSchema showing incorrect datatype?

2017-01-25 Thread Koert Kuipers
should we change "def schema" to show the materialized schema?

On Wed, Jan 25, 2017 at 1:04 PM, Michael Armbrust 
wrote:

> Encoders are just an object-based view on a Dataset.  Until you actually
> materialize an object, they are not used and thus will not change the
> schema of the dataframe.
>
> On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers  wrote:
>
>> scala> val x = Seq("a", "b").toDF("x")
>> x: org.apache.spark.sql.DataFrame = [x: string]
>>
>> scala> x.as[Array[Byte]].printSchema
>> root
>>  |-- x: string (nullable = true)
>>
>> scala> x.as[Array[Byte]].map(x => x).printSchema
>> root
>>  |-- value: binary (nullable = true)
>>
>> why does the first schema show string instead of binary?
>>
>
>


Catalyst Expression(s) - Cleanup

2017-01-25 Thread Bowden, Chris
Is there currently any way to receive a signal when an Expression will no
longer receive any rows, so that internal resources can be cleaned up?

I have seen that Generators are allowed to terminate(), but my Expression(s)
do not need to emit 0..N rows.


Re: printSchema showing incorrect datatype?

2017-01-25 Thread Michael Armbrust
Encoders are just an object-based view on a Dataset. Until you actually
materialize an object, they are not used and thus will not change the
schema of the dataframe.

On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers  wrote:

> scala> val x = Seq("a", "b").toDF("x")
> x: org.apache.spark.sql.DataFrame = [x: string]
>
> scala> x.as[Array[Byte]].printSchema
> root
>  |-- x: string (nullable = true)
>
> scala> x.as[Array[Byte]].map(x => x).printSchema
> root
>  |-- value: binary (nullable = true)
>
> why does the first schema show string instead of binary?
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Hi Burak,
Thanks for the response. Can you please elaborate on your idea of storing
the state of the unique ids?
Do you have any sample code or links I can refer to?
Thanks

On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz  wrote:

> Off the top of my head... (Each may have it's own issues)
>
> If upstream you add a uniqueId to all your records, then you may use a
> BloomFilter to approximate if you've seen a row before.
> The problem I can see with that approach is how to repopulate the bloom
> filter on restarts.
>
> If you are certain that you're not going to reprocess some data after a
> certain time, i.e. there is no way I'm going to get the same data in 2
> hours, it may only happen in the last 2 hours, then you may also keep the
> state of uniqueId's as well, and then age them out after a certain time.
>
>
> Best,
> Burak
>
> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande  > wrote:
>
>> Please share your thoughts.
>>
>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>>
>>>
>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 My streaming application stores lot of aggregations using mapWithState.

 I want to know what are all the possible ways I can make it idempotent.

 Please share your views.

 Thanks

 On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
 deshpandesh...@gmail.com> wrote:

> In a Wordcount application which  stores the count of all the words
> input so far using mapWithState.  How do I make sure my counts are not
> messed up if I happen to read a line more than once?
>
> Appreciate your response.
>
> Thanks
>


>>>
>>
>


Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Off the top of my head... (each may have its own issues)

If upstream you add a uniqueId to all your records, then you may use a
BloomFilter to approximate whether you've seen a row before.
The problem I can see with that approach is how to repopulate the bloom
filter on restarts.

If you are certain that you're not going to reprocess some data after a
certain time (i.e. there is no way you'll get the same data in 2 hours; it
may only happen within the last 2 hours), then you may also keep the state
of uniqueIds and age them out after a certain time.
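
A rough sketch of the ageing-out part using mapWithState's built-in state
timeout (illustrative only, not code from this thread; assumes Spark 1.6+
and a per-key set of record ids as the state):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Keep the set of ids seen per word, and let Spark drop a word's state
// after the word has been idle for 2 hours.
val spec = StateSpec.function(
  (word: String, id: Option[String], state: State[Set[String]]) => {
    if (state.isTimingOut()) {
      (word, state.get.size)              // state is being evicted after 2 idle hours
    } else {
      val seen = state.getOption.getOrElse(Set.empty[String]) ++ id
      state.update(seen)
      (word, seen.size)
    }
  }).timeout(Minutes(120))

// records: DStream[(String, String)] of (word, id); duplicate ids within the window are ignored
// val deduped = records.mapWithState(spec)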


Best,
Burak

On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande 
wrote:

> Please share your thoughts.
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande  > wrote:
>
>>
>>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores lot of aggregations using mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 In a Wordcount application which  stores the count of all the words
 input so far using mapWithState.  How do I make sure my counts are not
 messed up if I happen to read a line more than once?

 Appreciate your response.

 Thanks

>>>
>>>
>>
>


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor specific.

Suggest contacting vendor's mailing list for your PR.

My initial interpretation of HBase repository is that of Apache.

Cheers

On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri 
wrote:

> @Ted Yu, Correct but HBase-Spark module available at HBase repository
> seems too old and written code is not optimized yet, I have been already
> submitted PR for the same. I dont know if it is clearly mentioned that now
> it is part of HBase itself then people are committing to older repo where
> original code is still old. [1]
>
> Other sources has updated info [2]
>
> Ref.
> [1] http://blog.cloudera.com/blog/2015/08/apache-spark-
> comes-to-apache-hbase-with-hbase-spark-module/
> [2] https://github.com/cloudera-labs/SparkOnHBase ,
> https://github.com/esamson/SparkOnHBase
>
> On Wed, Jan 25, 2017 at 8:13 PM, Ted Yu  wrote:
>
>> Though no hbase release has the hbase-spark module, you can find the
>> backport patch on HBASE-14160 (for Spark 1.6)
>>
>> You can build the hbase-spark module yourself.
>>
>> Cheers
>>
>> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community Folks,
>>>
>>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
>>> Load from Hbase to Hive.
>>>
>>> I have seen couple of good example at HBase Github Repo:
>>> https://github.com/apache/hbase/tree/master/hbase-spark
>>>
>>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done
>>> ? Or which version of HBase has more stability with HBaseContext ?
>>>
>>> Thanks.
>>>
>>
>>
>


Re: Failure handling

2017-01-25 Thread Erwan ALLAIN
I agree.

We try/catch around StreamingContext.awaitTermination, and when an
exception occurs we stop the streaming context and call System.exit(50)
(equal to SparkUnhandledCode).

Sounds ok.
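
Roughly, that pattern looks like the following (a simplified sketch; the
app name and the 5-second batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("my-streaming-app")
val ssc  = new StreamingContext(conf, Seconds(5))
// createDirectStream(...), the two map/foreachRDD actions, etc. go here

ssc.start()
try {
  ssc.awaitTermination()
} catch {
  case e: Throwable =>
    // Any failure that surfaces here stops the context and kills the JVM,
    // so the checkpointing action never runs on top of a broken batch.
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    System.exit(50)  // the exit code mentioned above
}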

On Tuesday, January 24, 2017, Cody Koeninger  wrote:

> Can you identify the error case and call System.exit ?  It'll get
> retried on another executor, but as long as that one fails the same
> way...
>
> If you can identify the error case at the time you're doing database
> interaction and just prevent data being written then, that's what I
> typically do.
>
> On Tue, Jan 24, 2017 at 7:50 AM, Erwan ALLAIN  > wrote:
> > Hello guys,
> >
> > I have a question regarding how spark handle failure.
> >
> > I’m using kafka direct stream
> > Spark 2.0.2
> > Kafka 0.10.0.1
> >
> > Here is a snippet of code
> >
> > val stream = createDirectStream(….)
> >
> > stream
> >  .map(…)
> > .forEachRDD( doSomething)
> >
> > stream
> > .map(…)
> > .forEachRDD( doSomethingElse)
> >
> > The execution is in FIFO, so the first action ends after the second
> starts
> > so far so good.
> > However, I would like that when an error (fatal or not) occurs during the
> > execution of the first action, the streaming context is stopped
> immediately.
> > It's like the driver is not notified of the exception and launch the
> second
> > action.
> >
> > In our case, the second action is performing checkpointing in an external
> > database and we do not want to checkpoint if an error occurs before.
> > We do not want to rely on spark checkpoint as it causes issue when
> upgrading
> > application.
> >
> > Let me know if it’s not clear !
> >
> > Thanks !
> >
> > Erwan
>


Re: HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
@Ted Yu, correct, but the HBase-Spark module available in the HBase
repository seems too old and the code there is not optimized yet; I have
already submitted a PR for the same. I don't know whether it is clearly
stated that it is now part of HBase itself, because people are still
committing to the older repos where the original code remains old. [1]

Other sources have updated info. [2]

Ref.
[1]
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
[2] https://github.com/cloudera-labs/SparkOnHBase ,
https://github.com/esamson/SparkOnHBase

On Wed, Jan 25, 2017 at 8:13 PM, Ted Yu  wrote:

> Though no hbase release has the hbase-spark module, you can find the
> backport patch on HBASE-14160 (for Spark 1.6)
>
> You can build the hbase-spark module yourself.
>
> Cheers
>
> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Spark Community Folks,
>>
>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
>> Load from Hbase to Hive.
>>
>> I have seen couple of good example at HBase Github Repo:
>> https://github.com/apache/hbase/tree/master/hbase-spark
>>
>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done
>> ? Or which version of HBase has more stability with HBaseContext ?
>>
>> Thanks.
>>
>
>


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the
backport patch on HBASE-14160 (for Spark 1.6)

You can build the hbase-spark module yourself.

Cheers

On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri 
wrote:

> Hello Spark Community Folks,
>
> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
> Load from Hbase to Hive.
>
> I have seen couple of good example at HBase Github Repo:
> https://github.com/apache/hbase/tree/master/hbase-spark
>
> If I would like to use HBaseContext with HBase 1.2.4, how it can be done ?
> Or which version of HBase has more stability with HBaseContext ?
>
> Thanks.
>


Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Andrew Milkowski
Hi Takeshi, thanks for the answer. It looks like Spark should free up old
RDDs; however, in the admin UI we see, for example:

Block ID: it corresponds to a receiver and a timestamp.
For example, block input-0-1485275695898 is from receiver 0 and it was
created at 1485275695898 (1/24/2017, 11:34:55 AM GMT-5:00).
That corresponds with the start time.

That block, even after the app runs for a whole day, is still not released!
The RDDs in our scenario are Strings coming from a Kinesis stream.

Is there a way to explicitly purge an RDD after the last step of the
map/reduce process, once and for all?

thanks much!
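
For reference, the StreamingContext#remember knob mentioned in the quoted
reply below is set like this (a sketch only; the app name, batch interval
and 2-minute window are placeholders, and shortening the remember window is
not a confirmed fix for the build-up described here):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("kinesis-sample")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Remember generated RDDs (and their input blocks) only for the last 2 minutes;
// older ones become eligible for cleanup.
ssc.remember(Minutes(2))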

On Fri, Jan 20, 2017 at 2:35 AM, Takeshi Yamamuro 
wrote:

> Hi,
>
> AFAIK, the blocks of minibatch RDDs are checked every job finished, and
> older blocks automatically removed (See: https://github.com/
> apache/spark/blob/master/streaming/src/main/scala/org/
> apache/spark/streaming/dstream/DStream.scala#L463).
>
> You can control this behaviour by StreamingContext#remember to some extent.
>
> // maropu
>
>
> On Fri, Jan 20, 2017 at 3:17 AM, Andrew Milkowski 
> wrote:
>
>> hello
>>
>> using spark 2.0.2  and while running sample streaming app with kinesis
>> noticed (in admin ui Storage tab)  "Stream Blocks" for each worker keeps
>> climbing up
>>
>> then also (on same ui page) in Blocks section I see blocks such as below
>>
>> input-0-1484753367056
>>
>> that are marked as Memory Serialized
>>
>> that do not seem to be "released"
>>
>> above eventually consumes executor memories leading to out of memory
>> exception on some
>>
>> is there a way to "release" these blocks free them up , app is sample m/r
>>
>> I attempted rdd.unpersist(false) in the code but that did not lead to
>> memory free up
>>
>> thanks much in advance!
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Java heap error during matrix multiplication

2017-01-25 Thread Petr Shestov
Hi all!

I'm using Spark 2.0.1 with two workers (one executor each) with 20 GB each,
and I run the following code:

JavaRDD<MatrixEntry> entries = ...; // filling the data
CoordinateMatrix cmatrix = new CoordinateMatrix(entries.rdd());
BlockMatrix matrix = cmatrix.toBlockMatrix(100, 1000);
BlockMatrix cooc = matrix.transpose().multiply(matrix);

My matrix is approx 8 000 000 x 3000, but only 10 000 000 cells have
meaningful value. During multiplication I always get:

17/01/24 08:03:10 WARN TaskMemoryManager: leak 1322.6 MB memory from 
org.apache.spark.util.collection.ExternalAppendOnlyMap@649e7019
17/01/24 08:03:10 ERROR Executor: Exception in task 1.0 in stage 57.0 (TID 83664)
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.mllib.linalg.DenseMatrix$.zeros(Matrices.scala:453)
at 
org.apache.spark.mllib.linalg.Matrix$class.multiply(Matrices.scala:101)
at 
org.apache.spark.mllib.linalg.SparseMatrix.multiply(Matrices.scala:565)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9$$anonfun$apply$11.apply(BlockMatrix.scala:483)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9$$anonfun$apply$11.apply(BlockMatrix.scala:480)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9.apply(BlockMatrix.scala:480)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23$$anonfun$apply$9.apply(BlockMatrix.scala:479)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.util.collection.CompactBuffer$$anon$1.foreach(CompactBuffer.scala:115)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.util.collection.CompactBuffer.foreach(CompactBuffer.scala:30)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at 
org.apache.spark.util.collection.CompactBuffer.flatMap(CompactBuffer.scala:30)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23.apply(BlockMatrix.scala:479)
at 
org.apache.spark.mllib.linalg.distributed.BlockMatrix$$anonfun$23.apply(BlockMatrix.scala:478)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Now I'm even trying to use only one core per executor. What could the
problem be? How can I debug it and find the root cause? What could I be
missing in the Spark configuration?

I've already tried increasing spark.default.parallelism and decreasing the
block size for the BlockMatrix.
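
For what it's worth, "decreasing block size" means something like the
following (the numbers are illustrative, not a verified fix):

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// entriesRdd: RDD[MatrixEntry] holding the same entries as above.
def cooccurrence(entriesRdd: RDD[MatrixEntry]): BlockMatrix = {
  val cmatrix = new CoordinateMatrix(entriesRdd)
  // Smaller, square blocks keep the dense per-block buffers built during multiply smaller.
  val matrix = cmatrix.toBlockMatrix(512, 512).cache()
  matrix.transpose.multiply(matrix)
}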

Thanks.


Re: HBaseContext with Spark

2017-01-25 Thread Amrit Jangid
Hi Chetan,

If you just need HBase data in Hive, you can use a Hive EXTERNAL TABLE with

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.

Try this and see if it solves your problem:

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
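
As a rough sketch of that approach (HiveQL to run in Hive, e.g. via
beeline; the table, column family and column names below are placeholders):

-- Map the existing HBase table into Hive via the HBase storage handler.
CREATE EXTERNAL TABLE IF NOT EXISTS hbase_source (
  rowkey STRING,
  col1   STRING,
  col2   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:col2")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");

-- Then materialize it as a plain Hive table if a real copy ("bulk load") is needed.
CREATE TABLE IF NOT EXISTS hive_target AS SELECT * FROM hbase_source;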


Regards

Amrit



On Wed, Jan 25, 2017 at 5:02 PM, Chetan Khatri 
wrote:

> Hello Spark Community Folks,
>
> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
> Load from Hbase to Hive.
>
> I have seen couple of good example at HBase Github Repo:
> https://github.com/apache/hbase/tree/master/hbase-spark
>
> If I would like to use HBaseContext with HBase 1.2.4, how it can be done ?
> Or which version of HBase has more stability with HBaseContext ?
>
> Thanks.
>


HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Hello Spark Community Folks,

Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking to bulk
load from HBase to Hive.

I have seen a couple of good examples in the HBase GitHub repo:
https://github.com/apache/hbase/tree/master/hbase-spark

If I would like to use HBaseContext with HBase 1.2.4, how can it be done?
Or which version of HBase is more stable with HBaseContext?

Thanks.


Re: Scala Developers

2017-01-25 Thread Sean Owen
Yes, job postings are strongly discouraged on ASF lists, if not outright
disallowed. You will sometimes see posts prefixed with [JOBS] that are
tolerated, but here I would assume they are not. This particular project
and list is so big that there is no job posting I can imagine that is
relevant to even a small fraction of readers. This one wasn't even much
about Spark, so it's definitely a no-no.

On Wed, Jan 25, 2017 at 11:26 AM Hyukjin Kwon  wrote:

> Just as a subscriber in this mailing list, I don't want to recieve job
> recruiting emails and even make some efforts to set a filter for it.
>
> I don't know the policy in details but I feel inappropriate to send them
> where, in my experience, Spark users usually ask some questions and discuss
> about Spark itself.
>
> Please let me know if it is legitimate. I will stop complaining and try to
> set a filter at my side.
>
>
> On 25 Jan 2017 5:25 p.m., "marcos rebelo"  wrote:
>
> Hy all,
>
> I’m looking for Scala Developers willing to work on Berlin. We are working
> with Spark, AWS (the latest product are being prototyped StepFunctions,
> Batch Service, and old services Lambda Function, DynamoDB, ...) and
> building Data Products (JSON REST endpoints).
>
> We are responsible to chose the perfect article to the client, collect all
> the data on the site, …  We are responsible to build Recommenders, Site
> Search, … The technology is always changing, so you can learn on the job.
>
> Do you see yourself working as a Data Engineer? Contact me, I promise a
> great team.
>
> Best Regards
>
> Marcos Rebelo
>
>


Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive
job-recruiting emails, or to have to make the effort to set up a filter for
them.

I don't know the policy in detail, but I feel it is inappropriate to send
them here, where, in my experience, Spark users usually ask questions and
discuss Spark itself.

Please let me know if it is legitimate; in that case I will stop
complaining and try to set a filter on my side.


On 25 Jan 2017 5:25 p.m., "marcos rebelo"  wrote:

> Hy all,
>
> I’m looking for Scala Developers willing to work on Berlin. We are working
> with Spark, AWS (the latest product are being prototyped StepFunctions,
> Batch Service, and old services Lambda Function, DynamoDB, ...) and
> building Data Products (JSON REST endpoints).
>
> We are responsible to chose the perfect article to the client, collect all
> the data on the site, …  We are responsible to build Recommenders, Site
> Search, … The technology is always changing, so you can learn on the job.
>
> Do you see yourself working as a Data Engineer? Contact me, I promise a
> great team.
>
> Best Regards
>
> Marcos Rebelo
>


is it possible to read .mdb file in spark

2017-01-25 Thread Selvam Raman
-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Scala Developers

2017-01-25 Thread marcos rebelo
Hi all,

I'm looking for Scala developers willing to work in Berlin. We are working
with Spark and AWS (the latest products being prototyped are StepFunctions
and Batch Service; older services include Lambda Functions, DynamoDB, ...)
and building data products (JSON REST endpoints).

We are responsible for choosing the perfect article for the client,
collecting all the data on the site, … We are responsible for building
recommenders, site search, … The technology is always changing, so you can
learn on the job.

Do you see yourself working as a Data Engineer? Contact me, I promise a
great team.

Best Regards

Marcos Rebelo