Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread Rodrick Brown
Try setting the values in $SPARK_HOME/conf/spark-defaults.conf 

i.e. 

$ egrep 'spark.(driver|executor).extra' 
/data/orchard/spark-2.0.1/conf/spark-defaults.conf
spark.executor.extraJavaOptions -Duser.timezone=UTC 
-Xloggc:garbage-collector.log
spark.driver.extraJavaOptions   -Duser.timezone=UTC 
-Xloggc:garbage-collector.log
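
(A quick sanity check, sketched for the spark-shell and not part of the original reply: confirm the options actually reached the driver and executor JVMs.)

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Driver-side JVM arguments (should include the extraJavaOptions above)
println(ManagementFactory.getRuntimeMXBean.getInputArguments.asScala.mkString(" "))

// Executor-side value of one of the -D properties
sc.parallelize(1 to sc.defaultParallelism)
  .map(_ => System.getProperty("user.timezone"))
  .distinct()
  .collect()
  .foreach(println)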

-- 
 
Rodrick Brown / DevOPs Engineer 
+1 917 445 6839 / rodr...@orchardplatform.com 

Orchard Platform 
101 5th Avenue, 4th Floor, New York, NY 10003 
http://www.orchardplatform.com 
Orchard Blog  | Marketplace Lending 
Meetup 
> On Oct 6, 2016, at 12:20 PM, vonnagy  wrote:
> 
> I am trying to submit a job to spark running in a Mesos cluster. We need to
> pass custom java options to the driver and executor for configuration, but
> the driver task never includes the options. Here is an example submit. 
> 
> GC_OPTS="-XX:+UseConcMarkSweepGC 
> -verbose:gc -XX:+PrintGCTimeStamps -Xloggc:$appdir/gc.out 
> -XX:MaxPermSize=512m 
> -XX:+CMSClassUnloadingEnabled " 
> 
> EXEC_PARAMS="-Dloglevel=DEBUG -Dkafka.broker-address=${KAFKA_ADDRESS}
> -Dredis.master=${REDIS_MASTER} -Dredis.port=${REDIS_PORT} 
> 
> spark-submit \ 
>  --name client-events-intake \ 
>  --class ClientEventsApp \ 
>  --deploy-mode cluster \ 
>  --driver-java-options "${EXEC_PARAMS} ${GC_OPTS}" \ 
>  --conf "spark.ui.killEnabled=true" \ 
>  --conf "spark.mesos.coarse=true" \ 
>  --conf "spark.driver.extraJavaOptions=${EXEC_PARAMS}" \ 
>  --conf "spark.executor.extraJavaOptions=${EXEC_PARAMS}" \ 
>  --master mesos://someip:7077 \ 
>  --verbose \ 
>  some.jar 
> 
> When the driver task runs in Mesos it is creating the following command: 
> 
> sh -c 'cd spark-1*;  bin/spark-submit --name client-events-intake --class
> ClientEventsApp --master mesos://someip:5050 --driver-cores 1.0
> --driver-memory 512M ../some.jar ' 
> 
> There are no options for the driver here, thus the driver app blows up
> because it can't find the java options. However, the environment variables
> contain the executor options: 
> 
> SPARK_EXECUTOR_OPTS -> -Dspark.executor.extraJavaOptions=-Dloglevel=DEBUG
> ... 
> 
> Any help would be great. I know that we can set some "spark.*" settings in
> default configs, but these are not necessarily spark related. This is not an
> issue when running the same logic outside of a Mesos cluster in Spark
> standalone mode. 
> 
> Thanks! 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Submit-job-with-driver-options-in-Mesos-Cluster-mode-tp27853.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




[ANNOUNCE] Apache Bahir 2.0.1

2016-10-27 Thread Luciano Resende
The Apache Bahir PMC is pleased to announce the release of Apache Bahir
2.0.1, which is our first major release and provides the following
extensions for Apache Spark 2.0.1:

Akka Streaming
MQTT Streaming and Structured Streaming
Twitter Streaming
ZeroMQ Streaming

For more information about Apache Bahir and to download the release:

http://bahir.apache.org

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Need help Creating a rule using the Streaming API

2016-10-27 Thread patrickhuang
Hi,
Maybe you can try something like this?
val transformed = events.map(event => ((event.user, event.ip), 1))
  .reduceByKey(_ + _)
val alarm = transformed.filter(_._2 >= 10)

Patrick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-Creating-a-rule-using-the-Streaming-API-tp27954p27974.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread prithish
 
 
The Avro files were 500-600kb in size and that folder contained around 1200
files. The total folder size was around 600mb. Will try repartition. Thank you.

> On Oct 28, 2016 at 2:24 AM, Michael Armbrust (mailto:mich...@databricks.com) wrote:
>
> How big are your avro files? We collapse many small files into a single
> partition to eliminate scheduler overhead. If you need explicit
> parallelism you can also repartition.
>
> On Thu, Oct 27, 2016 at 5:19 AM, Prithish (mailto:prith...@gmail.com) wrote:
>
> > I am trying to read a bunch of AVRO files from a S3 folder using Spark 2.0.
> > No matter how many executors I use or what configuration changes I make,
> > the cluster doesn't seem to use all the executors. I am using the
> > com.databricks.spark.avro library from databricks to read the AVRO.
> >
> > However, if I try the same on CSV files (same S3 folder, same configuration
> > and cluster), it does use all executors.
> >
> > Is there something that I need to do to enable parallelism when using the
> > AVRO databricks library?
> >
> > Thanks for your help.

Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread vonnagy
We were using 1.6, but now we are on 2.0.1. Both versions show the same
issue.

I dove deep into the Spark code and have identified that the extra java
options are /not/ added to the process on the executors. At this point, I
believe you have to use spark-defaults.conf to set any values that will be
used. The problem for us is that these extra Java options are not the same
for each job that is submitted, so we can't put the values in
spark-defaults.conf.
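
(A possible workaround, sketched here as an assumption rather than anything confirmed in this thread: pass the per-job values as spark.*-prefixed configs at submit time and read them from SparkConf on the driver, capturing them in closures where executors need them, instead of relying on -D system properties.)

// Sketch only: the spark.myapp.* keys are illustrative names, not a Spark API.
// Submitted with, e.g.:
//   spark-submit --conf spark.myapp.loglevel=DEBUG --conf spark.myapp.redis.master=host:6379 ... some.jar
val conf = sc.getConf
val loglevel    = conf.get("spark.myapp.loglevel", "INFO")
val redisMaster = conf.get("spark.myapp.redis.master", "localhost:6379")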

Ivan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submit-job-with-driver-options-in-Mesos-Cluster-mode-tp27853p27973.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread csakoda
I'm seeing something very similar in my own Mesos/Spark Cluster.

High level summary: When I use `--deploy-mode cluster`, java properties that
I pass to my driver via `spark.driver.extraJavaOptions` are not available to
the driver.  I've confirmed this by inspecting the output of
`System.getProperties` and the environment variables from within the driver.  

I also see SPARK_EXECUTOR_OPTS showing the values that I wish were available
as java system properties.

Did you find anything that helped you either understand or resolve this
issue?  I'm still stuck.

What version of spark + mesos are you using?  1.6.1 and
1.0.1-2.0.93.ubuntu1404 respectively over here.

The only helpful bit I've found is that if I set the properties I care about
in spark-defaults.conf on all the spark workers, they appear as desired in
the driver's java properties.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submit-job-with-driver-options-in-Mesos-Cluster-mode-tp27853p27972.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark 2.0 with Hadoop 3.0?

2016-10-27 Thread adam kramer
Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases?

Is there any reason why Hadoop 3.0 is a non-starter for use with Spark
2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which
would resolve our driver dependency issues.

Thanks,
Adam




Re: Spark UI error spark 2.0.1 hadoop 2.6

2016-10-27 Thread gpatcham
I was able to fix it by adding servlet 3.0 to the classpath.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-error-spark-2-0-1-hadoop-2-6-tp27970p27971.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark Streaming and Kinesis

2016-10-27 Thread Benjamin Kim
Has anyone worked with AWS Kinesis and retrieved data from it using Spark 
Streaming? I am having issues where it’s returning no data. I can connect to 
the Kinesis stream and describe using Spark. Is there something I’m missing? 
Are there specific IAM security settings needed? I just simply followed the 
Word Count ASL example. When it didn’t work, I even tried to run the code 
independently in Spark shell in yarn-client mode by hardcoding the arguments. 
Still, there was no data even with the setting InitialPositionInStream.LATEST 
changed to InitialPositionInStream.TRIM_HORIZON.
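
(For reference, a minimal sketch of the receiver setup in the spirit of the Word Count ASL example; the app name, stream name, endpoint and region are illustrative, and the spark-streaming-kinesis-asl module must be on the classpath. One thing worth checking on the IAM side: the receiver's Kinesis Client Library also checkpoints shard positions in a DynamoDB table, so it needs DynamoDB permissions in addition to Kinesis read access.)

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sc, Seconds(10))
val records = KinesisUtils.createStream(
  ssc, "word-count-asl-app", "my-kinesis-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)
records.map(bytes => new String(bytes)).print()   // records arrive as Array[Byte]
ssc.start()
ssc.awaitTermination()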

If anyone can help, I would truly appreciate it.

Thanks,
Ben



Re: importing org.apache.spark.Logging class

2016-10-27 Thread Michael Armbrust
This was made internal to Spark.  I'd suggest that you use slf4j directly.
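
(For anyone hitting the same thing, a minimal sketch of using slf4j directly in place of the removed trait; MyJob is an illustrative name.)

import org.slf4j.{Logger, LoggerFactory}

class MyJob {
  // replaces the log/logInfo members the old org.apache.spark.Logging trait provided;
  // @transient lazy so the logger is re-created after deserialization rather than shipped
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  def run(): Unit = log.info("job started")
}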

On Thu, Oct 27, 2016 at 2:42 PM, Reth RM  wrote:

> Updated spark to version 2.0.0 and have issue with importing
> org.apache.spark.Logging
>
> Any suggested fix for this issue?
>


importing org.apache.spark.Logging class

2016-10-27 Thread Reth RM
Updated spark to version 2.0.0 and have issue with importing
org.apache.spark.Logging

Any suggested fix for this issue?


Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Davies Liu
Usually a broadcast join boosts performance when you have enough memory;
you should decrease the threshold, or even disable broadcast joins, when there
is not enough memory.
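
(For reference, the threshold can also be set per session; the value is in bytes, the default is 10485760, and -1 disables broadcast joins entirely.)

// Spark 2.x
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Spark 1.x equivalent: sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
// or at submit time: --conf spark.sql.autoBroadcastJoinThreshold=-1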

On Thu, Oct 27, 2016 at 1:22 PM, Pietro Pugni  wrote:
> Thank you Davies,
> this worked! But what are the consequences of setting 
> spark.sql.autoBroadcastJoinThreshold=0?
> Will it degrade or boost performance?
> Thank you again
>  Pietro
>
>> Il giorno 27 ott 2016, alle ore 18:54, Davies Liu  ha 
>> scritto:
>>
>> I think this is caused by BroadcastHashJoin try to use more memory
>> than the amount driver have, could you decrease the
>> spark.sql.autoBroadcastJoinThreshold  (-1 or 0  means disable it)?
>>
>> On Thu, Oct 27, 2016 at 9:19 AM, Pietro Pugni  wrote:
>>> I’m sorry, here’s the formatted message text:
>>>
>>>
>>>
>>> I'm running an ETL process that joins table1 with other tables (CSV files),
>>> one table at time (for example table1 with table2, table1 with table3, and
>>> so on). The join is written inside a PostgreSQL istance using JDBC.
>>>
>>> The entire process runs successfully if I use table2, table3 and table4. If
>>> I add table5, table6, table7, the process run successfully with table5,
>>> table6 and table7 but as soon as it reaches table2 it starts displaying a
>>> lot of messagges like this:
>>>
>>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>>> (33554432 bytes), try again.
>>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>>> (33554432 bytes), try again.
>>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>>> (33554432 bytes), try again.
>>> ...
>>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>>> (33554432 bytes), try again.
>>> ...
>>> Traceback (most recent call last):
>>>  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 1200, in
>>> 
>>>
>>>sparkdf2database(flusso['sparkdf'], schema + "." + postgresql_tabella,
>>> "append")
>>>  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in
>>> sparkdf2database
>>>properties={"ApplicationName":info["nome"] + " - Scrittura della tabella
>>> " + dest, "disableColumnSanitiser":"true", "reWriteBatchedInserts":"true"}
>>>  File
>>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>> line 762, in jdbc
>>>  File
>>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
>>> line 1133, in __call__
>>>  File
>>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>>> line 63, in deco
>>>  File
>>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>>> line 319, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o301.jdbc.
>>> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>>>at
>>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>>>at
>>> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>>>at
>>> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>>>at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>>>at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>>>at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>>>at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>at
>>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>>>at
>>> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>>>at
>>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>>>at
>>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
>>>at
>>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
>>>at
>>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>>>at
>>> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
>>>at
>>> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
>>>at
>>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>>>at
>>> org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
>>>at
>>> org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
>>>at
>>> 

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread Michael Armbrust
How big are your avro files?  We collapse many small files into a single
partition to eliminate scheduler overhead.  If you need explicit
parallelism you can also repartition.
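
(For reference, a minimal sketch of the repartition suggestion with the databricks reader; the path and partition count are illustrative.)

val df = spark.read
  .format("com.databricks.spark.avro")
  .load("s3a://some-bucket/avro-folder/")
  .repartition(200)   // spread the collapsed input across more tasks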

On Thu, Oct 27, 2016 at 5:19 AM, Prithish  wrote:

> I am trying to read a bunch of AVRO files from a S3 folder using Spark
> 2.0. No matter how many executors I use or what configuration changes I
> make, the cluster doesn't seem to use all the executors. I am using the
> com.databricks.spark.avro library from databricks to read the AVRO.
>
> However, if I try the same on CSV files (same S3 folder, same
> configuration and cluster), it does use all executors.
>
> Is there something that I need to do to enable parallelism when using the
> AVRO databricks library?
>
> Thanks for your help.
>
>
>


Spark UI error spark 2.0.1 hadoop 2.6

2016-10-27 Thread gpatcham
Hi,

I'm running spark-shell in yarn client mode; the SparkContext started and I am
able to run commands.

But the UI is not coming up and I see the errors below in the spark shell:

20:51:20 WARN servlet.ServletHandler: 
javax.servlet.ServletException: Could not determine the proxy server for
redirection
at
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:183)
at
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:139)
at
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
at
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Server.java:499)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:724)
16/10/27 20:51:20 WARN server.HttpChannel: /
java.lang.NoSuchMethodError:
javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
at
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:484)
at
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.spark_project.jetty.server.Server.handle(Server.java:499)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:724)
16/10/27 20:51:20 WARN thread.QueuedThreadPool: 
java.lang.NoSuchMethodError:
javax.servlet.http.HttpServletResponse.getStatus()I
at
org.spark_project.jetty.server.handler.ErrorHandler.handle(ErrorHandler.java:112)
at org.spark_project.jetty.server.Response.sendError(Response.java:597)
at
org.spark_project.jetty.server.HttpChannel.handleException(HttpChannel.java:487)
at
org.spark_project.jetty.server.HttpConnection$HttpChannelOverHttp.handleException(HttpConnection.java:594)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:387)
at
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:724)
16/10/27 20:51:20 WARN thread.QueuedThreadPool: Unexpected thread death:
org.spark_project.jetty.util.thread.QueuedThreadPool$3@10d268b in
SparkUI{STARTED,8<=8<=200,i=2,q=0}


Let me know if I'm missing something.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-UI-error-spark-2-0-1-hadoop-2-6-tp27970.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
Thank you Davies,
this worked! But what are the consequences of setting 
spark.sql.autoBroadcastJoinThreshold=0?
Will it degrade or boost performance?
Thank you again
 Pietro

> Il giorno 27 ott 2016, alle ore 18:54, Davies Liu  ha 
> scritto:
> 
> I think this is caused by BroadcastHashJoin try to use more memory
> than the amount driver have, could you decrease the
> spark.sql.autoBroadcastJoinThreshold  (-1 or 0  means disable it)?
> 
> On Thu, Oct 27, 2016 at 9:19 AM, Pietro Pugni  wrote:
>> I’m sorry, here’s the formatted message text:
>> 
>> 
>> 
>> I'm running an ETL process that joins table1 with other tables (CSV files),
>> one table at time (for example table1 with table2, table1 with table3, and
>> so on). The join is written inside a PostgreSQL istance using JDBC.
>> 
>> The entire process runs successfully if I use table2, table3 and table4. If
>> I add table5, table6, table7, the process run successfully with table5,
>> table6 and table7 but as soon as it reaches table2 it starts displaying a
>> lot of messagges like this:
>> 
>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>> (33554432 bytes), try again.
>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>> (33554432 bytes), try again.
>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>> (33554432 bytes), try again.
>> ...
>> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
>> (33554432 bytes), try again.
>> ...
>> Traceback (most recent call last):
>>  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 1200, in
>> 
>> 
>>sparkdf2database(flusso['sparkdf'], schema + "." + postgresql_tabella,
>> "append")
>>  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in
>> sparkdf2database
>>properties={"ApplicationName":info["nome"] + " - Scrittura della tabella
>> " + dest, "disableColumnSanitiser":"true", "reWriteBatchedInserts":"true"}
>>  File
>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>> line 762, in jdbc
>>  File
>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
>> line 1133, in __call__
>>  File
>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>> line 63, in deco
>>  File
>> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>> line 319, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling o301.jdbc.
>> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>>at
>> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>>at
>> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>>at
>> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>>at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>>at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>>at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>>at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>at
>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>>at
>> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>>at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>>at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
>>at
>> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
>>at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>>at
>> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
>>at
>> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
>>at
>> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>>at
>> org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
>>at
>> org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
>>at
>> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>>at
>> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>>at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>>at
>> 

Re: Infinite Loop in Spark

2016-10-27 Thread Mark Hamstra
Using a single SparkContext for an extended period of time is how
long-running Spark Applications such as the Spark Job Server work (
https://github.com/spark-jobserver/spark-jobserver).  It's an established
pattern.

On Thu, Oct 27, 2016 at 11:46 AM, Gervásio Santos  wrote:

> Hi guys!
>
> I'm developing an application in Spark that I'd like to run continuously.
> It would execute some actions, sleep for a while and go again. I was
> thinking of doing it in a standard infinite loop way.
>
> val sc = 
> while (true) {
>   doStuff(...)
>   sleep(...)
> }
>
> I would be running this (fairly light weight) application on a cluster,
> that would also run other (significantly heavier) jobs. However, I fear
> that this kind of code might lead to unexpected beahavior; I don't know if
> keeping the same SparkContext active continuously for a very long time
> might lead to some weird stuff happening.
>
> Can anyone tell me if there is some problem with not "renewing" the Spark
> context or is aware of any problmes with this approach that I might be
> missing?
>
> Thanks!
>


Infinite Loop in Spark

2016-10-27 Thread Gervásio Santos
Hi guys!

I'm developing an application in Spark that I'd like to run continuously.
It would execute some actions, sleep for a while and go again. I was
thinking of doing it in a standard infinite loop way.

val sc = 
while (true) {
  doStuff(...)
  sleep(...)
}

I would be running this (fairly light weight) application on a cluster,
that would also run other (significantly heavier) jobs. However, I fear
that this kind of code might lead to unexpected behavior; I don't know if
keeping the same SparkContext active continuously for a very long time
might lead to some weird stuff happening.

Can anyone tell me if there is some problem with not "renewing" the Spark
context, or is aware of any problems with this approach that I might be
missing?

Thanks!


Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
i can see how unquoted csv would work if you escape delimiters, but i have
never seen that in practice.

On Thu, Oct 27, 2016 at 2:03 PM, Jain, Nishit 
wrote:

> I’d think quoting is only necessary if you are not escaping delimiters in
> data. But we can only share our opinions. It would be good to see something
> documented.
> This may be the cause of the issue?: https://issues.apache.
> org/jira/browse/CSV-135
>
> From: Koert Kuipers 
> Date: Thursday, October 27, 2016 at 12:49 PM
>
> To: "Jain, Nishit" 
> Cc: "user@spark.apache.org" 
> Subject: Re: CSV escaping not working
>
> well my expectation would be that if you have delimiters in your data you
> need to quote your values. if you now have quotes without your data you
> need to escape them.
>
> so escaping is only necessary if quoted.
>
> On Thu, Oct 27, 2016 at 1:45 PM, Jain, Nishit 
> wrote:
>
>> Do you mind sharing why should escaping not work without quotes?
>>
>> From: Koert Kuipers 
>> Date: Thursday, October 27, 2016 at 12:40 PM
>> To: "Jain, Nishit" 
>> Cc: "user@spark.apache.org" 
>> Subject: Re: CSV escaping not working
>>
>> that is what i would expect: escaping only works if quoted
>>
>> On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
>> wrote:
>>
>>> Interesting finding: Escaping works if data is quoted but not otherwise.
>>>
>>> From: "Jain, Nishit" 
>>> Date: Thursday, October 27, 2016 at 10:54 AM
>>> To: "user@spark.apache.org" 
>>> Subject: CSV escaping not working
>>>
>>> I am using spark-core version 2.0.1 with Scala 2.11. I have simple code
>>> to read a csv file which has \ escapes.
>>>
>>> val myDA = spark.read
>>>   .option("quote",null)
>>> .schema(mySchema)
>>> .csv(filePath)
>>>
>>> As per documentation \ is default escape for csv reader. But it does not
>>> work. Spark is reading \ as part of my data. For Ex: City column in csv
>>> file is *north rocks\,au* . I am expecting city column should read in
>>> code as *northrocks,au*. But instead spark reads it as *northrocks\* and
>>> moves *au* to next column.
>>>
>>> I have tried following but did not work:
>>>
>>>- Explicitly defined escape .option("escape",”\\")
>>>- Changed escape to | or : in file and in code
>>>- I have tried using spark-csv library
>>>
>>> Any one facing same issue? Am I missing something?
>>>
>>> Thanks
>>>
>>
>>
>


large scheduler delay in OnlineLDAOptimizer, (MLlib and LDA)

2016-10-27 Thread Xiaoye Sun
Hi,

I am running some experiments with OnlineLDAOptimizer in Spark 1.6.1. My
Spark cluster has 30 machines.

However, I found that the Scheduler delay at job/stage "reduce at
LDAOptimizer.scala:452" is extremely large when the LDA model is large. The
delay could be tens of seconds.

Does anyone know the reason for that?

Best,
Xiaoye


Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
I’d think quoting is only necessary if you are not escaping delimiters in data. 
But we can only share our opinions. It would be good to see something 
documented.
This may be the cause of the issue?: 
https://issues.apache.org/jira/browse/CSV-135

From: Koert Kuipers >
Date: Thursday, October 27, 2016 at 12:49 PM
To: "Jain, Nishit" >
Cc: "user@spark.apache.org" 
>
Subject: Re: CSV escaping not working

well my expectation would be that if you have delimiters in your data you need 
to quote your values. if you now have quotes without your data you need to 
escape them.

so escaping is only necessary if quoted.

On Thu, Oct 27, 2016 at 1:45 PM, Jain, Nishit 
> wrote:
Do you mind sharing why should escaping not work without quotes?

From: Koert Kuipers >
Date: Thursday, October 27, 2016 at 12:40 PM
To: "Jain, Nishit" >
Cc: "user@spark.apache.org" 
>
Subject: Re: CSV escaping not working

that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
> wrote:
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" >
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org" 
>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote",null)
.schema(mySchema)
.csv(filePath)


As per documentation \ is default escape for csv reader. But it does not work. 
Spark is reading \ as part of my data. For Ex: City column in csv file is north 
rocks\,au . I am expecting city column should read in code as northrocks,au. 
But instead spark reads it as northrocks\ and moves au to next column.

I have tried following but did not work:

  *   Explicitly defined escape .option("escape",”\\")
  *   Changed escape to | or : in file and in code
  *   I have tried using spark-csv library

Any one facing same issue? Am I missing something?

Thanks




Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
well my expectation would be that if you have delimiters in your data you
need to quote your values. if you now have quotes within your data you
need to escape them.

so escaping is only necessary if quoted.
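
(to illustrate, a minimal sketch with quoting left enabled so the escape is honoured inside quoted values; the quote and escape values shown are just the csv reader defaults, and mySchema/filePath come from the earlier snippet)

val df = spark.read
  .option("quote", "\"")    // default: values containing the delimiter are expected to be quoted
  .option("escape", "\\")   // default: backslash escapes quotes inside a quoted value
  .schema(mySchema)
  .csv(filePath)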

On Thu, Oct 27, 2016 at 1:45 PM, Jain, Nishit 
wrote:

> Do you mind sharing why should escaping not work without quotes?
>
> From: Koert Kuipers 
> Date: Thursday, October 27, 2016 at 12:40 PM
> To: "Jain, Nishit" 
> Cc: "user@spark.apache.org" 
> Subject: Re: CSV escaping not working
>
> that is what i would expect: escaping only works if quoted
>
> On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
> wrote:
>
>> Interesting finding: Escaping works if data is quoted but not otherwise.
>>
>> From: "Jain, Nishit" 
>> Date: Thursday, October 27, 2016 at 10:54 AM
>> To: "user@spark.apache.org" 
>> Subject: CSV escaping not working
>>
>> I am using spark-core version 2.0.1 with Scala 2.11. I have simple code
>> to read a csv file which has \ escapes.
>>
>> val myDA = spark.read
>>   .option("quote",null)
>> .schema(mySchema)
>> .csv(filePath)
>>
>> As per documentation \ is default escape for csv reader. But it does not
>> work. Spark is reading \ as part of my data. For Ex: City column in csv
>> file is *north rocks\,au* . I am expecting city column should read in
>> code as *northrocks,au*. But instead spark reads it as *northrocks\* and
>> moves *au* to next column.
>>
>> I have tried following but did not work:
>>
>>- Explicitly defined escape .option("escape",”\\")
>>- Changed escape to | or : in file and in code
>>- I have tried using spark-csv library
>>
>> Any one facing same issue? Am I missing something?
>>
>> Thanks
>>
>
>


Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Do you mind sharing why escaping should not work without quotes?

From: Koert Kuipers >
Date: Thursday, October 27, 2016 at 12:40 PM
To: "Jain, Nishit" >
Cc: "user@spark.apache.org" 
>
Subject: Re: CSV escaping not working

that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
> wrote:
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" >
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org" 
>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote",null)
.schema(mySchema)
.csv(filePath)


As per documentation \ is default escape for csv reader. But it does not work. 
Spark is reading \ as part of my data. For Ex: City column in csv file is north 
rocks\,au . I am expecting city column should read in code as northrocks,au. 
But instead spark reads it as northrocks\ and moves au to next column.

I have tried following but did not work:

  *   Explicitly defined escape .option("escape",”\\")
  *   Changed escape to | or : in file and in code
  *   I have tried using spark-csv library

Any one facing same issue? Am I missing something?

Thanks



Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit 
wrote:

> Interesting finding: Escaping works if data is quoted but not otherwise.
>
> From: "Jain, Nishit" 
> Date: Thursday, October 27, 2016 at 10:54 AM
> To: "user@spark.apache.org" 
> Subject: CSV escaping not working
>
> I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to
> read a csv file which has \ escapes.
>
> val myDA = spark.read
>   .option("quote",null)
> .schema(mySchema)
> .csv(filePath)
>
> As per documentation \ is default escape for csv reader. But it does not
> work. Spark is reading \ as part of my data. For Ex: City column in csv
> file is *north rocks\,au* . I am expecting city column should read in
> code as *northrocks,au*. But instead spark reads it as *northrocks\* and
> moves *au* to next column.
>
> I have tried following but did not work:
>
>- Explicitly defined escape .option("escape",”\\")
>- Changed escape to | or : in file and in code
>- I have tried using spark-csv library
>
> Any one facing same issue? Am I missing something?
>
> Thanks
>


Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Interesting finding: Escaping works if data is quoted but not otherwise.

From: "Jain, Nishit" >
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org" 
>
Subject: CSV escaping not working


I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote",null)
.schema(mySchema)
.csv(filePath)


As per documentation \ is default escape for csv reader. But it does not work. 
Spark is reading \ as part of my data. For Ex: City column in csv file is north 
rocks\,au . I am expecting city column should read in code as northrocks,au. 
But instead spark reads it as northrocks\ and moves au to next column.

I have tried following but did not work:

  *   Explicitly defined escape .option("escape",”\\")
  *   Changed escape to | or : in file and in code
  *   I have tried using spark-csv library

Any one facing same issue? Am I missing something?

Thanks


If you have used spark-sas7bdat package to transform SAS data set to Spark, please be aware

2016-10-27 Thread Shi Yu
I found some significant issues and wrote them up on my blog:

https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/


Re: Using Hive UDTF in SparkSQL

2016-10-27 Thread Davies Liu
Could you file a JIRA for this bug?

On Thu, Oct 27, 2016 at 3:05 AM, Lokesh Yadav
 wrote:
> Hello
>
> I am trying to use a Hive UDTF function in spark SQL. But somehow its not
> working for me as intended and I am not able to understand the behavior.
>
> When I try to register a function like this:
> create temporary function SampleUDTF_01 as
> 'com.fl.experiments.sparkHive.SampleUDTF' using JAR
> 'hdfs:///user/root/sparkHive-1.0.0.jar';
> It successfully registers the function, but gives me a 'not a registered
> function' error when I try to run that function. Also it doesn't show up in
> the list when I do a 'show functions'.
>
> Another case:
> When I try to register the same function as a temporary function using a
> local jar (the hdfs path doesn't work with temporary function, that is weird
> too), it registers, and I am able to successfully run that function as well.
> Another weird thing is that I am not able to drop that function using the
> 'drop function ...' statement. This the functions shows up in the function
> registry.
>
> I am stuck with this, any help would be really appreciated.
> Thanks
>
> Regards,
> Lokesh Yadav




Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Davies Liu
I think this is caused by BroadcastHashJoin trying to use more memory
than the driver has. Could you decrease
spark.sql.autoBroadcastJoinThreshold (-1 or 0 means disabling it)?

On Thu, Oct 27, 2016 at 9:19 AM, Pietro Pugni  wrote:
> I’m sorry, here’s the formatted message text:
>
>
>
> I'm running an ETL process that joins table1 with other tables (CSV files),
> one table at time (for example table1 with table2, table1 with table3, and
> so on). The join is written inside a PostgreSQL istance using JDBC.
>
> The entire process runs successfully if I use table2, table3 and table4. If
> I add table5, table6, table7, the process run successfully with table5,
> table6 and table7 but as soon as it reaches table2 it starts displaying a
> lot of messagges like this:
>
> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
> (33554432 bytes), try again.
> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
> (33554432 bytes), try again.
> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
> (33554432 bytes), try again.
> ...
> 16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
> (33554432 bytes), try again.
> ...
> Traceback (most recent call last):
>   File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 1200, in
> 
>
> sparkdf2database(flusso['sparkdf'], schema + "." + postgresql_tabella,
> "append")
>   File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in
> sparkdf2database
> properties={"ApplicationName":info["nome"] + " - Scrittura della tabella
> " + dest, "disableColumnSanitiser":"true", "reWriteBatchedInserts":"true"}
>   File
> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 762, in jdbc
>   File
> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
> line 1133, in __call__
>   File
> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 63, in deco
>   File
> "/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o301.jdbc.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
> at
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
> at
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
> at
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
> at
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at
> org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
> at
> org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at
> org.apache.spark.sql.execution.RowDataSourceScanExec.produce(ExistingRDD.scala:150)
> at
> 

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
I’m sorry, here’s the formatted message text:



I'm running an ETL process that joins table1 with other tables (CSV files), one
table at a time (for example table1 with table2, table1 with table3, and so on).
The join is written into a PostgreSQL instance using JDBC.

The entire process runs successfully if I use table2, table3 and table4. If I
add table5, table6 and table7, the process runs successfully with table5, table6
and table7, but as soon as it reaches table2 it starts displaying a lot of
messages like this:

16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 
bytes), try again. 
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 
bytes), try again. 
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 
bytes), try again. 
... 
16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page (33554432 
bytes), try again. 
... 
Traceback (most recent call last): 
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 1200, in 
sparkdf2database(flusso['sparkdf'], schema + "." + postgresql_tabella, 
"append") 
  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in 
sparkdf2database 
properties={"ApplicationName":info["nome"] + " - Scrittura della tabella " 
+ dest, "disableColumnSanitiser":"true", "reWriteBatchedInserts":"true"} 
  File 
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
 line 762, in jdbc 
  File 
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
 line 1133, in __call__ 
  File 
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco 
  File 
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o301.jdbc. 
: org.apache.spark.SparkException: Exception thrown in awaitResult: 
at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) 
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
 
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) 
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
 
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
 
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
 
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
 
at 
org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
 
at 
org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
 
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
 
at 
org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
 
at 
org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
 
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
 
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
 
at 
org.apache.spark.sql.execution.RowDataSourceScanExec.produce(ExistingRDD.scala:150)
 
at 
org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
 
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
 
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
 
at 

TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread pietrop
I'm running an ETL process that joins table1 with other tables (CSV files),
one table at time (for example table1 with table2, table1 with table3, and
so on). The join is written inside a PostgreSQL istance using JDBC.The
entire process runs successfully if I use table2, table3 and table4. If I
add table5, table6, table7, the process run successfully with table5, table6
and table7 but as soon as it reaches table2 it starts displaying a lot of
messagges like this:/16/10/27 17:33:47 WARN TaskMemoryManager: Failed to
allocate a page (33554432 bytes), try again.16/10/27 17:33:47 WARN
TaskMemoryManager: Failed to allocate a page (33554432 bytes), try
again.16/10/27 17:33:47 WARN TaskMemoryManager: Failed to allocate a page
(33554432 bytes), try again16/10/27 17:33:47 WARN TaskMemoryManager:
Failed to allocate a page (33554432 bytes), try againTraceback (most
recent call last):  File "/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py",
line 1200, in sparkdf2database(flusso['sparkdf'], schema + "." +
postgresql_tabella, "append")  File
"/Volumes/Data/www/beaver/tmp/ETL_Spark/etl.py", line 144, in
sparkdf2databaseproperties={"ApplicationName":info["nome"] + " -
Scrittura della tabella " + dest, "disableColumnSanitiser":"true",
"reWriteBatchedInserts":"true"}  File
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 762, in jdbc  File
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
line 1133, in __call__  File
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco  File
"/Volumes/Data/www/beaver/tmp/ETL_Spark/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
line 319, in get_return_valuepy4j.protocol.Py4JJavaError: An error occurred
while calling o301.jdbc.: org.apache.spark.SparkException: Exception thrown
in awaitResult: at
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)   at
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
at
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenSemi(BroadcastHashJoinExec.scala:318)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:84)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at
org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
at
org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at
org.apache.spark.sql.execution.RowDataSourceScanExec.consume(ExistingRDD.scala:150)
at
org.apache.spark.sql.execution.RowDataSourceScanExec.doProduce(ExistingRDD.scala:217)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at
org.apache.spark.sql.execution.RowDataSourceScanExec.produce(ExistingRDD.scala:150)
at
org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at

CSV escaping not working

2016-10-27 Thread Jain, Nishit
I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read 
a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)


As per documentation \ is default escape for csv reader. But it does not work. 
Spark is reading \ as part of my data. For Ex: City column in csv file is north 
rocks\,au . I am expecting city column should read in code as northrocks,au. 
But instead spark reads it as northrocks\ and moves au to next column.

I have tried following but did not work:

  *   Explicitly defined escape .option("escape", "\\")
  *   Changed escape to | or : in file and in code
  *   I have tried using spark-csv library

Anyone facing the same issue? Am I missing something?

Thanks


Re: Executor shutdown hook and initialization

2016-10-27 Thread Chawla,Sumit
Hi Sean

Could you please elaborate on how this can be done on a per-partition
basis?
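
(For reference, a minimal sketch of the per-partition lifecycle Sean describes; Client and rdd are illustrative stand-ins, not an existing API.)

class Client { def send(s: String): Unit = (); def close(): Unit = () }  // stand-in for the real connection type

rdd.foreachPartition { records =>
  val client = new Client()        // initialize once per partition, on the executor
  try {
    records.foreach(client.send)   // reuse it for every record in this partition
  } finally {
    client.close()                 // tear it down when the partition is done
  }
}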

Regards
Sumit Chawla


On Thu, Oct 27, 2016 at 7:44 AM, Walter rakoff 
wrote:

> Thanks for the info Sean.
>
> I'm initializing them in a singleton but Scala objects are evaluated
> lazily.
> So it gets initialized only when the first task is run(and makes use of
> the object).
> Plan is to start a background thread in the object that does periodic
> cache refresh too.
> I'm trying to see if this init can be done right when executor is created.
>
> Btw, this is for a Spark streaming app. So doing this per partition during
> each batch isn't ideal.
> I'd like to keep them(connect & cache) across batches.
>
> Finally, how do I setup the shutdown hook on an executor? Except for
> operations on RDD everything else is executed in the driver.
> All I can think of is something like this
> sc.makeRDD((1 until sc.defaultParallelism), sc.defaultParallelism)
>.foreachPartition(sys.ShutdownHookThread { Singleton.DoCleanup() }
> )
>
> Walt
>
> On Thu, Oct 27, 2016 at 3:05 AM, Sean Owen  wrote:
>
>> Init is easy -- initialize them in your singleton.
>> Shutdown is harder; a shutdown hook is probably the only reliable way to
>> go.
>> Global state is not ideal in Spark. Consider initializing things like
>> connections per partition, and open/close them with the lifecycle of a
>> computation on a partition instead.
>>
>> On Wed, Oct 26, 2016 at 9:27 PM Walter rakoff 
>> wrote:
>>
>>> Hello,
>>>
>>> Is there a way I can add an init() call when an executor is created? I'd
>>> like to initialize a few connections that are part of my singleton object.
>>> Preferably this happens before it runs the first task
>>> On the same line, how can I provide an shutdown hook that cleans up
>>> these connections on termination.
>>>
>>> Thanks
>>> Walt
>>>
>>
>


Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-27 Thread Kazuaki Ishizaki
Hi Chin Wei,
Thank you for confirming this on 2.0.1; I am happy to hear it never
happens.

The performance will be improved when this PR (
https://github.com/apache/spark/pull/15219) is integrated.

Regards,
Kazuaki Ishizaki



From:   Chin Wei Low 
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: user 
Date:   2016/10/25 17:33
Subject:Re: Spark SQL is slower when DataFrame is cache in Memory



Hi Kazuaki,

I print a debug log right before I call collect, and use that to 
compare against the job start log (it is available when turning on debug 
logging).
Anyway, I tested that in Spark 2.0.1 and never saw it happen. But the query 
on the cached dataframe is still slightly slower than the one without caching 
when it is running on Spark 2.0.1.
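A minimal sketch of that measurement (the first print is what gets compared against the "Starting job: collect" line in the driver log):

val t0 = System.currentTimeMillis()
println(s"calling collect at $t0")
val rows = res.collect()
println(s"collect took ${System.currentTimeMillis() - t0} ms end to end")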

Regards,
Low Chin Wei

On Tue, Oct 25, 2016 at 3:39 AM, Kazuaki Ishizaki  
wrote:
Hi Chin Wei,
I am sorry for being late to reply.

Got it. Interesting behavior. How did you measure the time between 1st and 
2nd events?

Best Regards,
Kazuaki Ishizaki



From:Chin Wei Low 
To:Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc:user@spark.apache.org
Date:2016/10/10 11:33

Subject:Re: Spark SQL is slower when DataFrame is cache in Memory



Hi Ishizaki san,

Thanks for the reply.

So, when I pre-cache the dataframe, the cache is being used during the job 
execution.

Actually there are 3 events:
1. call res.collect
2. job started
3. job completed

I am concerned about the longer time taken between the 1st and 2nd events. It 
seems like query planning and optimization take longer when querying the 
cached dataframe.


Regards,
Chin Wei

On Fri, Oct 7, 2016 at 10:14 PM, Kazuaki Ishizaki  
wrote:
Hi Chin Wei,
Yes, since you force the cache to be created by executing df.count, Spark starts 
to get data from the cache for the following task:
val res = sqlContext.sql("table1 union table2 union table3")
res.collect()

If you insert 'res.explain', you can confirm which source is used to get the 
data, the cache or parquet:
val res = sqlContext.sql("table1 union table2 union table3")
res.explain(true)
res.collect()
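As a rough guide (an assumption about typical plan output; operator names vary by Spark version):

// reading from the cache usually appears as an InMemoryTableScan /
// InMemoryColumnarTableScan over an InMemoryRelation
// reading the files directly usually appears as a parquet scan
// (e.g. "Scan ParquetRelation" / "FileScan parquet")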

Am I misunderstanding something?

Best Regards,
Kazuaki Ishizaki



From:Chin Wei Low 
To:Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc:user@spark.apache.org
Date:2016/10/07 20:06
Subject:Re: Spark SQL is slower when DataFrame is cache in Memory




Hi Ishizaki san,

So there is a gap between res.collect
and when I see this log:   spark.SparkContext: Starting job: collect at 
:26

Do you mean that during this time Spark already starts to get data from the 
cache? Shouldn't it only get the data after the job is started and 
tasks are distributed?

Regards,
Chin Wei


On Fri, Oct 7, 2016 at 3:43 PM, Kazuaki Ishizaki  
wrote:
Hi,
I think that the result looks correct. The current Spark spends extra time 
getting data from a cache. There are two reasons. One is the complicated 
path taken to get the data. The other is decompression in the case 
of primitive types.
The new implementation (https://github.com/apache/spark/pull/15219) is 
ready for review. It would achieve a 1.2x performance improvement for a 
compressed column and a much larger improvement for an uncompressed 
column.

Best Regards,
Kazuaki Ishizaki



From:Chin Wei Low 
To:user@spark.apache.org
Date:2016/10/07 13:05
Subject:Spark SQL is slower when DataFrame is cache in Memory




Hi,

I am using Spark 1.6.0. I have a Spark application that creates and caches 
(in memory) DataFrames (around 50+, some on a single parquet file and 
some on a folder with a few parquet files) with the following code:

val df = sqlContext.read.parquet
df.persist
df.count

I union them into 3 DataFrames and register those as temp tables.

Then, run the following codes:
val res = sqlContext.sql("table1 union table2 union table3")
res.collect()

The res.collect() is slower when I cache the DataFrames compared to without 
caching, e.g. 3 seconds vs 1 second.

I turned on the DEBUG log and see there is a gap from the res.collect() call to 
the start of the Spark job.

Is the extra time taken by query planning & optimization? The gap does not 
show up when I do not cache the dataframes.

Anything I am missing here?

Regards,
Chin Wei










Re: Executor shutdown hook and initialization

2016-10-27 Thread Walter rakoff
Thanks for the info Sean.

I'm initializing them in a singleton, but Scala objects are evaluated lazily,
so it gets initialized only when the first task is run (and makes use of the
object).
The plan is to start a background thread in the object that does a periodic cache
refresh too.
I'm trying to see if this init can be done right when the executor is created.

Btw, this is for a Spark Streaming app, so doing this per partition during
each batch isn't ideal.
I'd like to keep them (connections & cache) across batches.

Finally, how do I set up the shutdown hook on an executor? Except for
operations on RDDs, everything else is executed in the driver.
All I can think of is something like this:
sc.makeRDD((1 until sc.defaultParallelism), sc.defaultParallelism)
   .foreachPartition(sys.ShutdownHookThread { Singleton.DoCleanup() } )

Walt

On Thu, Oct 27, 2016 at 3:05 AM, Sean Owen  wrote:

> Init is easy -- initialize them in your singleton.
> Shutdown is harder; a shutdown hook is probably the only reliable way to
> go.
> Global state is not ideal in Spark. Consider initializing things like
> connections per partition, and open/close them with the lifecycle of a
> computation on a partition instead.
>
> On Wed, Oct 26, 2016 at 9:27 PM Walter rakoff 
> wrote:
>
>> Hello,
>>
>> Is there a way I can add an init() call when an executor is created? I'd
>> like to initialize a few connections that are part of my singleton object.
>> Preferably this happens before it runs the first task
>> On the same line, how can I provide an shutdown hook that cleans up these
>> connections on termination.
>>
>> Thanks
>> Walt
>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi
Just point all users at the same app with a common Spark context.
For instance, akka-http receives queries from users and launches concurrent
Spark SQL queries in different actor threads. The only prerequisite is to
launch the different jobs in different threads (like with actors).
Be careful: it is not safe if one of the jobs modifies the dataset; it's OK for
read-only use.
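A minimal sketch of that pattern (assuming a shared SparkSession named spark with spark.scheduler.mode=FAIR; the pool name, thread-pool size and query are placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

def runUserQuery(sql: String, pool: String) = Future {
  // each query runs in its own thread; the FAIR scheduler pool is set per thread
  spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
  spark.sql(sql).collect()
}

runUserQuery("SELECT count(*) FROM some_table", "userA")   // placeholder query and pool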

On 27 Oct 2016 4:02 PM, "Victor Shafran"  wrote:

> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
> You help is highly appreciated
> Victor
>
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:
>
>> Hi Mich,
>>
>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>> DataFrames among different applications and contexts. The data typically
>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>> om/blog/effective-spark-rdds-with-alluxio
>>
>> Also, Alluxio also has the concept of an "Under filesystem", which can
>> help you access your existing data across different storage systems. Here
>> is more information about the unified namespace abilities:
>> http://www.alluxio.org/docs/master/en/Unified-and
>> -Transparent-Namespace.html
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>
 Hi Mich,
 Alluxio is the good option to go.

 Regards,
 Chanh

 On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
 wrote:


 There was a mention of using Zeppelin to share RDDs with many users.
 From the notes on Zeppelin it appears that this is sharing UI and I am not
 sure how easy it is going to be changing the result set with different
 users modifying say sql queries.

 There is also the idea of caching RDDs with something like Apache
 Ignite. Has anyone really tried this. Will that work with multiple
 applications?

 It looks feasible as RDDs are immutable and so are registered
 tempTables etc.

 Thanks


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.





>>>
>>
>
>
> --
>
> Victor Shafran
>
> VP R| Equalum
>
> Mobile: +972-523854883 | Email: victor.shaf...@equalum.io
>


Running Hive and Spark together with Dynamic Resource Allocation

2016-10-27 Thread rachmaninovquartet
Hi,

My team has a cluster running HDP, with Hive and Spark. We setup spark to
use dynamic resource allocation, for benefits such as not having to hard
code the number of executors and to free resources after using. Everything
is running on YARN.

The problem is that for Spark 1.5.2 with dynamic resource allocation to
function properly we needed to set yarn.nodemanager.aux-services in
yarn-site.xml to spark_shuffle, but this breaks Hive (1.2.1), since it is
looking for auxService:mapreduce_shuffle. Does anyone know of a way to
configure this so that both services run smoothly?
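A sketch of the yarn-site.xml entries that are usually listed together so both shuffle services stay registered (shown here as property = value; an assumption to verify against your HDP version, the class names being the standard MapReduce and Spark shuffle handlers):

yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
yarn.nodemanager.aux-services.mapreduce_shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService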

Thanks,

Ian



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Hive-and-Spark-together-with-Dynamic-Resource-Allocation-tp27968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Many Spark metric names do not include the application name

2016-10-27 Thread Amit Sela
Hi guys,


It seems that JvmSource / DAGSchedulerSource / BlockManagerSource
/ ExecutorAllocationManager and other metrics sources (except for the
StreamingSource) publish their metrics directly under the "driver" fragment
(or its executor counter-part) of the metric path without including the
application name.


For instance:

   - "spark.application_.driver.DAGScheduler.job.allJobs"
   - while I would expect it to be something like:


   - *"*spark.application_.driver*.myAppName.*DAGScheduler.job.allJobs
   *"*
   - just like it currently is in the *streaming* metrics
   (StreamingSource):


   - "spark.application_.driver.*myAppName*
   .StreamingMetrics.streaming.lastCompletedBatch_processingDelay"

I was wondering if there is a reason for not including the application name
in the metric path?


Your help would be much appreciated!


Regards,

Amit


Re: Sharing RDDS across applications and users

2016-10-27 Thread Victor Shafran
Hi Vincent,
Can you elaborate on how to implement the "shared sparkcontext and fair
scheduling" option?

My approach was to use the sparkSession.getOrCreate() method and register a temp
table in one application. However, I was not able to access this tempTable
in another application.
Your help is highly appreciated
Victor

On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:

> Hi Mich,
>
> Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames
> among different applications and contexts. The data typically stays in
> memory, but with Alluxio's tiered storage, the "colder" data can be evicted
> out to other medium, like SSDs and HDDs. Here is a blog post discussing
> Spark RDDs and Alluxio: https://www.alluxio.com/blog/effective-spark-rdds-
> with-alluxio
>
> Also, Alluxio also has the concept of an "Under filesystem", which can
> help you access your existing data across different storage systems. Here
> is more information about the unified namespace abilities:
> http://www.alluxio.org/docs/master/en/Unified-
> and-Transparent-Namespace.html
>
> Hope that helps,
> Gene
>
> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Chanh,
>>
>> Can it share RDDs.
>>
>> Personally I have not used either Alluxio or Ignite.
>>
>>
>>1. Are there major differences between these two
>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>have any experience you can kindly share
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>
>>> Hi Mich,
>>> Alluxio is the good option to go.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>>
>>> There was a mention of using Zeppelin to share RDDs with many users.
>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>> sure how easy it is going to be changing the result set with different
>>> users modifying say sql queries.
>>>
>>> There is also the idea of caching RDDs with something like Apache
>>> Ignite. Has anyone really tried this. Will that work with multiple
>>> applications?
>>>
>>> It looks feasible as RDDs are immutable and so are registered tempTables
>>> etc.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>
>


-- 

Victor Shafran

VP R| Equalum

Mobile: +972-523854883 | Email: victor.shaf...@equalum.io


Spark 2.0 on HDP

2016-10-27 Thread Deenar Toraskar
Hi

Has anyone tried running Spark 2.0 on HDP? I have managed to get around the
issues with the timeline service (by turning it off), but am now stuck because
YARN cannot find org.apache.spark.deploy.yarn.ExecutorLauncher.

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher

I have verified that both spark.driver.extraJavaOptions and
spark.yarn.am.extraJavaOptions have the hdp.version set correctly. Anything
else I am missing?
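One thing worth double-checking (an assumption, not a confirmed fix): that the Spark 2.0 jars are visible to the YARN containers, e.g. via spark-defaults.conf entries along these lines (spark.yarn.jars / spark.yarn.archive are standard Spark 2.0 properties; the HDFS locations are placeholders):

spark.yarn.jars     hdfs:///apps/spark-2.0.0/jars/*
# or a single pre-built archive:
spark.yarn.archive  hdfs:///apps/spark-2.0.0/spark-libs.zip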

Regards
Deenar



On 10 May 2016 at 13:48, Steve Loughran  wrote:

>
> On 9 May 2016, at 21:24, Jesse F Chen  wrote:
>
> I had been running fine until builds around 05/07/2016
>
> If I used the "--master yarn" in builds after 05/07, I got the following
> error...sounds like something jars are missing.
>
> I am using YARN 2.7.2 and Hive 1.2.1.
>
> Do I need something new to deploy related to YARN?
>
> bin/spark-sql -driver-memory 10g --verbose* --master yarn* --packages
> com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors
> 20 --executor-cores 2
>
> 16/05/09 13:15:21 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/05/09 13:15:21 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4041
> 16/05/09 13:15:21 INFO util.Utils: Successfully started service 'SparkUI'
> on port 4041.
> 16/05/09 13:15:21 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started
> at http://bigaperf116.svl.ibm.com:4041
> *Exception in thread "main" java.lang.NoClassDefFoundError:
> com/sun/jersey/api/client/config/ClientConfig*
> *at
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)*
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.
> serviceInit(YarnClientImpl.java:163)
> at org.apache.hadoop.service.AbstractService.init(
> AbstractService.java:163)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(
> YarnClientSchedulerBackend.scala:56)
> at org.apache.spark.scheduler.TaskSchedulerImpl.start(
> TaskSchedulerImpl.scala:148)
>
>
>
> Looks like Jersey client isn't on the classpath.
>
> 1. Consider filing a JIRA
> 2. set  spark.hadoop.yarn.timeline-service.enabled false to turn off ATS
>
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2246)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:762)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.
> init(SparkSQLEnv.scala:57)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(
> SparkSQLCLIDriver.scala:281)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(
> SparkSQLCLIDriver.scala:138)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(
> SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.
> config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 22 more
> 16/05/09 13:15:21 INFO storage.DiskBlockManager: Shutdown hook called
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Shutdown hook called
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-ac33b501-b9c3-47a3-93c8-fa02720bf4bb
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-65cb43d9-c122-4106-a0a8-ae7d92d9e19c
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-65cb43d9-c122-4106-a0a8-ae7d92d9e19c/userFiles-
> 46dde536-29e5-46b3-a530-e5ad6640f8b2
>
>
>
>
>
> JESSE CHEN
> Big Data Performance | IBM Analytics
>
> Office: 408 463 2296
> Mobile: 408 828 9068
> Email: jfc...@us.ibm.com
>
>
>


Re: Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread Marco Mistroni
I am running Spark inside Docker, though not connecting to a cluster.
How did you build Spark? Which profile did you use?
Please share details and I can try to replicate.
Kr

On 27 Oct 2016 2:30 pm, "ponkin"  wrote:

Hi,
May be someone already had experience to build docker image for spark?
I want to build docker image with spark inside but configured against remote
YARN cluster.
I have already created image with spark 1.6.2 inside.
But when I run
spark-shell --master yarn --deploy-mode client --driver-memory 32G
--executor-memory 32G --executor-cores 8
inside docker I get the following exception
Diagnostics: java.io.FileNotFoundException: File
file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not
exist

Any suggestions?
Do I need to load the spark-assembly into HDFS and set
spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar ?

Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91



--
View this message in context: http://apache-spark-user-list.
1001560.n3.nabble.com/Run-spark-shell-inside-Docker-
container-against-remote-YARN-cluster-tp27967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich,

Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames
among different applications and contexts. The data typically stays in
memory, but with Alluxio's tiered storage, the "colder" data can be evicted
out to other medium, like SSDs and HDDs. Here is a blog post discussing
Spark RDDs and Alluxio:
https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio

Also, Alluxio also has the concept of an "Under filesystem", which can help
you access your existing data across different storage systems. Here is
more information about the unified namespace abilities:
http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html

Hope that helps,
Gene

On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh 
wrote:

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
>any experience you can kindly share
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:29, Chanh Le  wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users. From
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure
>> how easy it is going to be changing the result set with different users
>> modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache Ignite.
>> Has anyone really tried this. Will that work with multiple applications?
>>
>> It looks feasible as RDDs are immutable and so are registered tempTables
>> etc.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>


Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread ponkin
Hi,
Maybe someone already has experience building a Docker image for Spark?
I want to build a Docker image with Spark inside, but configured against a remote
YARN cluster.
I have already created an image with Spark 1.6.2 inside.
But when I run 
spark-shell --master yarn --deploy-mode client --driver-memory 32G
--executor-memory 32G --executor-cores 8
inside docker I get the following exception
Diagnostics: java.io.FileNotFoundException: File
file:/usr/local/spark/lib/spark-assembly-1.6.2-hadoop2.2.0.jar does not
exist

Any suggestions?
Do I need to load the spark-assembly into HDFS and set
spark.yarn.jar=hdfs://spark-assembly-1.6.2-hadoop2.2.0.jar ?
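For reference, a sketch of that approach (the HDFS directory is a placeholder; an assumption rather than a verified fix for this image):

hdfs dfs -put spark-assembly-1.6.2-hadoop2.2.0.jar /apps/spark/
# then, in spark-defaults.conf inside the image:
spark.yarn.jar hdfs:///apps/spark/spark-assembly-1.6.2-hadoop2.2.0.jar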

Here is my Dockerfile
https://gist.github.com/ponkin/cac0a071e7fe75ca7c390b7388cf4f91



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Run-spark-shell-inside-Docker-container-against-remote-YARN-cluster-tp27967.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Reading AVRO from S3 - No parallelism

2016-10-27 Thread Prithish
I am trying to read a bunch of AVRO files from an S3 folder using Spark 2.0.
No matter how many executors I use or what configuration changes I make,
the cluster doesn't seem to use all the executors. I am using the
com.databricks.spark.avro library from Databricks to read the AVRO files.

However, if I try the same on CSV files (same S3 folder, same configuration
and cluster), it does use all executors.

Is there something that I need to do to enable parallelism when using the
AVRO databricks library?
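A minimal sketch of how one might check and spread the input partitioning (the path and partition count are placeholders; this is a guess at the cause, not a confirmed diagnosis):

val avroDF = spark.read.format("com.databricks.spark.avro").load("s3a://some-bucket/avro-folder/")
println(avroDF.rdd.getNumPartitions)   // if this is small, only a few tasks (and executors) get work
val spread = avroDF.repartition(64)    // spreads subsequent stages across more tasks/executors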

Thanks for your help.


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
For this you will need to contribute...

On 27 Oct 2016 1:35 PM, "Mich Talebzadeh"  wrote:

> so I assume Ignite will not work with Spark version >=2?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 12:27, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> some options:
>> - ignite for spark 1.5, can deep store on cassandra
>> - alluxio for all spark versions, can deep store on hdfs, gluster...
>>
>> ==> these are best for sharing between jobs
>>
>> - shared sparkcontext and fair scheduling, seems to be not thread safe
>> - spark jobserver and namedRDD, CRUD thread safe RDD sharing between
>> spark jobs
>> ==> these are best for sharing between users
>>
>> 2016-10-27 12:59 GMT+02:00 vincent gromakowski <
>> vincent.gromakow...@gmail.com>:
>>
>>> I would prefer sharing the spark context  and using FAIR scheduler for
>>> user concurrency
>>>
>>> On 27 Oct 2016 12:48 PM, "Mich Talebzadeh"  wrote:
>>>
 thanks Vince.

 So Ignite uses some hash/in-memory indexing.

 The question is in practice is there much use case to use these two
 fabrics for sharing RDDs.

 Remember all RDBMSs do this through shared memory.

 In layman's term if I have two independent spark-submit running, can
 they share result set. For example the same tempTable etc?

 Cheers

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 27 October 2016 at 11:44, vincent gromakowski <
 vincent.gromakow...@gmail.com> wrote:

> Ignite works only with spark 1.5
> Ignite leverage indexes
> Alluxio provides tiering
> Alluxio easily integrates with underlying FS
>
> On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:
>
>> Thanks Chanh,
>>
>> Can it share RDDs.
>>
>> Personally I have not used either Alluxio or Ignite.
>>
>>
>>1. Are there major differences between these two
>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>have any experience you can kindly share
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>
>>> Hi Mich,
>>> Alluxio is the good option to go.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> There was a mention of using Zeppelin to share RDDs with many users.
>>> From the notes on Zeppelin it appears that this is sharing UI and I am 
>>> not
>>> sure how easy it is going to be changing the result set with different
>>> users modifying say sql queries.
>>>
>>> There is also the idea of caching RDDs with something like Apache
>>> Ignite. Has anyone really tried this. Will that work with multiple
>>> applications?
>>>
>>> It looks feasible as RDDs are immutable and so are registered
>>> tempTables etc.
>>>
>>> Thanks
>>>
>>>

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
so I assume Ignite will not work with Spark version >=2?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 12:27, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> some options:
> - ignite for spark 1.5, can deep store on cassandra
> - alluxio for all spark versions, can deep store on hdfs, gluster...
>
> ==> these are best for sharing between jobs
>
> - shared sparkcontext and fair scheduling, seems to be not thread safe
> - spark jobserver and namedRDD, CRUD thread safe RDD sharing between spark
> jobs
> ==> these are best for sharing between users
>
> 2016-10-27 12:59 GMT+02:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
>> I would prefer sharing the spark context  and using FAIR scheduler for
>> user concurrency
>>
>> On 27 Oct 2016 12:48 PM, "Mich Talebzadeh"  wrote:
>>
>>> thanks Vince.
>>>
>>> So Ignite uses some hash/in-memory indexing.
>>>
>>> The question is in practice is there much use case to use these two
>>> fabrics for sharing RDDs.
>>>
>>> Remember all RDBMSs do this through shared memory.
>>>
>>> In layman's term if I have two independent spark-submit running, can
>>> they share result set. For example the same tempTable etc?
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:44, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
 Ignite works only with spark 1.5
 Ignite leverage indexes
 Alluxio provides tiering
 Alluxio easily integrates with underlying FS

 On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>have any experience you can kindly share
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:29, Chanh Le  wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users.
>> From the notes on Zeppelin it appears that this is sharing UI and I am 
>> not
>> sure how easy it is going to be changing the result set with different
>> users modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache
>> Ignite. Has anyone really tried this. Will that work with multiple
>> applications?
>>
>> It looks feasible as RDDs are immutable and so are registered
>> tempTables etc.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
some options:
- ignite for spark 1.5, can deep store on cassandra
- alluxio for all spark versions, can deep store on hdfs, gluster...

==> these are best for sharing between jobs

- shared sparkcontext and fair scheduling, seems to be not thread safe
- spark jobserver and namedRDD, CRUD thread safe RDD sharing between spark
jobs
==> these are best for sharing between users

2016-10-27 12:59 GMT+02:00 vincent gromakowski <
vincent.gromakow...@gmail.com>:

> I would prefer sharing the spark context  and using FAIR scheduler for
> user concurrency
>
> On 27 Oct 2016 12:48 PM, "Mich Talebzadeh"  wrote:
>
>> thanks Vince.
>>
>> So Ignite uses some hash/in-memory indexing.
>>
>> The question is in practice is there much use case to use these two
>> fabrics for sharing RDDs.
>>
>> Remember all RDBMSs do this through shared memory.
>>
>> In layman's term if I have two independent spark-submit running, can they
>> share result set. For example the same tempTable etc?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:44, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> Ignite works only with spark 1.5
>>> Ignite leverage indexes
>>> Alluxio provides tiering
>>> Alluxio easily integrates with underlying FS
>>>
>>> On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:
>>>
 Thanks Chanh,

 Can it share RDDs.

 Personally I have not used either Alluxio or Ignite.


1. Are there major differences between these two
2. Have you tried Alluxio for sharing Spark RDDs and if so do you
have any experience you can kindly share

 Regards


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 27 October 2016 at 11:29, Chanh Le  wrote:

> Hi Mich,
> Alluxio is the good option to go.
>
> Regards,
> Chanh
>
> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> There was a mention of using Zeppelin to share RDDs with many users.
> From the notes on Zeppelin it appears that this is sharing UI and I am not
> sure how easy it is going to be changing the result set with different
> users modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache
> Ignite. Has anyone really tried this. Will that work with multiple
> applications?
>
> It looks feasible as RDDs are immutable and so are registered
> tempTables etc.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
>
>

>>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
I would prefer sharing the spark context  and using FAIR scheduler for user
concurrency

On 27 Oct 2016 12:48 PM, "Mich Talebzadeh"  wrote:

> thanks Vince.
>
> So Ignite uses some hash/in-memory indexing.
>
> The question is in practice is there much use case to use these two
> fabrics for sharing RDDs.
>
> Remember all RDBMSs do this through shared memory.
>
> In layman's term if I have two independent spark-submit running, can they
> share result set. For example the same tempTable etc?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:44, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Ignite works only with spark 1.5
>> Ignite leverage indexes
>> Alluxio provides tiering
>> Alluxio easily integrates with underlying FS
>>
>> On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>
 Hi Mich,
 Alluxio is the good option to go.

 Regards,
 Chanh

 On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
 wrote:


 There was a mention of using Zeppelin to share RDDs with many users.
 From the notes on Zeppelin it appears that this is sharing UI and I am not
 sure how easy it is going to be changing the result set with different
 users modifying say sql queries.

 There is also the idea of caching RDDs with something like Apache
 Ignite. Has anyone really tried this. Will that work with multiple
 applications?

 It looks feasible as RDDs are immutable and so are registered
 tempTables etc.

 Thanks


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.





>>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
I only tried Alluxio, so I can’t give you a comparison.
In my experience, I use Alluxio for big data sets (50GB - 100GB) which are 
the input of the pipeline jobs, so you can reuse the result from a previous job.


> On Oct 27, 2016, at 5:39 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le  > wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh > > wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 



Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
thanks Vince.

So Ignite uses some hash/in-memory indexing.

The question is in practice is there much use case to use these two fabrics
for sharing RDDs.

Remember all RDBMSs do this through shared memory.

In layman's term if I have two independent spark-submit running, can they
share result set. For example the same tempTable etc?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:44, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Ignite works only with spark 1.5
> Ignite leverage indexes
> Alluxio provides tiering
> Alluxio easily integrates with underlying FS
>
> On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:
>
>> Thanks Chanh,
>>
>> Can it share RDDs.
>>
>> Personally I have not used either Alluxio or Ignite.
>>
>>
>>1. Are there major differences between these two
>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>have any experience you can kindly share
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>
>>> Hi Mich,
>>> Alluxio is the good option to go.
>>>
>>> Regards,
>>> Chanh
>>>
>>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>>
>>> There was a mention of using Zeppelin to share RDDs with many users.
>>> From the notes on Zeppelin it appears that this is sharing UI and I am not
>>> sure how easy it is going to be changing the result set with different
>>> users modifying say sql queries.
>>>
>>> There is also the idea of caching RDDs with something like Apache
>>> Ignite. Has anyone really tried this. Will that work with multiple
>>> applications?
>>>
>>> It looks feasible as RDDs are immutable and so are registered tempTables
>>> etc.
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>


Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with spark 1.5
Ignite leverage indexes
Alluxio provides tiering
Alluxio easily integrates with underlying FS

On 27 Oct 2016 12:39 PM, "Mich Talebzadeh"  wrote:

> Thanks Chanh,
>
> Can it share RDDs.
>
> Personally I have not used either Alluxio or Ignite.
>
>
>1. Are there major differences between these two
>2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
>any experience you can kindly share
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 October 2016 at 11:29, Chanh Le  wrote:
>
>> Hi Mich,
>> Alluxio is the good option to go.
>>
>> Regards,
>> Chanh
>>
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
>> wrote:
>>
>>
>> There was a mention of using Zeppelin to share RDDs with many users. From
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure
>> how easy it is going to be changing the result set with different users
>> modifying say sql queries.
>>
>> There is also the idea of caching RDDs with something like Apache Ignite.
>> Has anyone really tried this. Will that work with multiple applications?
>>
>> It looks feasible as RDDs are immutable and so are registered tempTables
>> etc.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh,

Can it share RDDs.

Personally I have not used either Alluxio or Ignite.


   1. Are there major differences between these two
   2. Have you tried Alluxio for sharing Spark RDDs and if so do you have
   any experience you can kindly share

Regards


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:29, Chanh Le  wrote:

> Hi Mich,
> Alluxio is the good option to go.
>
> Regards,
> Chanh
>
> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
> wrote:
>
>
> There was a mention of using Zeppelin to share RDDs with many users. From
> the notes on Zeppelin it appears that this is sharing UI and I am not sure
> how easy it is going to be changing the result set with different users
> modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache Ignite.
> Has anyone really tried this. Will that work with multiple applications?
>
> It looks feasible as RDDs are immutable and so are registered tempTables
> etc.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich,
Alluxio is the good option to go. 

Regards,
Chanh

> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh  
> wrote:
> 
> 
> There was a mention of using Zeppelin to share RDDs with many users. From the 
> notes on Zeppelin it appears that this is sharing UI and I am not sure how 
> easy it is going to be changing the result set with different users modifying 
> say sql queries.
> 
> There is also the idea of caching RDDs with something like Apache Ignite. Has 
> anyone really tried this. Will that work with multiple applications?
> 
> It looks feasible as RDDs are immutable and so are registered tempTables etc.
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  



Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
There was a mention of using Zeppelin to share RDDs with many users. From
the notes on Zeppelin it appears that this is sharing UI and I am not sure
how easy it is going to be changing the result set with different users
modifying say sql queries.

There is also the idea of caching RDDs with something like Apache Ignite.
Has anyone really tried this. Will that work with multiple applications?

It looks feasible as RDDs are immutable and so are registered tempTables
etc.

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Using Hive UDTF in SparkSQL

2016-10-27 Thread Lokesh Yadav
Hello

I am trying to use a Hive UDTF function in Spark SQL, but somehow it's not
working for me as intended and I am not able to understand the behavior.

When I try to register a function like this:
create temporary function SampleUDTF_01 as
'com.fl.experiments.sparkHive.SampleUDTF' using JAR
'hdfs:///user/root/sparkHive-1.0.0.jar';
It successfully registers the function, but gives me a 'not a registered
function' error when I try to run that function. Also it doesn't show up in
the list when I do a 'show functions'.

Another case:
When I try to register the same function as a temporary function using a
local jar (the hdfs path doesn't work with temporary function, that is
weird too), it registers, and I am able to successfully run that function
as well. Another weird thing is that I am not able to drop that function
using the 'drop function ...' statement, even though the function shows up in the
function registry.
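For the drop, a sketch of the temporary-function form of the statement (an assumption that it matches how the function was registered; HiveQL syntax):

drop temporary function if exists SampleUDTF_01;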

I am stuck with this, any help would be really appreciated.
Thanks

Regards,
Lokesh Yadav


Using SparkLauncher in cluster mode, in a Mesos cluster

2016-10-27 Thread Nerea Ayestarán
I am trying to launch an Apache Spark job from a Java class to an Apache
Mesos cluster in cluster deploy mode. I use SparkLauncher configured as
follows:

Process sparkProcess = new SparkLauncher()
.setAppResource("hdfs://auto-ha/path/to/jar/SparkPi.jar")
.setMainClass("com.ik.SparkPi")
.setMaster("mesos://dispatcher:7077")
.setConf("spark.executor.uri",
"hdfs://auto-ha/spark/spark-2.0.0-bin-hadoop2.7.tgz")
.setSparkHome("/local/path/to/spark-2.0.0-bin-hadoop2.7")
.setDeployMode("cluster")
.setAppName("PI")
.setVerbose(true)
.launch();


If I submit the same job with the same configuration using spark-submit the
job works perfectly, but in the case of the SparkLauncher I get the
following error:

Error: Cannot load main class from JAR
file:/var/lib/mesos/slaves/00a81353-d68c-4b7c-b050-d9dfb2a74646-S24/frameworks/52806e97-565b-43d0-90a1-979a61196cb8-0007/executors/driver-20161024163345-0095/runs/c904dc98-8365-4270-895c-374c59ff2b34/spark-2.0.0-bin-hadoop2.7/2


If I go to the Mesos task ui, I can see the spark folder and the
SparkPi.jar.

What am I missing? If I don't specify the local Spark home it doesn't work
either.

Thanks in advance.


Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-27 Thread Mehrez Alachheb
I think you should just shut down your SparkContext at the end.
sc.stop()

2016-10-21 22:47 GMT+02:00 Chetan Khatri :

> Hello Spark Users,
>
> I am writing around 10 GB of processed data to Parquet, on a machine with 1 TB
> of HDD, 102 GB of RAM and 16 vCores on Google Cloud.
>
> Every time I write to Parquet, the Spark UI shows that the stages succeeded,
> but the spark shell holds the context in wait mode for almost 10 minutes; then it
> clears broadcast and accumulator shared variables.
>
> Can we speed this up?
>
> Thanks.
>
> --
> Yours Aye,
> Chetan Khatri.
> M.+91 7 80574
> Data Science Researcher
> INDIA
>


Re: Spark security

2016-10-27 Thread Steve Loughran

On 13 Oct 2016, at 14:40, Mendelson, Assaf wrote:

Hi,
We have a spark cluster and we wanted to add some security for it. I was 
looking at the documentation (in  
http://spark.apache.org/docs/latest/security.html) and had some questions.
1.   Do all executors listen on the same blockManager port? For example, in
YARN there are multiple executors per node; do they all listen on the same port?

On YARN the executors will come up on their own ports.

2.   Are ports defined in earlier versions (e.g.
http://spark.apache.org/docs/1.6.1/security.html) and removed in the latest
(such as spark.executor.port and spark.fileserver.port) really gone, so that
they can be blocked?
3.   If I define multiple workers per node in spark standalone mode, how do 
I set the different ports for each worker (there is only one 
spark.worker.ui.port / SPARK_WORKER_WEBUI_PORT definition. Do I have to start 
each worker separately to configure a port?) The same is true for the worker 
port (SPARK_WORKER_PORT)
4.   Is it possible to encrypt the logs instead of just limiting with 
permissions the log directory?

If writing to HDFS on a Hadoop 2.7+ cluster you can use HDFS Encryption At Rest
to encrypt the data on the disks. If you are talking to S3 with the Hadoop 2.8+
libraries (not officially shipping), you can use S3 server-side encryption with
AWS-managed keys too.

5.   Is the communication between the servers encrypted (e.g. using ssh?)

you can enable this;

https://spark.apache.org/docs/latest/security.html
https://spark.apache.org/docs/latest/configuration.html#security

spark.network.sasl.serverAlwaysEncrypt true
spark.authenticate.enableSaslEncryption true
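
Note that, as far as I recall, those SASL settings only take effect once
authentication is enabled, so you would typically pair them with something
like the sketch below; outside YARN you have to supply a shared secret
yourself, while on YARN it is generated for you.

# shared secret is only needed for standalone/Mesos; on YARN it is generated
spark.authenticate        true
spark.authenticate.secret <your-shared-secret>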

I *believe* that encrypted shuffle comes with 2.1   
https://issues.apache.org/jira/browse/SPARK-5682

as usual, look in the source to really understand

There are various ways to interact with Spark, and within it; you need to make
sure they are all secured against malicious users:

- web UI: on YARN, you can use SPNEGO to Kerberos-auth the YARN RM proxy; the
  Spark UI will 302 all direct requests to its web UI back to that proxy.
  Communications behind the scenes between the RM and the Spark UI will not,
  AFAIK, be encrypted/authed.
- spark driver <-> executor comms
- bulk data exchange between drivers
- shuffle service in the executor, or hosted inside the YARN node managers
- spark <-> filesystem communications
- spark to other data source communications (Kafka, etc.)

You're going to have to go through them all and do the checklist.

As is usual in an open source project, documentation improvements are always
welcome. There is a good security doc in the Spark source, but I'm sure extra
contributions would be appreciated.




6.   Are there any additional best practices beyond what is written in the 
documentation?
Thanks,

In a YARN cluster, Kerberos is mandatory if you want any form of security. 
Sorry.



Re: spark infers date to be timestamp type

2016-10-27 Thread Steve Loughran
CSV type inference isn't really ideal: it does a full scan of the file to
determine the types, so you are doubling the amount of data you need to read.
Unless you are just exploring files in your notebook, I'd recommend doing it
once, getting the schema from it, and then using that as the basis for the
code snippet where you really define the schema. That's where you can
explicitly declare the schema types if the inferred ones aren't great.

(maybe I should write something which prints out the scala/py code for that 
declaration rather than having to do it by hand...)
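
A minimal Scala sketch of that workflow, using the test.csv file and column
name from the example quoted below (the dateFormat value is an assumption
about the data, not something verified):

import org.apache.spark.sql.types.{DateType, StructField, StructType}

// One-off exploration: infer the schema once and print it.
spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .printSchema()

// Real code path: declare the schema explicitly (DateType instead of the
// inferred timestamp) and skip the extra inference pass over the data.
val schema = StructType(Seq(StructField("date", DateType, nullable = true)))
spark.read
  .format("csv")
  .option("header", true)
  .option("dateFormat", "yyyy-MM-dd")
  .schema(schema)
  .load("test.csv")
  .printSchema()   // root |-- date: date (nullable = true)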

On 27 Oct 2016, at 05:55, Hyukjin Kwon wrote:

Hi Koert,


Sorry, I thought you meant this is a regression between 2.0.0 and 2.0.1. I just
checked: it has never supported inferring DateType [1].

Yes, it currently only infers such data as timestamps.


[1]https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92




2016-10-27 9:12 GMT+09:00 Anand Viswanathan:
Hi,

You can use a custom schema (with DateType) and specify dateFormat in
.option(), or, on the DataFrame side, you can cast the timestamp column to a
date.
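
A small sketch of the cast approach (assuming the DataFrame is read with
inferSchema as in the example quoted below, so the column arrives as a
timestamp):

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .withColumn("date", col("date").cast("date"))   // timestamp -> date

df.printSchema()   // |-- date: date (nullable = true)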

Thanks and regards,
Anand Viswanathan

On Oct 26, 2016, at 8:07 PM, Koert Kuipers wrote:

hey,
i create a file called test.csv with contents:
date
2015-01-01
2016-03-05

next i run this code in spark 2.0.1:
spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("test.csv")
  .printSchema

the result is:
root
 |-- date: timestamp (nullable = true)


On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon wrote:

There are now timestampFormat for TimestampType and dateFormat for DateType.

Do you mind if I ask you to share your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote:
is there a reason a column with dates in the format yyyy-MM-dd in a csv file is
inferred to be TimestampType and not DateType?

thanks! koert






Dynamic Resource Allocation in a standalone

2016-10-27 Thread Ofer Eliassaf
Hi,

I have a question/problem regarding dynamic resource allocation.
I am using Spark 1.6.2 with the standalone cluster manager.

I have one worker with 2 cores.

I set the following arguments in the spark-defaults.conf file on all
my nodes:

spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled true
spark.deploy.defaultCores 1

I run a sample application with many tasks.

I open port 4040 on the driver and I can verify that the above
configuration exists.

My problem is that no matter what I do, my application only gets 1 core even
though the other cores are available.

Is this normal, or do I have a problem in my configuration?


The behaviour I want is this:
I have many users working with the same Spark cluster.
I want each application to get a fixed number of cores unless the rest of the
cluster is idle, in which case the running applications should get the total
amount of cores until a new application arrives...
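
For reference, this is the kind of configuration I am experimenting with to
approximate that behaviour (illustrative values for a 2-core worker, not a
verified recipe; as I understand it, spark.deploy.defaultCores only applies
to applications that do not set spark.cores.max themselves):

spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   2
spark.cores.max                        2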


-- 
Regards,
Ofer Eliassaf


Re: Executor shutdown hook and initialization

2016-10-27 Thread Sean Owen
Init is easy -- initialize them in your singleton.
Shutdown is harder; a shutdown hook is probably the only reliable way to go.
Global state is not ideal in Spark. Consider initializing things like
connections per partition, and open/close them with the lifecycle of a
computation on a partition instead.
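
A minimal sketch of that per-partition pattern (Connection and openConnection
are hypothetical stand-ins for whatever resource the singleton currently
holds):

import org.apache.spark.rdd.RDD

trait Connection { def send(record: String): Unit; def close(): Unit }
def openConnection(): Connection = ???   // placeholder for the real connection factory

def writeOut(rdd: RDD[String]): Unit =
  rdd.foreachPartition { records =>
    val conn = openConnection()       // created once per partition, on the executor
    try records.foreach(conn.send)    // reused for every element in the partition
    finally conn.close()              // deterministic cleanup; no shutdown hook needed
  }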

On Wed, Oct 26, 2016 at 9:27 PM Walter rakoff 
wrote:

> Hello,
>
> Is there a way I can add an init() call when an executor is created? I'd
> like to initialize a few connections that are part of my singleton object.
> Preferably this happens before it runs the first task.
> Along the same lines, how can I provide a shutdown hook that cleans up these
> connections on termination.
>
> Thanks
> Walt
>


RE: No of partitions in a Dataframe

2016-10-27 Thread Jan Botorek
Hello, Nipun
In my opinion, "converting the dataframe to an RDD" wouldn't be a costly
operation, since DataFrame (Dataset) operations are always executed as RDDs
under the hood. I don't know which version of Spark you are on, but I assume
you are using 2.0.
I would therefore go for:

dataFrame.rdd.partitions

That returns an Array of partitions (written in Scala).
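
For example, in the Spark 2.0 shell (a small sketch):

val df = spark.range(0, 1000).toDF("id")   // any DataFrame will do
df.rdd.partitions.length                   // number of partitions
df.rdd.getNumPartitions                    // equivalent convenience method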

Regards,
Jan

From: Nipun Parasrampuria [mailto:paras...@umn.edu]
Sent: Thursday, October 27, 2016 12:01 AM
To: user@spark.apache.org
Subject: No of partitions in a Dataframe


How do I find the number of partitions in a dataframe without converting the
dataframe to an RDD (I'm assuming that would be a costly operation)?

If there's no way to do so, I wonder why the API doesn't include a method like
that (an explanation for why such a method would be useless, perhaps).

Thanks!
Nipun


RE: Spark security

2016-10-27 Thread Mendelson, Assaf
Anyone can assist with this?

From: Mendelson, Assaf [mailto:assaf.mendel...@rsa.com]
Sent: Thursday, October 13, 2016 3:41 PM
To: user@spark.apache.org
Subject: Spark security

Hi,
We have a spark cluster and we wanted to add some security for it. I was 
looking at the documentation (in  
http://spark.apache.org/docs/latest/security.html) and had some questions.

1.   Do all executors listen on the same blockManager port? For example, in
YARN there are multiple executors per node; do they all listen on the same port?

2.   Are ports defined in earlier versions (e.g.
http://spark.apache.org/docs/1.6.1/security.html) and removed in the latest
(such as spark.executor.port and spark.fileserver.port) really gone, so that
they can be blocked?

3.   If I define multiple workers per node in spark standalone mode, how do 
I set the different ports for each worker (there is only one 
spark.worker.ui.port / SPARK_WORKER_WEBUI_PORT definition. Do I have to start 
each worker separately to configure a port?) The same is true for the worker 
port (SPARK_WORKER_PORT)

4.   Is it possible to encrypt the logs instead of just limiting with 
permissions the log directory?

5.   Is the communication between the servers encrypted (e.g. using ssh?)

6.   Are there any additional best practices beyond what is written in the 
documentation?
Thanks,
Assaf.