Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread Rodrick Brown
Try setting the values in $SPARK_HOME/conf/spark-defaults.conf, i.e.:

$ egrep 'spark.(driver|executor).extra' /data/orchard/spark-2.0.1/conf/spark-defaults.conf
spark.executor.extraJavaOptions -Duser.timezone=UTC -Xloggc:garbage-collector.log
spark.driver.extraJavaOptions

[ANNOUNCE] Apache Bahir 2.0.1

2016-10-27 Thread Luciano Resende
The Apache Bahir PMC is pleased to announce the release of Apache Bahir 2.0.1, which is our first major release and provides the following extensions for Apache Spark 2.0.1: Akka Streaming, MQTT Streaming and Structured Streaming, Twitter Streaming, ZeroMQ Streaming. For more information about

Re: Need help Creating a rule using the Streaming API

2016-10-27 Thread patrickhuang
Hi, maybe you can try it like this?

val transformed = events.map(event => ((event.user, event.ip), 1)).reduceByKey(_ + _)
val alarm = transformed.filter(_._2 >= 10)

Patrick

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread prithish
The Avro files were 500-600kb in size and that folder contained around 1200 files. The total folder size was around 600mb. Will try repartition. Thank you.

Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread vonnagy
We were using 1.6, but now we are on 2.0.1. Both versions show the same issue. I dove deep into the Spark code and have identified that the extra java options are /not/ added to the process on the executors. At this point, I believe you have to use spark-defaults.conf to set any values that will

Re: Submit job with driver options in Mesos Cluster mode

2016-10-27 Thread csakoda
I'm seeing something very similar in my own Mesos/Spark Cluster. High level summary: When I use `--deploy-mode cluster`, java properties that I pass to my driver via `spark.driver.extraJavaOptions` are not available to the driver. I've confirmed this by inspecting the output of
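For concreteness, a sketch of the kind of submission being described; the dispatcher URL, property, and jar location are placeholders, not from the original message:

spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dmy.prop=value" \
  http://repo.example.com/my-app.jar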

Spark 2.0 with Hadoop 3.0?

2016-10-27 Thread adam kramer
Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases? Is there any reason why Hadoop 3.0 is a non-starter for use with Spark 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which would resolve our driver dependency issues. Thanks, Adam

Re: Spark UI error spark 2.0.1 hadoop 2.6

2016-10-27 Thread gpatcham
I was able to fix it by adding servlet 3.0 to the classpath.

Spark Streaming and Kinesis

2016-10-27 Thread Benjamin Kim
Has anyone worked with AWS Kinesis and retrieved data from it using Spark Streaming? I am having issues where it’s returning no data. I can connect to the Kinesis stream and describe using Spark. Is there something I’m missing? Are there specific IAM security settings needed? I just simply
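A minimal sketch of the receiver-based setup in question, assuming the spark-streaming-kinesis-asl artifact; the app, stream, endpoint, and region names are placeholders. One common reason for seeing no data is using InitialPositionInStream.LATEST, which only returns records that arrive after the receiver starts:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("KinesisCheck"), Seconds(10))
val stream = KinesisUtils.createStream(
  ssc, "kinesis-app", "my-stream", "https://kinesis.us-east-1.amazonaws.com",
  "us-east-1", InitialPositionInStream.TRIM_HORIZON, // read from the start of the stream
  Seconds(10), StorageLevel.MEMORY_AND_DISK_2)
stream.map(bytes => new String(bytes)).print()
ssc.start()
ssc.awaitTermination()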

Re: importing org.apache.spark.Logging class

2016-10-27 Thread Michael Armbrust
This was made internal to Spark. I'd suggest that you use slf4j directly.
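For reference, a minimal sketch of the slf4j replacement (the class is illustrative); marking the logger @transient lazy keeps it out of serialized closures:

import org.slf4j.LoggerFactory

class MyJob extends Serializable {
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def run(): Unit = log.info("doing work") // same call sites as the old Logging trait
}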

importing org.apache.spark.Logging class

2016-10-27 Thread Reth RM
Updated Spark to version 2.0.0 and have an issue with importing org.apache.spark.Logging. Any suggested fix for this issue?

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Davies Liu
Using a broadcast join can usually boost performance when you have enough memory; you should decrease the threshold or even disable it when there is not enough memory.

Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread Michael Armbrust
How big are your avro files? We collapse many small files into a single partition to eliminate scheduler overhead. If you need explicit parallelism you can also repartition.
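A sketch of the read-then-repartition suggestion, assuming the com.databricks:spark-avro package; the path and partition count are placeholders:

val df = spark.read
  .format("com.databricks.spark.avro")
  .load("s3a://my-bucket/avro-folder/")

val parallel = df.repartition(48) // spread the collapsed partitions across all executors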

Spark UI error spark 2.0.1 hadoop 2.6

2016-10-27 Thread gpatcham
Hi, I'm running spark-shell in yarn-client mode; the SparkContext starts and I am able to run commands, but the UI is not coming up and I see the errors below in the spark shell:

20:51:20 WARN servlet.ServletHandler: javax.servlet.ServletException: Could not determine the proxy server for redirection at

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
Thank you Davies, this worked! But what are the consequences of setting spark.sql.autoBroadcastJoinThreshold=0? Will it degrade or boost performance? Thank you again, Pietro

Re: Infinite Loop in Spark

2016-10-27 Thread Mark Hamstra
Using a single SparkContext for an extended period of time is how long-running Spark applications such as the Spark Job Server work (https://github.com/spark-jobserver/spark-jobserver). It's an established pattern.

Infinite Loop in Spark

2016-10-27 Thread Gervásio Santos
Hi guys! I'm developing an application in Spark that I'd like to run continuously. It would execute some actions, sleep for a while and go again. I was thinking of doing it in a standard infinite-loop way:

val sc = ...
while (true) {
  doStuff(...)
  sleep(...)
}

I would be running this (fairly

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
I can see how unquoted CSV would work if you escape delimiters, but I have never seen that in practice.

large scheduler delay in OnlineLDAOptimizer, (MLlib and LDA)

2016-10-27 Thread Xiaoye Sun
Hi, I am running some experiments with OnlineLDAOptimizer in Spark 1.6.1. My Spark cluster has 30 machines. However, I found that the Scheduler delay at job/stage "reduce at LDAOptimizer.scala:452" is extremely large when the LDA model is large. The delay could be tens of seconds. Does anyone

Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
I'd think quoting is only necessary if you are not escaping delimiters in data. But we can only share our opinions; it would be good to see something documented. This may be the cause of the issue: https://issues.apache.org/jira/browse/CSV-135

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
Well, my expectation would be that if you have delimiters in your data, you need to quote your values. If you then have quotes within your data, you need to escape them. So escaping is only necessary if quoted.

Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Do you mind sharing why escaping should not work without quotes?

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
That is what I would expect: escaping only works if quoted.

Re: CSV escaping not working

2016-10-27 Thread Jain, Nishit
Interesting finding: escaping works if data is quoted, but not otherwise.

If you have used spark-sas7bdat package to transform SAS data set to Spark, please be aware

2016-10-27 Thread Shi Yu
I found some major issues and wrote about them on my blog: https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/

Re: Using Hive UDTF in SparkSQL

2016-10-27 Thread Davies Liu
Could you file a JIRA for this bug?

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Davies Liu
I think this is caused by BroadcastHashJoin trying to use more memory than the driver has. Could you decrease spark.sql.autoBroadcastJoinThreshold (-1 or 0 means disabling it)?
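A one-line sketch of the suggested change, using the Spark 2.x runtime config:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) // disables broadcast joins entirely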

Re: TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread Pietro Pugni
I'm sorry, here's the formatted message text: I'm running an ETL process that joins table1 with other tables (CSV files), one table at a time (for example table1 with table2, table1 with table3, and so on). The join result is written to a PostgreSQL instance using JDBC. The entire process runs

TaskMemoryManager: Failed to allocate a page

2016-10-27 Thread pietrop
I'm running an ETL process that joins table1 with other tables (CSV files), one table at a time (for example table1 with table2, table1 with table3, and so on). The join result is written to a PostgreSQL instance using JDBC. The entire process runs successfully if I use table2, table3 and table4. If I

CSV escaping not working

2016-10-27 Thread Jain, Nishit
I am using spark-core version 2.0.1 with Scala 2.11. I have simple code to read a CSV file which has \ escapes:

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)

As per the documentation, \ is the default escape for the CSV reader, but it does not work. Spark is
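Given the finding elsewhere in this thread that escaping only takes effect inside quoted fields, a hedged sketch of a read that keeps quoting enabled alongside the escape character:

val df = spark.read
  .option("quote", "\"")  // keep quoting enabled rather than disabling it with null
  .option("escape", "\\") // backslash escape, honored within quoted fields
  .schema(mySchema)
  .csv(filePath)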

Re: Executor shutdown hook and initialization

2016-10-27 Thread Chawla,Sumit
Hi Sean, could you please elaborate on how this can be done on a per-partition basis? Regards, Sumit Chawla

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-27 Thread Kazuaki Ishizaki
Hi Chin Wei, thank you for confirming this on 2.0.1; I am happy to hear it never happens. The performance will be improved when this PR (https://github.com/apache/spark/pull/15219) is integrated. Regards, Kazuaki Ishizaki

Re: Executor shutdown hook and initialization

2016-10-27 Thread Walter rakoff
Thanks for the info Sean. I'm initializing them in a singleton, but Scala objects are evaluated lazily, so it gets initialized only when the first task is run (and makes use of the object). The plan is to start a background thread in the object that does periodic cache refresh too. I'm trying to see if
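A rough sketch of that pattern, with hypothetical load/refresh logic; the lazy val body runs once per executor JVM, on the first task that touches the object:

object ExecutorCache {
  @volatile private var cache: Map[String, String] = Map.empty

  lazy val init: Unit = {
    cache = loadCache() // hypothetical initial load
    val refresher = new Thread(new Runnable {
      def run(): Unit = while (true) { Thread.sleep(60000); cache = loadCache() }
    })
    refresher.setDaemon(true) // don't keep the executor JVM alive on shutdown
    refresher.start()
  }

  private def loadCache(): Map[String, String] = Map.empty // placeholder

  def get(key: String): Option[String] = { init; cache.get(key) }
}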

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi, just point all users at the same app with a common SparkContext. For instance, Akka HTTP receives queries from users and launches concurrent Spark SQL queries in different actor threads. The only prerequisite is to launch the different jobs in different threads (like with actors). Be careful: it's

Running Hive and Spark together with Dynamic Resource Allocation

2016-10-27 Thread rachmaninovquartet
Hi, my team has a cluster running HDP, with Hive and Spark. We set up Spark to use dynamic resource allocation, for benefits such as not having to hard-code the number of executors and freeing resources after use. Everything is running on YARN. The problem is that for Spark 1.5.2 with dynamic

Many Spark metric names do not include the application name

2016-10-27 Thread Amit Sela
Hi guys, It seems that JvmSource / DAGSchedulerSource / BlockManagerSource / ExecutorAllocationManager and other metrics sources (except for the StreamingSource) publish their metrics directly under the "driver" fragment (or its executor counter-part) of the metric path without including the

Re: Sharing RDDS across applications and users

2016-10-27 Thread Victor Shafran
Hi Vincent, can you elaborate on how to implement the "shared SparkContext and fair scheduling" option? My approach was to use sparkSession.getOrCreate() and register a temp table in one application; however, I was not able to access that temp table from another application. Your help is highly
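A sketch of why that fails: a temp view lives in the catalog of the session that created it, so a second application (a separate driver JVM with its own SparkSession) cannot resolve it. Names are illustrative:

import org.apache.spark.sql.SparkSession

// application A
val sparkA = SparkSession.builder().appName("A").getOrCreate()
sparkA.range(10).createOrReplaceTempView("shared_try")

// application B, running in a different JVM, gets its own session and catalog,
// so this fails with "Table or view not found: shared_try"
val sparkB = SparkSession.builder().appName("B").getOrCreate()
sparkB.sql("SELECT * FROM shared_try")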

Spark 2.0 on HDP

2016-10-27 Thread Deenar Toraskar
Hi, has anyone tried running Spark 2.0 on HDP? I have managed to get around the issues with the timeline service (by turning it off), but now I am stuck because YARN cannot find org.apache.spark.deploy.yarn.ExecutorLauncher. Error: Could not find or load main class

Re: Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread Marco Mistroni
I am running Spark inside Docker, though not connecting to a cluster. How did you build Spark? Which profile did you use? Please share details and I can try to replicate. Kr

Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich, yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames among different applications and contexts. The data typically stays in memory, but with Alluxio's tiered storage, the "colder" data can be evicted to other media, like SSDs and HDDs. Here is a blog post

Run spark-shell inside Docker container against remote YARN cluster

2016-10-27 Thread ponkin
Hi, maybe someone already has experience building a Docker image for Spark? I want to build a Docker image with Spark inside, but configured against a remote YARN cluster. I have already created an image with Spark 1.6.2 inside. But when I run spark-shell --master yarn --deploy-mode client --driver-memory

Reading AVRO from S3 - No parallelism

2016-10-27 Thread Prithish
I am trying to read a bunch of AVRO files from a S3 folder using Spark 2.0. No matter how many executors I use or what configuration changes I make, the cluster doesn't seem to use all the executors. I am using the com.databricks.spark.avro library from databricks to read the AVRO. However, if I

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
For this you will need to contribute...

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
So I assume Ignite will not work with Spark version >= 2? Dr Mich Talebzadeh

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Some options:
- Ignite for Spark 1.5, can deep-store on Cassandra
- Alluxio for all Spark versions, can deep-store on HDFS, Gluster...
==> these are best for sharing between jobs
- shared SparkContext and fair scheduling, seems to be not thread safe
- Spark Jobserver and namedRDD, CRUD thread

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
I would prefer sharing the Spark context and using the FAIR scheduler for user concurrency.

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich, I have only tried Alluxio, so I can't give you a comparison. In my experience, I use Alluxio for big data sets (50GB - 100GB) which are the input of pipeline jobs, so you can reuse the result from a previous job.

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Vince. So Ignite uses some hash/in-memory indexing. The question is: in practice, is there much of a use case for using these two fabrics for sharing RDDs? Remember, all RDBMSs do this through shared memory. In layman's terms, if I have two independent spark-submit jobs running, can they share result

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
- Ignite works only with Spark 1.5
- Ignite leverages indexes
- Alluxio provides tiering
- Alluxio easily integrates with the underlying FS

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh. Can it share RDDs? Personally I have not used either Alluxio or Ignite.
1. Are there major differences between these two?
2. Have you tried Alluxio for sharing Spark RDDs and, if so, do you have any experience you can kindly share?
Regards, Dr Mich Talebzadeh

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich, Alluxio is a good option to go with. Regards, Chanh

Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
There was a mention of using Zeppelin to share RDDs with many users. From the notes on Zeppelin it appears that this is sharing the UI, and I am not sure how easy it is going to be to change the result set with different users modifying, say, SQL queries. There is also the idea of caching RDDs with

Using Hive UDTF in SparkSQL

2016-10-27 Thread Lokesh Yadav
Hello, I am trying to use a Hive UDTF function in Spark SQL, but somehow it's not working for me as intended and I am not able to understand the behavior. When I try to register a function like this:

create temporary function SampleUDTF_01 as 'com.fl.experiments.sparkHive.SampleUDTF' using JAR
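For completeness, a hedged sketch of such a registration in a Hive-enabled session; the jar location is hypothetical (the original path is truncated above), and the column and table names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql(
  "CREATE TEMPORARY FUNCTION SampleUDTF_01 " +
  "AS 'com.fl.experiments.sparkHive.SampleUDTF' " +
  "USING JAR 'hdfs:///tmp/sparkHive.jar'") // hypothetical jar path
spark.sql("SELECT SampleUDTF_01(someColumn) FROM someTable").show()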

Using SparkLauncher in cluster mode, in a Mesos cluster

2016-10-27 Thread Nerea Ayestarán
I am trying to launch an Apache Spark job from a Java class to an Apache Mesos cluster in cluster deploy mode. I use SparkLauncher configured as follows:

Process sparkProcess = new SparkLauncher()
    .setAppResource("hdfs://auto-ha/path/to/jar/SparkPi.jar")
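A hedged completion of that snippet in the same Java style; the main class and dispatcher URL are assumptions, not taken from the original message:

import org.apache.spark.launcher.SparkLauncher;

public class LaunchOnMesos {
  public static void main(String[] args) throws Exception {
    Process sparkProcess = new SparkLauncher()
        .setAppResource("hdfs://auto-ha/path/to/jar/SparkPi.jar")
        .setMainClass("org.apache.spark.examples.SparkPi") // assumed main class
        .setMaster("mesos://dispatcher-host:7077")         // assumed MesosClusterDispatcher URL
        .setDeployMode("cluster")
        .launch();
    sparkProcess.waitFor(); // block until the submission process exits
  }
}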

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-27 Thread Mehrez Alachheb
I think you should just shut down your SparkContext at the end: sc.stop()

2016-10-21 22:47 GMT+02:00 Chetan Khatri: > Hello Spark Users, I am writing around 10 GB of processed data to Parquet, where I have 1 TB of HDD and 102 GB of RAM, on a 16 vCore machine on Google

Re: Spark security

2016-10-27 Thread Steve Loughran
On 13 Oct 2016, at 14:40, Mendelson, Assaf wrote: Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in http://spark.apache.org/docs/latest/security.html) and had some questions.

Re: spark infers date to be timestamp type

2016-10-27 Thread Steve Loughran
CSV type inference isn't really ideal: it does a full scan of a file to determine this; you are doubling the amount of data you need to read. Unless you are just exploring files in your notebook, I'd recommend doing it once, getting the schema from it, then using that as the basis for the code
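A sketch of that workflow (paths are placeholders): infer once, then pin the schema so later reads skip the inference scan:

val sample = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // the full scan happens here, once
  .csv("s3a://bucket/sample.csv")

val df = spark.read
  .option("header", "true")
  .schema(sample.schema) // no inference scan on subsequent reads
  .csv("s3a://bucket/data/*.csv")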

Dynamic Resource Allocation in a standalone

2016-10-27 Thread Ofer Eliassaf
Hi, I have a question/problem regarding dynamic resource allocation. I am using Spark 1.6.2 with the standalone cluster manager. I have one worker with 2 cores. I set the following arguments in the spark-defaults.conf file on all my nodes:

spark.dynamicAllocation.enabled true
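For context, a sketch of what such a spark-defaults.conf typically contains (values are illustrative); dynamic allocation also requires the external shuffle service:

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 2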

Re: Executor shutdown hook and initialization

2016-10-27 Thread Sean Owen
Init is easy -- initialize them in your singleton. Shutdown is harder; a shutdown hook is probably the only reliable way to go. Global state is not ideal in Spark. Consider initializing things like connections per partition, and open/close them with the lifecycle of a computation on a partition
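A minimal sketch of the per-partition lifecycle described above, with a hypothetical connection type:

rdd.foreachPartition { records =>
  val conn = MyStore.connect() // hypothetical per-partition resource
  try {
    records.foreach(r => conn.write(r))
  } finally {
    conn.close() // deterministic cleanup, no executor shutdown hook required
  }
}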

RE: No of partitions in a Dataframe

2016-10-27 Thread Jan Botorek
Hello, Nipun. In my opinion, „converting the dataframe to an RDD“ wouldn't be a costly operation, since DataFrame (Dataset) operations are under the hood always executed as RDDs. I don't know which version of Spark you are on, but I suppose you use 2.0. I would therefore go for:
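Presumably the suggestion continues along these lines (a sketch for Spark 2.0):

// asking the underlying RDD for its partition count does not trigger a job
val numPartitions = df.rdd.getNumPartitions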

RE: Spark security

2016-10-27 Thread Mendelson, Assaf
Can anyone assist with this? From: Mendelson, Assaf Sent: Thursday, October 13, 2016 3:41 PM To: user@spark.apache.org Subject: Spark security Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in