Re: Class cast exception while using Data Frames

2018-03-27 Thread Shmuel Blitz
Hi Nikhil,

Can you please post a code snippet that reproduces the issue?

Shmuel

On Tue, Mar 27, 2018 at 12:55 AM, Nikhil Goyal wrote:
> |-- myMap: map (nullable = true)
> |    |-- key: struct
> |    |-- value: double (valueContainsNull = true)
> |    |    |-- _1:

Re: Calculate co-occurring terms

2018-03-27 Thread Donni Khan
Hi again,

I found an example in Scala, but I don't have any experience with Scala. Can anyone convert it to Java, please?

Thank you,
Donni

On Fri, Mar 23, 2018 at 8:57 AM, Donni Khan
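
For reference, a minimal Scala sketch of one common way to count co-occurring terms in Spark (this assumes "co-occurring" means appearing in the same document, and is not necessarily the example Donni found):

```
// spark-shell sketch: count pairs of terms appearing in the same document.
val docs = sc.parallelize(Seq(
  "spark streaming kafka",
  "spark kafka parquet"))

val pairCounts = docs
  .map(_.split("\\s+").distinct.sorted)                    // unique, ordered terms per doc
  .flatMap(_.combinations(2).map(p => ((p(0), p(1)), 1)))  // all term pairs in the doc
  .reduceByKey(_ + _)                                      // count each pair across docs

pairCounts.collect().foreach(println)
```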

java spark udf error

2018-03-27 Thread 崔苗
Hi,

I defined a UDF to mark empty strings in Java like this:

public class MarkUnknown implements UDF2<String, String, String> {
  @Override
  public String call(String processor, String fillContent) {
    if (processor.trim().equals("")) {
      logger.info("find empty string");
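
The error message itself is cut off above, but a frequent pitfall with a UDF like this is a null input reaching .trim(). A null-safe sketch of the same idea in Scala, with illustrative names that are not from the thread:

```
import org.apache.spark.sql.functions.{udf, lit}

// Null-safe sketch: guard against null before calling trim, otherwise
// a null row triggers a NullPointerException inside the UDF.
val markUnknown = udf { (processor: String, fillContent: String) =>
  if (processor == null || processor.trim.isEmpty) fillContent
  else processor
}

// Usage (hypothetical column name):
// df.withColumn("processor", markUnknown($"processor", lit("unknown")))
```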

Queries with streaming sources must be executed with writeStream.start();;

2018-03-27 Thread Junfeng Chen
I am reading data from Kafka and want to save it to Parquet on HDFS with Structured Streaming. The data from Kafka is in JSON format. I tried to convert it to a Dataset with spark.read.json(). However, I get the exception:

> Queries with streaming sources must be executed with >
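
The usual cause of this error is calling the batch spark.read.json() on a streaming source; one common fix is to parse the Kafka value with from_json and write with writeStream. A minimal sketch, where the schema, topic, and paths are assumptions:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
import spark.implicits._

// Hypothetical schema for the JSON payload.
val schema = new StructType().add("id", LongType).add("name", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // assumption
  .option("subscribe", "mytopic")                 // assumption
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"))
  .select("data.*")

// Writing via writeStream.start() is what resolves the error.
parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///tmp/out")                // assumption
  .option("checkpointLocation", "hdfs:///tmp/ckpt") // assumption
  .start()
```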

unsubscribe

2018-03-27 Thread Mikhail Ibraheem

Re: What do I need to set to see the number of records and processing time for each batch in SPARK UI?

2018-03-27 Thread kant kodali
For example, in this blog post: looking at Figure 1 and Figure 2, I wonder what I need to do to see those graphs in Spark 2.3.0?

On Mon, Mar 26, 2018 at 7:10 AM, kant kodali
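
If the figures come from the old DStream streaming tab, that tab only appears for DStream jobs; for Structured Streaming in Spark 2.3 there is no equivalent tab, but a StreamingQueryListener can log the same per-batch numbers. A sketch, assuming an active SparkSession named spark:

```
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Logs records-per-batch and batch durations for every progress update.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    val p = e.progress
    println(s"batch=${p.batchId} rows=${p.numInputRows} durationMs=${p.durationMs}")
  }
})
```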

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-27 Thread Rohit Karlupia
Let me be more specific: with GC/CPU-aware task scheduling, the user doesn't have to worry about specifying cores carefully. So even if the user always specifies cores = 100 or 1024 for every executor, he will still not get an OOM (in the vast majority of cases). Internally, the scheduler will vary the number
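
As a toy illustration of that idea (not Sparklens code): the number of concurrent task slots can be derived from observed GC pressure rather than from a fixed cores setting.

```
// Toy sketch only: shrink concurrency when GC time dominates,
// grow it back while the executor is healthy.
def nextConcurrency(current: Int, gcTimeFraction: Double,
                    min: Int = 1, max: Int = 1024): Int = {
  if (gcTimeFraction > 0.3) math.max(min, current / 2)
  else if (gcTimeFraction < 0.1) math.min(max, current + 1)
  else current
}
```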

Re: java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-27 Thread naresh Goud
When storing as a Parquet file, I don't think it requires a header: option("header","true"). Give it a try by removing the header option and then try to read it. I haven't tried it; just a thought.

Thank you,
Naresh

On Tue, Mar 27, 2018 at 9:47 PM Mina Aslani wrote:
> Hi,

Re: java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-27 Thread Mina Aslani
Hi Naresh,

Thank you for the quick response, appreciate it.

After removing option("header","true") and trying df = spark.read.parquet("test.parquet"), reading the Parquet file now works. However, I would like to find a way to get the data in CSV/readable form; I still cannot save df as CSV, as it throws.

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-27 Thread Gourav Sengupta
Hi,

As per the documentation at https://spark.apache.org/docs/latest/configuration.html:

spark.local.dir (default: /tmp): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a
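
In practice that usually means pointing spark.local.dir at bigger disks; note that on YARN the node manager's local dirs take precedence for executor scratch space. A sketch with hypothetical paths:

```
import org.apache.spark.sql.SparkSession

// Hypothetical paths; must be set before the SparkContext starts.
// On YARN, yarn.nodemanager.local-dirs overrides spark.local.dir
// for executor scratch space (including the blockmgr folders).
val spark = SparkSession.builder
  .appName("local-dir-example")
  .config("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
  .getOrCreate()
```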

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Xiao Li
Hi, Eirik,

Yes, please open a JIRA.

Thanks,
Xiao

2018-03-23 8:03 GMT-07:00 Eirik Thorsnes:
> Hi all,
>
> I'm trying the new ORC native in Spark 2.3
> (org.apache.spark.sql.execution.datasources.orc).
>
> I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Dongjoon Hyun
Hi, Eirik.

For me, Spark 2.3 works correctly, like the following. Could you give us a reproducible example?

```
scala> sql("set spark.sql.orc.impl=native")
scala> sql("set spark.sql.orc.compression.codec=zlib")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala>
```
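
For anyone wanting to try the same round trip, a minimal spark-shell sketch along those lines (the path is made up):

```
// Write and read ORC with the native reader and zlib compression.
sql("set spark.sql.orc.impl=native")
sql("set spark.sql.orc.compression.codec=zlib")

spark.range(1000).selectExpr("id", "id * 2 as doubled")
  .write.mode("overwrite").orc("/tmp/orc_zlib_test") // hypothetical path

spark.read.orc("/tmp/orc_zlib_test").count()
```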

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Dongjoon Hyun
You may hit SPARK-23355 (convertMetastore should not ignore table properties). Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you check that too?

Bests,
Dongjoon.

On 2018/03/28 01:00:55, Dongjoon Hyun wrote:
> Hi, Eirik.
>
> For me, Spark 2.3
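
One quick way to test that hypothesis is to turn off the metastore conversion and read through the Hive serde path instead; if the error goes away, SPARK-23355 is a likely cause. A sketch with a hypothetical table name:

```
// If the BufferUnderflowException disappears with conversion off,
// the table-property issue from SPARK-23355 is a likely culprit.
sql("set spark.sql.hive.convertMetastoreOrc=false")
sql("select count(*) from my_orc_table").show() // hypothetical table
```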

java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-27 Thread Mina Aslani
Hi,

I am using PySpark. To transform my sample data and create the model, I use StringIndexer and OneHotEncoder. However, when I try to write the data as CSV using the command below:

df.coalesce(1).write.option("header","true").mode("overwrite").csv("output.csv")

I get an UnsupportedOperationException
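
The CSV writer cannot serialize struct/vector columns such as the OneHotEncoder output, so a common workaround is to stringify (or drop) those columns before writing. A Scala sketch of the idea, with an assumed "features" column name; the same approach works from PySpark:

```
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Turn a Vector column into a plain string so the CSV writer can handle it.
val vecToString = udf((v: Vector) => v.toString)

// "features" is a hypothetical column name for the encoder output.
val csvReady = df.withColumn("features", vecToString($"features"))

csvReady.coalesce(1)
  .write.option("header", "true")
  .mode("overwrite")
  .csv("output.csv")
```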

closure issues: wholeTextFiles

2018-03-27 Thread Gourav Sengupta
Hi,

I can understand facing closure issues while executing this code:

package spark
// this package is about understanding closures as mentioned in:
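
The code is cut off above, but the classic closure problem with wholeTextFiles (or any RDD transformation) is a lambda that captures its enclosing non-serializable object. A generic sketch, not the code from the thread:

```
import org.apache.spark.sql.SparkSession

class Processor {  // note: not Serializable
  val suffix = "!"

  def run(spark: SparkSession): Unit = {
    val rdd = spark.sparkContext.wholeTextFiles("/tmp/docs") // hypothetical path

    // BUG: referencing `suffix` drags the whole non-serializable
    // Processor into the closure -> "Task not serializable".
    // rdd.map { case (_, text) => text + suffix }.count()

    // Fix: copy the field to a local val so only the String is captured.
    val localSuffix = suffix
    rdd.map { case (_, text) => text + localSuffix }.count()
  }
}
```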

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-27 Thread Michael Shtelma
Hi,

the JIRA bug is here: https://issues.apache.org/jira/browse/SPARK-23799
I have also created a PR for the issue: https://github.com/apache/spark/pull/20913

With this fix, it is working for me really well.

Best,
Michael

On Sat, Mar 24, 2018 at 12:39 AM, Takeshi Yamamuro
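
For context, CBO only kicks in when it is enabled and the tables actually have statistics; a sketch of the usual setup, with a made-up table name:

```
// Enable the cost-based optimizer and collect statistics so the
// planner can estimate sizes and re-order joins.
sql("set spark.sql.cbo.enabled=true")
sql("set spark.sql.cbo.joinReorder.enabled=true")
sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS id, name")
```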

Re: Class cast exception while using Data Frames

2018-03-27 Thread Nikhil Goyal
You can run this in the spark-shell.

CODE:

case class InstanceData(service: String, metric: String, zone: String, source: String, time: Long, value: Double)

val seq = sc.parallelize(Seq(
  InstanceData("serviceA", "metricA", "zoneA", "sourceA", 1000L, 1.0),
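
The snippet is truncated; a guess (not necessarily Nikhil's exact code) at a completion that produces the map-with-struct-key schema from the first message, and the spot where the ClassCastException typically surfaces:

```
// Assumption: build a map column keyed by a tuple, which Spark
// represents as a struct key (matching the schema in the thread).
val withMap = seq.map(d => Map((d.service, d.metric) -> d.value)).toDF("myMap")
withMap.printSchema()

// Struct keys come back as generic Rows on the driver; reading them
// back as the original tuple is where a ClassCastException typically
// appears:
withMap.collect().foreach { row =>
  row.getMap[(String, String), Double](0).keys.foreach(k => println(k._1))
}
```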

Spark on K8s resource staging server timeout

2018-03-27 Thread Jenna Hoole
So I'm running into an issue with my resource staging server that's producing a stacktrace like Issue 342, but I don't think for the same reasons. What's happening is that every time after I start up a resource staging server, the first job

[Spark R] Proposal: Exposing RBackend in RRunner

2018-03-27 Thread Jeremy Liu
Spark Users,

In SparkR, RBackend is created in RRunner.main(). This in particular makes it difficult to control or use the RBackend. For my use case, I am looking to access the JVMObjectTracker that RBackend maintains for SparkR DataFrames. Analogously, PySpark starts a py4j.GatewayServer in

unsubscribe

2018-03-27 Thread Nicholas Sharkey

Re: unsubscribe

2018-03-27 Thread Romero, Saul
unsubscribe

On Tue, Mar 27, 2018 at 1:15 PM, Nicholas Sharkey wrote:

PySpark Structured Streaming : Writing to DB in Python and Foreach Sink.

2018-03-27 Thread Ramaswamy, Muthuraman
Hi All,

I am exploring PySpark Structured Streaming, and the documentation says the Foreach sink is not supported in Python and is available only in Java/Scala. Given the unavailability of this sink, what options are there for the following:

1. Will there be support for the Foreach sink in
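
Since the Foreach sink was Java/Scala-only at that point, one of the options being asked about is implementing the writer on the JVM side. A minimal Scala ForeachWriter skeleton, with the actual DB calls left as placeholders:

```
import org.apache.spark.sql.{ForeachWriter, Row}

// Skeleton ForeachWriter: open a connection per partition/epoch,
// write each row, close on completion or error.
class DbWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    // open the DB connection here (placeholder);
    // return true to process this partition's rows
    true
  }
  override def process(row: Row): Unit = {
    // write `row` to the database (placeholder)
    println(row)
  }
  override def close(errorOrNull: Throwable): Unit = {
    // close the connection and handle any error
  }
}

// Usage: df.writeStream.foreach(new DbWriter).start()
```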

unsubscribe

2018-03-27 Thread Andrei Balici
--
Andrei Balici
Student at the School of Computer Science, University of Manchester