Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
My apologies, it was a problem of our Hadoop cluster. When we tested the same code on another cluster (HDP-based), it worked without any problem.

```sh
## make a Shift-JIS text file
cat a.txt
8月データだけでやってみよう
nkf -W -s a.txt > b.txt
cat b.txt
87n%G!<%?$@$1$G$d$C$F$_$h$&
nkf -s -w b.txt
8月データだけでやってみよう
hdfs
```

Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
Dear Spark ML members, I ran into trouble using the "multiLine" option to load CSV data with Shift-JIS encoding. When option("multiLine", true) is specified, option("encoding", "encoding-name") just doesn't work anymore. In the CSVDataSource.scala file, I found that
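A minimal sketch of the reported behavior (the file path and session setup are illustrative, not from the original message):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sjis-csv")
  .getOrCreate()

// Works: the Shift-JIS file is decoded correctly.
val ok = spark.read
  .option("encoding", "Shift_JIS")
  .csv("/data/sjis.csv")

// Reported problem: once multiLine is enabled, the encoding option
// appears to be ignored and the file is read as if it were UTF-8.
val broken = spark.read
  .option("multiLine", true)
  .option("encoding", "Shift_JIS")
  .csv("/data/sjis.csv")
```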

Re: SPIP: Spark on Kubernetes

2017-08-15 Thread lucas.g...@gmail.com
From our perspective, we have invested heavily in Kubernetes as our cluster manager of choice. We also make quite heavy use of Spark. We've been experimenting with these builds (2.1 with PySpark enabled) quite heavily. Given that we've already 'paid the price' to operate Kubernetes in

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread purna pradeep
Ok, thanks. A few more questions: 1. When I looked into the documentation, it says onQueryProgress is not thread-safe. So would this method be the right place to refresh the cache? And is there no need to restart the query if I choose a listener? "The methods are not thread-safe as they may be called from different threads."
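For context, a minimal sketch of the listener approach being discussed (the cached DataFrame, its source path, and the refresh flag are placeholders, not from the thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("listener-refresh").getOrCreate()

// Hypothetical cached reference data.
val df = spark.read.parquet("/data/reference")
df.persist()

@volatile var refreshRequested = false // set elsewhere when the data changes

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Callbacks may run on a different thread than the driver's main
    // thread, so shared state must be published safely (hence @volatile).
    if (refreshRequested) {
      df.unpersist()
      df.persist() // re-materialized lazily by the next batch that uses it
      refreshRequested = false
    }
  }
})
```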

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread Tathagata Das
Both work. The asynchronous method with the listener will have less downtime; it's just that the first trigger/batch after the asynchronous unpersist+persist will probably take longer, as it has to reload the data. On Tue, Aug 15, 2017 at 2:29 PM, purna pradeep wrote:

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread purna pradeep
Thanks Tathagata Das. Actually I'm planning to do something like this:

```scala
activeQuery.stop()
// unpersist and persist cached data frame
df.unpersist()
// read the updated data; data size of df is around 100gb
df.persist()
activeQuery = startQuery()
```

The cached data frame is around 100gb, so

Hive Metastore open connections even after closing Spark context and session

2017-08-15 Thread Rohit Damkondwar
Hi. I am using Spark for querying Hive, followed by transformations. My Scala app creates multiple Spark applications. A new SparkContext (and SparkSession) is created only after closing the previous SparkSession and SparkContext. However, on stopping sc and spark, somehow connections to the Hive Metastore
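A hedged reconstruction of the lifecycle being described (app names and the query are placeholders; note also that whether getOrCreate returns a truly fresh session after stop() can depend on the Spark version):

```scala
import org.apache.spark.sql.SparkSession

// Sessions are created strictly one after another; each is stopped
// before the next is built, yet metastore connections reportedly remain.
for (i <- 1 to 3) {
  val spark = SparkSession.builder()
    .appName(s"hive-query-$i")
    .enableHiveSupport()
    .getOrCreate()
  spark.sql("SHOW TABLES").show()
  spark.stop()
}
```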

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread Tathagata Das
You can do something like this.

```scala
def startQuery(): StreamingQuery = {
  // create your streaming dataframes
  // start the query with the same checkpoint directory
}

// handle to the active query
var activeQuery: StreamingQuery = null

while (!stopped) {
  if (activeQuery == null) { // if
```
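A self-contained sketch of what this restart loop might look like in full; the source, sink, paths, and restart condition below are assumptions, not part of the original reply:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("restartable-query")
  .getOrCreate()

def startQuery(): StreamingQuery = {
  // create your streaming dataframes (replayable file source, hypothetical path)
  val lines = spark.readStream.text("/data/in")
  // start the query with the same checkpoint directory each time
  lines.writeStream
    .format("parquet")
    .option("path", "/data/out")                // hypothetical
    .option("checkpointLocation", "/data/ckpt") // hypothetical
    .start()
}

@volatile var stopped = false          // flipped by a shutdown hook, say
@volatile var restartRequested = false // flipped when the cached data changes

var activeQuery: StreamingQuery = null
while (!stopped) {
  if (activeQuery == null) {
    // no active query: start one
    activeQuery = startQuery()
  } else if (restartRequested) {
    // stop, refresh the cached data, then restart from the same checkpoint
    activeQuery.stop()
    restartRequested = false
    activeQuery = startQuery()
  }
  Thread.sleep(1000)
}
```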

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread purna pradeep
Thanks Michael. I guess my question was a little confusing; let me try again. I would like to restart a streaming query programmatically, based on a condition, while my streaming application is running. Why do I want to do this? I want to refresh a cached data frame based on a condition, and the best

Re: Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread Michael Armbrust
See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing Though I think that this currently doesn't work with the console sink. On Tue, Aug 15, 2017 at 9:40 AM, purna pradeep wrote: > Hi,
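The linked section boils down to giving the sink a checkpoint location so a restarted query resumes where the previous run stopped. A minimal sketch (aggDF is an assumed streaming DataFrame; paths are hypothetical):

```scala
// Restarting a query with the same checkpointLocation resumes from the
// offsets and state recorded there (not supported by the console sink).
aggDF.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/ckpt")
  .start()
```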

Re: Spark 2.2 streaming with append mode: empty output

2017-08-15 Thread Ashwin Raju
The input dataset has multiple days' worth of data, so I thought the watermark should have been crossed. To debug, I changed the query to the code below. My expectation was that, since I am doing 1-day windows with late arrivals permitted for 1 second, when it sees records for the next day it would
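For reference, the shape of such a query (the column name and the events DataFrame are assumed; the actual code from the message is not included in this digest):

```scala
import org.apache.spark.sql.functions.{col, window}

// 1-day event-time windows, tolerating events up to 1 second late.
// In append mode a window's count is emitted only after the watermark
// passes the end of that window.
val counts = events
  .withWatermark("eventTime", "1 second")
  .groupBy(window(col("eventTime"), "1 day"))
  .count()

val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```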

How to force Spark Kafka Direct to start from the latest offset when the lag is huge in Kafka 10?

2017-08-15 Thread SRK
Hi, how can I force Spark Kafka Direct to start from the latest offset when the lag is huge in Kafka 10? It seems to start processing from the last offset stored for the group id. One way to do this is to change the group id, but that would mean that each time we need to process the job from the
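A hedged sketch of the relevant configuration for the 0.10 direct stream (broker, topic, group id, and ssc, an existing StreamingContext, are all assumed):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group",
  // "latest" only applies when no committed offset exists for the group,
  // which is why the question mentions changing the group id. To jump to
  // specific offsets regardless, pass an explicit offsets map to Subscribe.
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic"), kafkaParams)
)
```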

Restart streaming query spark 2.1 structured streaming

2017-08-15 Thread purna pradeep
Hi,

I'm trying to restart a streaming query to refresh a cached data frame.

Where and how should I restart the streaming query?

```scala
val sparkSes = SparkSession
  .builder
  .config("spark.master", "local")
  .appName("StreamingCahcePoc")
  .getOrCreate()
import
```

Re: DAGScheduler - two runtimes

2017-08-15 Thread 周康
The ResultStage time is the time taken by your job's last stage. "Job 13 finished: reduce at VertexRDDImpl.scala:90, took 0.035546 s" is the total time your job took. 2017-08-14 18:58 GMT+08:00 Kaepke, Marc : > Hi everyone, > > I'm a Spark newbie and have one question: > What is the

How to calculate CPU time for a Spark job?

2017-08-15 Thread 钟文波
How can I calculate the CPU time for a Spark job? Is there an interface that can be called directly, like the Hadoop MapReduce framework, which provides "CPU time spent (ms)" in its Counters? Thanks!
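One possibility, sketched under the assumption of Spark 2.1+ (where TaskMetrics exposes executorCpuTime in nanoseconds), is to sum task CPU time with a SparkListener; sc is an existing SparkContext:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates executor CPU time across all finished tasks. Events are
// delivered on a single listener-bus thread, so a plain var suffices;
// @volatile makes reads from the driver thread safe.
class CpuTimeListener extends SparkListener {
  @volatile var totalCpuNanos = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) totalCpuNanos += metrics.executorCpuTime
  }
}

val listener = new CpuTimeListener
sc.addSparkListener(listener)
// ... run the job ...
println(f"CPU time: ${listener.totalCpuNanos / 1e9}%.3f s")
```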