Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Gourav Sengupta
Hi Shiyuan, I do not know whether I am right, but I would prefer to avoid expressions in Spark such as: df = <> Regards, Gourav Sengupta On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan wrote: > Here is the pretty print of the physical plan which reveals some details > about what

How to use disk instead of just InMemoryRelation when use JDBC datasource in SPARKSQL?

2018-04-10 Thread Louis Hust
We want to extract data from MySQL and run the calculations in Spark SQL. The SQL explain output looks like the one below. == Parsed Logical Plan == > 'Sort ['revenue DESC NULLS LAST], true > +- 'Aggregate ['n_name], ['n_name, 'SUM(('l_extendedprice * (1 - > 'l_discount))) AS revenue#329] > +- 'Filter ('c_custkey =
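
A minimal Scala sketch of one way to keep such a JDBC-sourced table from being cached purely in memory, assuming a hypothetical MySQL connection (URL, table and credentials below are placeholders, not taken from the thread): persist the DataFrame with a disk-backed storage level before running the aggregation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("jdbc-disk-cache").getOrCreate()

// Hypothetical JDBC read; connection details are placeholders.
val lineitem = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/tpch")
  .option("dbtable", "lineitem")
  .option("user", "user")
  .option("password", "password")
  .load()

// DISK_ONLY spills the cached relation to disk instead of holding it in
// executor memory; Dataset.cache() would use MEMORY_AND_DISK by default.
lineitem.persist(StorageLevel.DISK_ONLY)
lineitem.createOrReplaceTempView("lineitem")
```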

Not able to access Pyspark into Jupyter notebook

2018-04-10 Thread @Nandan@
Hi Users, Currently I am trying to use Apache Spark 2.2.0 from a Jupyter notebook but am not able to get it working. I am using Ubuntu 17.10. I am able to use pyspark from the command line as well as spark-shell. Please give some ideas. Thanks. Nandan Priyadarshi

Specifying a custom Partitioner on RDD creation in Spark 2

2018-04-10 Thread Colin Williams
Hi, I'm currently creating RDDs using a pattern like the following: val rdd: RDD[String] = session.sparkContext.parallelize(longkeys).flatMap( key => { logInfo(s"job at key: ${key}") Source.fromBytes(S3Util.getBytes(S3Util.getClient(region, S3Util.getCredentialsProvider("INSTANCE", "")),
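
For the custom-partitioner part of the question, a partitioner can only be attached to a pair RDD via partitionBy rather than at parallelize time. A minimal Scala sketch, with a made-up key scheme and partition count (both placeholders, not from the original code):

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical partitioner: route each record by a prefix of its key.
class KeyPrefixPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val prefix = key.toString.take(4)
    math.abs(prefix.hashCode % numPartitions)
  }
}

val spark = SparkSession.builder().appName("custom-partitioner").getOrCreate()
val keys = Seq("2018-04-10/a", "2018-04-10/b", "2018-04-11/a")

// Keep the key next to the fetched payload so partitionBy can be applied.
val rdd: RDD[(String, String)] = spark.sparkContext
  .parallelize(keys)
  .map(key => (key, s"payload-for-$key"))
  .partitionBy(new KeyPrefixPartitioner(8))
```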

Re: cache OS memory and spark usage of it

2018-04-10 Thread yncxcw
Hi Raúl, First, most of the OS memory cache is the page cache, which the OS uses for caching recent read/write I/O. I think OS memory caching should be discussed from two different perspectives. From the perspective of

Re: Apache Kafka / Spark Integration - Exception - The server disconnected before a response was received.

2018-04-10 Thread M Singh
Hi Daniel: Yes, I am working with Spark Structured Streaming. The exception is emanating from the Spark Kafka connector, but I was wondering if someone has encountered this issue and resolved it with some configuration parameter in the Kafka client/broker or OS settings. Thanks Mans On Tuesday, April
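
Not a known fix for this particular exception, but one relevant knob: the Structured Streaming Kafka source forwards any option prefixed with `kafka.` to the underlying Kafka client, so client-side timeouts can be tuned from Spark. A hedged Scala sketch with placeholder broker, topic and values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-options").getOrCreate()

// Options prefixed with "kafka." are passed straight to the Kafka client;
// the broker address, topic and timeout value here are placeholders.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("kafka.request.timeout.ms", "60000")
  .load()
```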

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
Here is the pretty print of the physical plan which reveals some details about what causes the bug (see the lines highlighted in bold): withColumnRenamed() fails to update the dependency graph correctly: 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121 in operator
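
For readers without the full plan, here is a rough Scala sketch of the kind of select/groupBy/join/withColumnRenamed sequence the thread describes; the column names and data are placeholders, and this may or may not reproduce the exact "resolved attribute(s) missing" failure on a given Spark build:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rename-join-repro").getOrCreate()
import spark.implicits._

val df = Seq((1L, "x", 0.5), (1L, "y", 0.7), (2L, "x", 0.1))
  .toDF("ID", "LABEL", "score")

// Derive an aggregate from df, rename the aggregate column, then join it
// back onto df and select the renamed column.
val counts = df.groupBy("ID").count().withColumnRenamed("count", "kk")
val joined = df.join(counts, Seq("ID")).select("ID", "LABEL", "score", "kk")
joined.explain(true)
```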

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
The Spark warning about Row instead of Dict is not the culprit. The problem still persists after I use Row instead of Dict to generate the dataframe. Here is the explain() output for the reassignment of df that Gourav suggested running; they look the same except that the serial numbers

[Structured Streaming] why events size is 0 when use mapGroupsWithState

2018-04-10 Thread 王晨伊
Hi. My understanding is that the user-provided function (updatePageviewDatawithState) of `mapGroupsWithState` is only called when a group has values in the batch. But in my program, why does a group key show up with an events size of 0? My code is here:
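
The original code is truncated in the archive, so the following is only an illustrative Scala sketch (event and state types are made up, not the poster's): one documented way a group can arrive with zero events is a configured timeout, in which case mapGroupsWithState invokes the function for the timed-out key with an empty iterator.

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Placeholder event and state types, not the thread's pageview types.
case class Event(key: String, value: Long)
case class Agg(count: Long)

def updateState(key: String, events: Iterator[Event], state: GroupState[Agg]): Agg = {
  if (state.hasTimedOut) {
    // Timeout fired: the function is called with an empty events iterator.
    val last = state.getOption.getOrElse(Agg(0L))
    state.remove()
    last
  } else {
    val updated = Agg(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    state.setTimeoutDuration("10 minutes")
    updated
  }
}

// Wiring (assuming a Dataset[Event] named `events` and processing-time timeouts):
// events.groupByKey(_.key)
//       .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateState)
```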

Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Marcelo Vanzin
This is the problem: > :/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar Seems like some code is confusing things when mixing OSes. It's using the Windows separator when building a command line to be run on a Linux host. On Tue, Apr

Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
The previous example was a bad paste (I tried a lot of variants, so sorry for the wrong paste). PS C:\WINDOWS\system32> spark-submit --master k8s://https://ip:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --executor-memory 1G --conf

package reload in dapply SparkR

2018-04-10 Thread Deepansh Goyal
I have a native R model and am doing structured streaming with it. Data comes from Kafka and goes into the dapply method, where my model does prediction and the data is written to the sink. Problem: my model requires the caret package. Inside the dapply function, for every streaming job the caret package is loaded again, which

cache OS memory and spark usage of it

2018-04-10 Thread José Raúl Pérez Rodríguez
Hi, When I issue a "free -m" command on a host, I see a lot of memory used for OS cache; however, Spark Streaming is not able to request that memory for its own use, and the execution fails because it is not able to launch executors. What I understand of the OS memory cache (the one in

Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Yinan Li
The example jar path should be local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar. On Tue, Apr 10, 2018 at 1:34 AM, Dmitry wrote: > Hello spent a lot of time to find what I did wrong , but not found. > I have a minikube WIndows based cluster ( Hyper V as

Re: Testing spark streaming action

2018-04-10 Thread Jörn Franke
Run it as part of integration testing; you can still use ScalaTest but with a different subfolder (it or integrationtest) instead of test. Within the integration test you create a local Spark instance that also has accumulators. > On 10. Apr 2018, at 17:35, Guillermo Ortiz
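
A minimal Scala sketch of that approach (test-framework wiring and the DStream under test are omitted; names are placeholders): create a real local Spark context inside the integration test so the accumulator actually exists before the streaming code touches it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

// Local Spark instance for the integration test; "local[2]" leaves a core
// free for a receiver if a streaming source is attached later.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("streaming-integration-test")
  .getOrCreate()

// A real accumulator registered on the running context, so the code under
// test no longer dereferences a null accumulator.
val errorCount: LongAccumulator = spark.sparkContext.longAccumulator("errorCount")

// ... run the streaming logic under test, then assert on errorCount.value
```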

Testing spark streaming action

2018-04-10 Thread Guillermo Ortiz
I have a unit test in Spark Streaming which has an input parameter: a DStream[String]. Inside the code I want to update a LongAccumulator. When I execute the test I get a NullPointerException because the accumulator doesn't exist. Is there any way to test this? My accumulator is updated in

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Thanks guys. @Filipp Zhinkin Yes, we might have a couple of string columns which will have 15 million+ unique values that need to be mapped to indices. @Nick Pentreath We are on 2.0.2, though I will check it out. Is it better from a hashing-collision perspective, or can it handle a large volume of data as

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0 which is designed to handle this use case in a more natural way than HashingTF (and handles multiple columns at once). On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin wrote: > Hi Shahab, > > do you actually need to have a few
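
A short Scala sketch of the FeatureHasher approach mentioned above (Spark 2.3.0+); the column names and feature-vector size are placeholders:

```scala
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("feature-hasher").getOrCreate()
import spark.implicits._

val df = Seq(("u1", "item42"), ("u2", "item7")).toDF("user_id", "item_id")

// Hashes several high-cardinality string columns into one fixed-size vector,
// so no label-to-index mapping has to be held on the driver. A larger
// numFeatures lowers the collision rate at the cost of a wider vector.
val hasher = new FeatureHasher()
  .setInputCols("user_id", "item_id")
  .setOutputCol("features")
  .setNumFeatures(1 << 20)

val hashed = hasher.transform(df)
```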

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
Hi Shahab, do you actually need to have a few columns with such a huge number of categories whose value depends on the original value's frequency? If not, then you may use the value's hash code as a category, or combine all columns into a single vector using HashingTF. Regards, Filipp. On Tue, Apr 10,

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does StringIndexer keep all the label-to-index mappings in the memory of the driver machine? It seems so, unless I am missing something. What if our data that needs to be

Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
Hello, I spent a lot of time trying to find what I did wrong, but found nothing. I have a minikube Windows-based cluster (Hyper-V as the hypervisor) and am trying to run examples against Spark 2.3. I tried several Docker image builds: * several builds that I built myself * andrusha/spark-k8s:2.3.0-hadoop2.7 from

Is DLib available for Spark?

2018-04-10 Thread Aakash Basu
Hi team, Is the DLib package available for use through Spark? Thanks, Aakash.

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-10 Thread Junfeng Chen
But I still have one question: I find that the number of tasks in the stage is 3. Where does this 3 come from? How do I increase the parallelism? Regards, Junfeng Chen On Tue, Apr 10, 2018 at 11:31 AM, Junfeng Chen wrote: > Yeah, I have increased the executor number and executor cores, and it
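
A small Scala sketch of two common knobs for this (the partition counts below are placeholders, not recommendations): raise the default shuffle parallelism in the configuration, or repartition the data explicitly before the heavy stage.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-example")
  // Defaults used when shuffling RDDs and DataFrames, respectively.
  .config("spark.default.parallelism", "48")
  .config("spark.sql.shuffle.partitions", "48")
  .getOrCreate()

// Or repartition explicitly; here on a stand-in input DataFrame.
val df = spark.range(0, 1000000).toDF("id")
val repartitioned = df.repartition(48)
```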