How many PySpark Windows users are there?

2016-09-17 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows[1] and it seems several problems are being identified from time to time. It seems it is not easy to automate Spark's tests in Scala on Windows, because I think we should introduce proper change detection to run only related tests rather than

Re: Spark metrics when running with YARN?

2016-09-17 Thread Vladimir Tretyakov
Hello Saisai Shao. Thx for the reminder, I know which components Spark has in which mode. But Mich Talebzadeh wrote above that the URL on 4040 will work regardless of the mode the user uses, which is why I hoped it would also be true for the metrics URL (since they are on the same port). I think you are right, better st

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Jörn Franke
In Tableau you can use the in-memory facilities of the Tableau server. As said, Apache Ignite could be one way. You can also use it to make Hive tables in-memory. While reducing IO can make sense, I do not think you will see so much difference in production systems (at least not 20x). If the

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does it fail only with write.json()? I just wonder whether write.text, csv or another API fails as well, or whether it is a JSON-specific issue. Also, does it work with small data? I want to make sure whether this happens only on large data. Thanks! 2016
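
One way to run the narrowing-down experiment suggested in this reply is to try the write on a small sample and with a second format (a sketch; `ranked` and the output paths are illustrative names, not from the thread):

```scala
// `ranked` stands in for the large DataFrame the original poster built.
val sample = ranked.limit(100)

// Does JSON fail even on small data, or only at full scale?
sample.write.json("/tmp/sample-json")

// Does a different format succeed where JSON does not?
sample.write.csv("/tmp/sample-csv")
```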

Re: Recovered state for updateStateByKey and incremental streams processing

2016-09-17 Thread manasdebashiskar
If you are using Spark 1.6 onwards there is a better solution for you. It is called mapWithState. mapWithState takes a state function and an initial RDD. 1) When you start your program for the first time, OR the version changes and the new code can't use the checkpoint, the initialRDD comes in handy. 2) For t
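
A minimal sketch of the mapWithState pattern described above, assuming a word-count-style stream; the checkpoint path, socket source, and seed keys are illustrative, not from the thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")  // illustrative path

    // Seed state for a fresh start, e.g. when a code change invalidates the checkpoint.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("key1", 1L), ("key2", 5L)))

    // State function: fold each new count into the running total per key.
    val stateSpec = StateSpec.function(
      (key: String, value: Option[Long], state: State[Long]) => {
        val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
        state.update(sum)
        (key, sum)
      }
    ).initialState(initialRDD)

    val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
    lines.map(word => (word, 1L)).mapWithState(stateSpec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```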

Re: Spark metrics when running with YARN?

2016-09-17 Thread Saisai Shao
Hi Vladimir, I think you mixed up the cluster manager and the Spark application running on it. The master and workers are two components of the Standalone cluster manager; the YARN counterparts are the RM and NM. The URL you listed above only works for the standalone master and workers. It would be more clear if yo

NoSuchFieldError: INSTANCE when specifying a user-defined httpclient jar

2016-09-17 Thread sagarcasual .
Hello, I am using the Spark 1.6.1 distribution over a Cloudera CDH 5.7.0 cluster. When I run my fat jar (Spark jar) and it makes a call to HttpClient, it gets the classic NoSuchFieldError: INSTANCE. This usually happens when the httpclient in the classpath is older than the anticipated httpclient
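
One common remedy for this kind of classpath shadowing (hedged: whether it helps depends on how the cluster lays out its jars) is to ask Spark to prefer the application's jars over the cluster-provided ones. The class name and jar name below are illustrative:

```shell
# Put the fat jar's classes ahead of the cluster's on both
# driver and executor classpaths (Spark 1.3+, marked experimental).
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp \
  my-fat-jar.jar
```

Another route, when only one library conflicts, is shading the httpclient package inside the fat jar at build time.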

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd. As I thought, Apache Ignite is a data fabric much like Oracle Coherence cache or Hazelcast. The use case differs between an in-memory database (IMDB) and a data fabric. The build that I am dealing with has a 'database centric' view of its data (i.e. it accesses its data using Spark

take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback. I have a fairly large, multi-hour job that operates over about 5TB of data. It builds it out into a ranked category index of about 25000 categories sorted by rank, descending. I want to write this to a file but it's not actually writing an

DataFrame defined within conditional IF ELSE statement

2016-09-17 Thread Mich Talebzadeh
In Spark 2 this gives me an error in a conditional IF ELSE statement. I recall seeing the same in standard SQL. I am doing a test for different sources (text file, ORC or Parquet) to be read in, depending on the value of var option. I wrote this: import org.apache.spark.sql.functions._ import java.util
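
In Scala the usual fix for this class of error is to make the whole conditional an expression that yields the DataFrame, rather than declaring a separate val inside each branch, where it goes out of scope. A sketch; the paths and the `option` variable's values are illustrative, not from the thread:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("source-switch").getOrCreate()
val option = "parquet"  // illustrative: "text", "orc" or "parquet"

// The match (or an if/else chain) is an expression: every branch yields a
// DataFrame, so `df` is bound once and visible after the conditional.
val df: DataFrame = option match {
  case "text" => spark.read.text("/data/input.txt")
  case "orc"  => spark.read.orc("/data/input.orc")
  case _      => spark.read.parquet("/data/input.parquet")
}

df.printSchema()
```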

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd. I will have a look. Regards Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* U

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
Hi Mich, Have you looked at Apache Ignite? https://apacheignite-fs.readme.io/docs. This looks like it may be what you're looking for: http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin HTH. -Todd On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh wrote: > Hi

Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Hi, I saw similar issues when I was working on Oracle with Tableau as the dashboard. Currently I have a batch layer that gets streaming data from source -> Kafka -> Flume -> HDFS. It is stored on HDFS as text files, and a cron process syncs the Hive table, with the external table built on the d

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread Mich Talebzadeh
Hi CLOSE_WAIT! According to this link - CLOSE_WAIT - indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. So this essentially means that this is a state where the socket is waitin

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi, Yes. I am able to connect to Hive from a simple Java program running in the cluster. When using spark-submit I faced the issue. The output of the command is given below: $> netstat -alnp | grep 10001 (Not all processes could be identified, non-owned process info will not be shown, you would have to

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi, @Deepak I have used a separate user keytab (not the Hadoop services keytab) and am able to connect to Hive via a simple Java program. I am able to connect to Hive from spark-shell as well. However when I submit a Spark job using this same keytab, I see the issue. Does the cache have a role to play here? In t

Re: Cannot control bucket file number if it was specified

2016-09-17 Thread Mich Talebzadeh
OK. You have an external table in Hive on S3 with partitions and buckets, say: PARTITIONED BY (year int, month string) CLUSTERED BY (prod_id) INTO 256 BUCKETS STORED AS ORC. Within each partition you have buckets on prod_id equally spread across 256 hash partitions/buckets. A bucket is the has

Re: Cannot control bucket file number if it was specified

2016-09-17 Thread Qiang Li
I want to run a job to load existing data from one S3 bucket, process it, then store it in another bucket with partitioning and bucketing (data format conversion from tsv to parquet with gzip). So the source data and the results are both in S3; the difference is the tools I used to process the data. First I process d

Re: Cannot control bucket file number if it was specified

2016-09-17 Thread Mich Talebzadeh
It is difficult to guess what is happening with your data. First, when you say you use Spark to generate test data, is it selected randomly and then stored in a Hive/etc table? HTH

Cannot control bucket file number if it was specified

2016-09-17 Thread Qiang Li
Hi, I use Spark to generate data, then we use Hive/Pig/Presto/Spark to analyze the data, but I found that even though I used bucketBy and sortBy with a bucket number in Spark, the result files generated by Spark are always far more than the bucket number under each partition, so Presto can not recognize the b
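
The behaviour described above follows from how Spark writes bucketed tables: each task writes its own file per bucket it sees, so the file count is roughly tasks × buckets. Repartitioning on the bucketing column with one shuffle partition per bucket before the write is one way to bring the count down toward the bucket number. A sketch under that assumption; paths, table and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-sketch").getOrCreate()
val df = spark.read.option("sep", "\t").csv("/data/source-tsv")  // illustrative tsv input

// 256 shuffle partitions keyed on the bucketing column means each bucket's
// rows are concentrated in one task, so far fewer files per bucket.
df.repartition(256, df("prod_id"))
  .write
  .partitionBy("year", "month")
  .bucketBy(256, "prod_id")
  .sortBy("prod_id")
  .format("parquet")
  .option("compression", "gzip")
  .saveAsTable("events_bucketed")  // bucketBy requires saveAsTable in Spark 2.x
```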

Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
Tried several times, it is as slow as before. I will let Spark output data to HDFS, then sync the data to S3 as a temporary solution. Thank you. On Sat, Sep 17, 2016 at 10:43 AM, Takeshi Yamamuro wrote: > Hi, > > Have you seen the previous thread? > https://www.mail-archive.com/user@spark.apache.or
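
The workaround described above can be sketched with distcp; the namenode address, paths, and bucket name are illustrative, and s3a credential configuration is omitted:

```shell
# Step 1: the Spark job writes to HDFS (fast, rename is cheap).
# Step 2: bulk-copy the finished output to S3 in one pass.
hadoop distcp hdfs://namenode:8020/output/run1 s3a://my-bucket/output/run1
```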