Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does it fail only with write.json()? I wonder whether write.text, csv, or other APIs fail as well, or whether it is a JSON-specific issue. Also, does it work with small data? I want to make sure this happens only on large data. Thanks!
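
A minimal diagnostic sketch along the lines of the questions above; `rankedCategories` and the output paths are hypothetical stand-ins:

    // Hypothetical DataFrame and paths, to isolate the failure mode.
    val sample = rankedCategories.limit(100)   // small slice: does data size matter?

    sample.write.mode("overwrite").json("hdfs:///tmp/debug/json")  // JSON-specific?
    sample.write.mode("overwrite").csv("hdfs:///tmp/debug/csv")    // or do all writers fail?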

Re: Recovered state for updateStateByKey and incremental streams processing

2016-09-17 Thread manasdebashiskar
If you are using Spark 1.6 onwards there is a better solution for you: mapWithState. mapWithState takes a state function and an initial RDD. 1) When you start your program for the first time, or the version changes and the new code can't use the checkpoint, the initial RDD comes in handy. 2) For
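
A minimal sketch of that pattern, assuming Spark Streaming 1.6+; the socket source, checkpoint directory and snapshot path are hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("mapWithState-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")            // mapWithState requires checkpointing

    // Initial state rebuilt from durable storage (hypothetical path) -- this is
    // what saves you when a code change invalidates the old checkpoint.
    val initialRDD = ssc.sparkContext
      .textFile("/data/state-snapshot")
      .map { line => val Array(k, v) = line.split(","); (k, v.toLong) }

    // State function: fold each new value into the running count for its key.
    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val sum = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(sum)
      (key, sum)
    }

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    val counts = words.mapWithState(
      StateSpec.function(updateCount _).initialState(initialRDD))

    counts.print()
    ssc.start()
    ssc.awaitTermination()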

Re: Spark metrics when running with YARN?

2016-09-17 Thread Saisai Shao
Hi Vladimir, I think you mixed up the cluster manager and the Spark application running on it. The master and workers are two components of the Standalone cluster manager; the YARN counterparts are the RM and NM. The URLs you listed above only work for the standalone master and workers. It would be more clear if

NoSuchFieldError: INSTANCE when specifying a user-defined httpclient jar

2016-09-17 Thread sagarcasual .
Hello, I am using the Spark 1.6.1 distribution on a Cloudera CDH 5.7.0 cluster. When I run my fat jar (Spark jar) and it makes a call to HttpClient, it gets the classic NoSuchFieldError: INSTANCE, which usually happens when the httpclient on the classpath is older than anticipated
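
One common fix is to shade the httpclient packages inside the fat jar so the older version bundled with the cluster no longer wins. A minimal sketch, assuming the fat jar is built with sbt-assembly (the relocation prefix is arbitrary):

    // build.sbt -- relocate httpclient inside the fat jar so the older copy
    // shipped with CDH no longer shadows it.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.apache.http.**" -> "shaded.org.apache.http.@1").inAll
    )

An alternative is setting spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true, though those flags are marked experimental.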

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd. As I thought, Apache Ignite is a data fabric much like Oracle Coherence cache or Hazelcast. The use cases differ between an in-memory database (IMDB) and a data fabric. The build that I am dealing with has a 'database centric' view of its data (i.e. it accesses its data using Spark

take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback. I have a fairly large, multi-hour job that operates over about 5 TB of data. It builds the data out into a ranked category index of about 25,000 categories, sorted by rank descending. I want to write this to a file, but it's not actually writing

DataFrame defined within conditional IF ELSE statement

2016-09-17 Thread Mich Talebzadeh
In Spark 2 this gives me an error in a conditional IF ELSE statement. I recall seeing the same in standard SQL. I am testing different sources (text file, ORC or Parquet) to be read in depending on the value of a var option. I wrote this: import org.apache.spark.sql.functions._ import
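
If the error is the DataFrame going out of scope after being declared inside a branch, the usual Scala idiom is to treat if/else as an expression and bind its result to a single val. A minimal sketch, with hypothetical paths and switch values:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("source-switch").getOrCreate()
    var option = "parquet"   // hypothetical switch: "text", "orc" or "parquet"

    // if/else is an expression in Scala: bind its result to one val instead of
    // declaring the DataFrame inside each branch, where it goes out of scope
    // as soon as the branch ends.
    val df: DataFrame =
      if (option == "text")     spark.read.textFile("/data/input.txt").toDF("line")
      else if (option == "orc") spark.read.orc("/data/input.orc")
      else                      spark.read.parquet("/data/input.parquet")

    df.show(5)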

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Thanks Todd. I will have a look. Regards, Dr Mich Talebzadeh

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Todd Nist
Hi Mich, Have you looked at Apache Ignite? https://apacheignite-fs.readme.io/docs. This looks like something that may be what you're looking for: http://apacheignite.gridgain.org/docs/data-analysis-with-apache-zeppelin HTH. -Todd On Sat, Sep 17, 2016 at 12:53 PM, Mich Talebzadeh

Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-17 Thread Mich Talebzadeh
Hi, I saw similar issues when I was working on Oracle with Tableau as the dashboard. Currently I have a batch layer that gets streaming data from source -> Kafka -> Flume -> HDFS. It is stored on HDFS as text files, and a cron process sinks a Hive table with the external table built on the
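
For reference, Spark's in-memory cache does not track changes to the underlying HDFS files on its own; a cached table has to be refreshed explicitly after the cron job lands new files. A minimal sketch, assuming a hypothetical Hive external table mich.events and Spark 2.x with Hive support:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("cache-over-hdfs")
      .enableHiveSupport()
      .getOrCreate()

    // Pin the table in Spark's in-memory columnar cache.
    spark.catalog.cacheTable("mich.events")
    spark.table("mich.events").count()        // first action materializes the cache

    // The cache is not fused with HDFS: after new files land, the cached copy
    // is stale until it is explicitly refreshed.
    spark.catalog.refreshTable("mich.events")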

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread Mich Talebzadeh
Hi, CLOSE_WAIT! According to this link, CLOSE_WAIT indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. So this essentially means that this is a state where the socket is

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi, Yes. I am able to connect to Hive from a simple Java program running in the cluster. When using spark-submit I face the issue. The output of the command is given below: $> netstat -alnp | grep 10001 (Not all processes could be identified, non-owned process info will not be shown, you would have to

RE: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-17 Thread anupama . gangadhar
Hi @Deepak, I have used a separate user keytab (not the Hadoop services keytab) and am able to connect to Hive via a simple Java program. I am able to connect to Hive from spark-shell as well. However, when I submit a Spark job using this same keytab, I see the issue. Does a cache have a role to play here? In
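
One way to check whether the Kerberos ticket cache is involved is to log in explicitly from the keytab inside the job and confirm which identity the JVM ends up with. A minimal sketch using the Hadoop UGI API; the principal and keytab path are hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)

    // Log in with the same user keytab that works for the plain Java program.
    UserGroupInformation.loginUserFromKeytab(
      "anupama@EXAMPLE.COM",                   // hypothetical principal
      "/home/anupama/user.keytab")             // hypothetical keytab path

    println(UserGroupInformation.getLoginUser) // confirm which identity the JVM holds

For YARN cluster mode, passing --principal and --keytab to spark-submit lets Spark handle the login and ticket renewal itself.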

Re: Cannot control bucket files number if it was specified

2016-09-17 Thread Mich Talebzadeh
OK. You have an external table in Hive on S3 with partitions and buckets, say: PARTITIONED BY (year int, month string) CLUSTERED BY (prod_id) INTO 256 BUCKETS STORED AS ORC (full DDL sketched below). Within each partition you have buckets on prod_id equally spread across 256 hash partitions/buckets. A bucket is the
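
A minimal sketch of that DDL run through Spark SQL; the table name, columns and S3 location are hypothetical:

    // Hypothetical table name, columns and S3 location.
    spark.sql("""
      CREATE EXTERNAL TABLE sales (
        prod_id BIGINT,
        amount  DOUBLE
      )
      PARTITIONED BY (year INT, month STRING)
      CLUSTERED BY (prod_id) INTO 256 BUCKETS
      STORED AS ORC
      LOCATION 's3a://my-bucket/sales/'
      """)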

Re: Cannot control bucket files number if it was specified

2016-09-17 Thread Qiang Li
I want to run a job to load existing data from one S3 bucket, process it, then store it to another bucket with partitioning and bucketing (data format conversion from TSV to Parquet with gzip). So the source data and the results are both in S3; what differs is the tools I used to process the data. First I process

Re: Cannot control bucket files number if it was specified

2016-09-17 Thread Mich Talebzadeh
It is difficult to guess what is happening with your data. First, when you say you use Spark to generate test data, is it selected randomly and then stored in a Hive/etc. table? HTH, Dr Mich Talebzadeh

Cannot control bucket files number if it was specified

2016-09-17 Thread Qiang Li
Hi, I use Spark to generate data, then we use Hive/Pig/Presto/Spark to analyze the data, but I found that even when I used bucketBy and sortBy with a bucket number in Spark, the result files generated by Spark always far outnumber the bucket number under each partition, and then Presto cannot recognize the
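
For reference, a minimal sketch of the write pattern in question, assuming Spark 2.x; paths, table and column names are hypothetical. Each write task emits one file per bucket it holds data for, so the file count can approach tasks x buckets; repartitioning on the bucket column first is one way to keep it near the bucket count:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    val df = spark.read
      .option("sep", "\t")
      .csv("s3a://source-bucket/input/")             // hypothetical TSV source
      .toDF("prod_id", "year", "month", "amount")

    df.repartition(256, col("prod_id"))              // cluster each bucket's rows together first
      .write
      .partitionBy("year", "month")
      .bucketBy(256, "prod_id")
      .sortBy("prod_id")
      .option("compression", "gzip")
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .saveAsTable("converted_sales")                // bucketBy only works with saveAsTable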

Re: Spark output data to S3 is very slow

2016-09-17 Thread Qiang Li
I tried several times; it is as slow as before. I will let Spark output data to HDFS, then sync the data to S3 as a temporary solution. Thank you. On Sat, Sep 17, 2016 at 10:43 AM, Takeshi Yamamuro wrote: > Hi, > > Have you seen the previous thread? >
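
A minimal sketch of that workaround; the paths are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("hdfs-staging").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/input")       // hypothetical input

    // Stage the output on HDFS, where the commit (a directory rename) is cheap;
    // on S3 a rename-based commit copies every object, which is a common cause
    // of slow direct writes.
    df.write.mode("overwrite").parquet("hdfs:///tmp/staging/output")

    // Then sync to S3 out-of-band, e.g.:
    //   hadoop distcp hdfs:///tmp/staging/output s3a://dest-bucket/output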