Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-06-02 Thread Timur Shenkao
Did you use RDDs or DataFrames? What is the Spark version? On Mon, May 28, 2018 at 10:32 PM, Saulo Sobreiro wrote: > Hi, > I ran a few more tests and found that even with a lot more operations on > the Scala side, Python is outperformed... > > Dataset Stream duration: ~3 minutes (csv formatted
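For reference, a minimal sketch of the DataFrame write path with the DataStax spark-cassandra-connector; the keyspace, table, and input path are placeholders, not details from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cassandra-write-sketch").getOrCreate()

    // Hypothetical input; replace with your real batch/stream source.
    val df = spark.read.option("header", "true").csv("/path/to/input.csv")

    // DataFrame writes let the connector batch rows by partition key, which
    // usually beats per-row RDD writes; "ks" / "events" are placeholder names.
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .mode("append")
      .save()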

Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Timur Shenkao
"Caused by: org.apache.spark.SparkException: Task not serializable" -- that's the answer :) What are you trying to save? Is it empty or None / null? On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova < liana.napalk...@eurecat.org> wrote: > Hello, > > Has anybody faced the following problem in
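A sketch of the most common cause of that exception and one fix, under the assumption that a non-serializable helper is being captured by a closure (the Codec class and paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    class Codec { def clean(s: String): String = s.trim } // not Serializable

    val codec = new Codec
    // This would fail with "Task not serializable": the closure drags `codec`
    // (and everything it references) to the executors.
    // spark.read.textFile("in.txt").map(codec.clean _).write.parquet("out.parquet")

    // Fix: create the helper inside the closure (or make it Serializable).
    spark.read.textFile("in.txt")
      .mapPartitions { it => val c = new Codec; it.map(c.clean) }
      .write.parquet("out.parquet")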

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Timur Shenkao
Spark Dataset / DataFrame has foreachPartition() as well. Its implementation is much more efficient than the RDD's. There are tons of code snippets, say
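A minimal sketch of the usual foreachPartition pattern on a Dataset: open one expensive resource per partition instead of per row. The JDBC URL, credentials, and SQL are placeholders:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val ds = Seq((1L, "a"), (2L, "b")).toDS()

    ds.foreachPartition { rows: Iterator[(Long, String)] =>
      // One connection per partition, reused for every row in it.
      val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO events(id, value) VALUES (?, ?)")
      try {
        rows.foreach { case (id, value) =>
          stmt.setLong(1, id); stmt.setString(2, value); stmt.executeUpdate()
        }
      } finally { stmt.close(); conn.close() }
    }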

Re: SANSA 0.3 (Scalable Semantic Analytics Stack) Released

2017-12-18 Thread Timur Shenkao
Hello, Thank you for the very interesting work! The questions are: 1) Where do you store final or intermediate results? Parquet, JanusGraph, Cassandra? 2) Is there integration with Spark GraphFrames? Sincerely yours, Timur On Mon, Dec 18, 2017 at 9:21 AM, Hajira Jabeen

Re: Job spark blocked and runs indefinitely

2017-10-26 Thread Timur Shenkao
HBase has its own Java API and Scala API: you can use whichever you like. Btw, which Spark-HBase connector do you use? Cloudera's or Hortonworks'? On Wed, Oct 11, 2017 at 3:01 PM, Amine CHERIFI < cherifimohamedam...@gmail.com> wrote: > it seems that the job blocks when we call newAPIHadoopRDD to get
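For the plain HBase-API route mentioned above, a sketch of reading via newAPIHadoopRDD; it assumes hbase-site.xml (ZooKeeper quorum etc.) is on the classpath, and "my_table" is a placeholder:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-read-sketch"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val rdd = sc.newAPIHadoopRDD(conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // If this call hangs as in the report, the usual suspect is ZooKeeper /
    // region server connectivity rather than the Spark code itself.
    println(rdd.count())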

Re: strange behavior of spark 2.1.0

2017-04-02 Thread Timur Shenkao
Hello, It's difficult to tell without details. I believe one of the executors dies because of an OOM or some runtime exception (an unforeseen dirty data row). Less probable is a GC stop-the-world pause when the incoming message rate increases drastically. On Saturday, April 1, 2017, Jiang Jacky
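A sketch of settings that would help confirm the OOM / GC hypothesis; the memory size is a placeholder to tune for the cluster, and the JVM flags are standard Java 8 GC-logging options:

    import org.apache.spark.SparkConf

    // The GC flags make executor logs show whether stop-the-world pauses or
    // heap exhaustion precede the failure; heap dumps land on the executor node.
    val conf = new SparkConf()
      .setAppName("streaming-diagnostics")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError")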

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Timur Shenkao
Hello, I'm not sure that's your reason, but check this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html On Tue, Feb 14, 2017 at 9:25 PM, anatva wrote: > Hi, > I am reading an ORC

Re: is dataframe thread safe?

2017-02-12 Thread Timur Shenkao
Hello, I suspect that your need isn't parallel execution but parallel data access. In that case, use Alluxio or Ignite. Or, more exotically, one Spark job writes to Kafka and the other ones read from Kafka. Sincerely yours, Timur On Sun, Feb 12, 2017 at 2:30 PM, Mendelson, Assaf
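A sketch of the producing side of that decoupling idea, using the plain Kafka producer API from foreachPartition; the broker address, topic, and table name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Job A publishes its result; other jobs subscribe independently.
    val df = spark.table("shared_results") // hypothetical table

    df.toJSON.foreachPartition { rows: Iterator[String] =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try rows.foreach(r => producer.send(new ProducerRecord("shared_topic", r)))
      finally producer.close()
    }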

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-11 Thread Timur Shenkao
Hello, 1) Are you sure that your data is "clean"? No unexpected missing values? No strings in an unusual encoding? No additional or missing columns? 2) How long does your job run? What about garbage collector parameters? Have you checked what happens with jconsole / jvisualvm? Sincerely yours,
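A quick "is the data clean?" check along the lines of point 1: count nulls per column of the feature DataFrame before feeding it into LSH (the input path is a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, when}

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("/path/to/features") // hypothetical path

    // count() skips nulls, so counting a when(...) that is null unless the
    // column is null yields the number of missing values per column.
    val nullCounts = df.columns.map { c =>
      count(when(col(c).isNull, c)).alias(c)
    }
    df.select(nullCounts: _*).show()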

Re: [Spark SQL] Task failed while writing rows

2016-12-25 Thread Timur Shenkao
Hi, I've just read your message. Have you resolved the problem ? If not, what is the contents of /etc/hosts ? On Mon, Dec 19, 2016 at 10:09 PM, Michael Stratton < michael.strat...@komodohealth.com> wrote: > I don't think the issue is an empty partition, but it may not hurt to try > a

Re: Spark Streaming with Kafka

2016-12-11 Thread Timur Shenkao
Hi, The usual general questions are: -- what is your Spark version? -- what is your Kafka version? -- do you use the "standard" Kafka consumer or are you trying to implement something custom (your own multi-threaded consumer)? The freshest docs
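For comparison, a minimal "standard consumer" sketch with the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-check"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "check-group",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

    stream.map(_.value).count().print() // sanity check: records per batch
    ssc.start(); ssc.awaitTermination()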

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Timur Shenkao
...https://github.com/apache/spark/pull/16080. >> Thanks, >> Yin >> On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao <t...@timshenkao.su> >> wrote: >>> Hi! >>> Do you have a real Hive installation? >>> Have you built S

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Timur Shenkao
Hi! Do you have a real Hive installation? Have you built Spark 2.1 & Spark 2.0 with Hive support (-Phive -Phive-thriftserver)? It seems that you use Spark's "default" Hive 1.2.1. Your metadata is stored in a local Derby DB which is visible to that concrete Spark installation but not to the others. On Wed,
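A quick sketch for checking which catalog a session actually talks to; the metastore URI is a placeholder for a real shared metastore:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-check")
      .config("hive.metastore.uris", "thrift://metastore-host:9083")
      .enableHiveSupport()
      .getOrCreate()

    // "hive" means real Hive catalog support; "in-memory" means the session
    // falls back to a local (e.g. embedded Derby) catalog.
    println(spark.conf.get("spark.sql.catalogImplementation"))
    spark.sql("SHOW DATABASES").show()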

Re: Possible DR solution

2016-11-12 Thread Timur Shenkao
Hi guys! 1) Though it's quite interesting, I believe that this discussion is not about Spark :) 2) If you are interested, there is a solution by Cloudera https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html (requires that the *source cluster* has Cloudera

Re: YARN - Pyspark

2016-09-30 Thread Timur Shenkao
It's not weird behavior. Did you run the job in cluster mode? I suspect your driver died / finished / stopped after 12 hours but your job continued. That's possible since you didn't output anything to the console on the driver node. Quite a long time ago, when I first tried Spark Streaming, I launched PySpark
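A sketch of the idea (in Scala rather than PySpark): make the driver emit something every batch so the driver log shows exactly when it stops; the source host and port are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("liveness"), Seconds(60))

    val lines = ssc.socketTextStream("source-host", 9999) // placeholder source

    lines.foreachRDD { (rdd, time) =>
      // Runs on the driver: if these lines stop appearing in the driver log,
      // the driver is gone even though executors may keep working for a while.
      println(s"[$time] driver alive, batch size = ${rdd.count()}")
    }

    ssc.start(); ssc.awaitTermination()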

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-24 Thread Timur Shenkao
Hello, everybody! Maybe it's not the reason for your problem, but I've noticed this line in your comments: *java version "1.8.0_51"* It's strongly advised to use Java 1.8.0_66+; I even use Java 1.8.0_101. On Tue, Sep 20, 2016 at 1:09 AM, janardhan shetty wrote: > Yes

Fwd: Populating tables using hive and spark

2016-08-26 Thread Timur Shenkao
Hello! I just wonder: do you (both of you) use the same user for Hive & Spark, or different ones? Do you use Kerberized Hadoop? On Mon, Aug 22, 2016 at 2:20 PM, Mich Talebzadeh wrote: > Ok This is my test > > 1) create table in Hive and populate it with two rows > >

Re: NoClassDefFoundError with ZonedDateTime

2016-07-24 Thread Timur Shenkao
Which version of Java 8 do you use? AFAIK, it's recommended to use Java 1.8.0_66+. On Fri, Jul 22, 2016 at 8:49 PM, Jacek Laskowski wrote: > On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu wrote: > > You can use this command (assuming log aggregation is turned

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Timur Shenkao
Hi, guys! As far as I remember, Spark does not use all the peculiarities and optimizations of ORC. Moreover, the ability to read ORC files appeared in Spark relatively recently. So, despite the "victorious" results announced in http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ,
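One concrete example of an ORC optimization Spark did not apply by default in those versions: predicate pushdown had to be switched on explicitly. A sketch, with placeholder paths and column names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-join").getOrCreate()
    spark.conf.set("spark.sql.orc.filterPushdown", "true") // off by default pre-2.4

    val orcDf = spark.read.orc("/warehouse/compressed_orc_table")          // hypothetical path
    val txtDf = spark.read.option("sep", "\t").csv("/warehouse/text_table") // hypothetical path

    // With pushdown enabled, the filter can be evaluated inside the ORC reader
    // instead of after full decompression of every stripe.
    val filtered = orcDf.filter("event_date >= '2016-01-01'")
    filtered.join(txtDf, filtered("id") === txtDf("_c0")).count()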

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Timur Shenkao
Hi, Thanks a lot for such an interesting comparison. But important questions remain to be addressed: 1) How do you make two versions of Spark live together on the same cluster (library clashes, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well. So, we may have 3 Spark

Re: This works to filter transactions older than certain months

2016-03-28 Thread Timur Shenkao
bq. CSV data is stored in an underlying table in Hive (actually created and populated as an ORC table by Spark) How is that possible? On Mon, Mar 28, 2016 at 1:50 AM, Mich Talebzadeh wrote: > Hi, > > A while back I was looking for functional programming to filter out >

Re: spark 1.6.0 connect to hive metastore

2016-03-12 Thread Timur Shenkao
I had a similar issue with CDH 5.5.3, not only with Spark 1.6 but with beeline as well. I resolved it by installing & running a hiveserver2 role instance on the same server where the metastore is. On Tue, Feb 9, 2016 at 10:58 PM, Koert Kuipers
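A Spark 1.6-style sketch for checking the metastore connection explicitly, instead of relying on whichever hive-site.xml happens to be picked up; the host and port are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("metastore-check"))
    val hiveContext = new HiveContext(sc)

    // Point directly at the remote metastore thrift service.
    hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")

    hiveContext.sql("SHOW DATABASES").show() // hangs or errors here => connectivity problem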

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread Timur Shenkao
Hi, I have suffered from Hive Streaming and transactions enough, so I can share my experience with you. 1) It's not a problem of Spark. It happens because of the "peculiarities" / bugs of Hive Streaming. Hive Streaming and transactions are very raw technologies. If you look at the Hive JIRA, you'll see