Spark ExternalTable doesn't recognize subdir

2016-10-19 Thread lk_spark
hi, all. My issue is that every day I will receive some JSON data files, and I want to convert them to Parquet files and save them to HDFS. The folder will look like this: /my_table_base_floder /my_table_base_floder/day_2 /my_table_base_floder/day_3 where the parquet files
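
A minimal Scala sketch of one way to make Spark see the daily sub-directories (paths and the partition column name are placeholders, not from the thread): write each day as a day=... partition directory so Spark's partition discovery exposes every day under the base path as one table.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    object DailyJsonToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()
        val day = args(0)                               // e.g. "day_2" (hypothetical)
        spark.read.json(s"/incoming/json/$day")         // hypothetical input location
          .withColumn("day", lit(day))                  // add the partition column explicitly
          .write
          .mode("append")
          .partitionBy("day")                           // writes .../day=day_2/part-*.parquet
          .parquet("/tables/my_table")                  // base path; read it back as one table
        spark.stop()
      }
    }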

partitionBy produces wrong number of tasks

2016-10-19 Thread Daniel Haviv
Hi, I have a case where I use partitionBy to write my DF using a calculated column, so it looks something like this: val df = spark.sql("select *, from_unixtime(ts, 'MMddH') partition_key from mytable") df.write.partitionBy("partition_key").orc("/partitioned_table") df is 8 partitions in
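
A hedged sketch of one common workaround (an assumption, not necessarily this thread's resolution): repartition by the derived column before writing, so each partition_key value is produced by a single task rather than by every one of the 8 input partitions.

    import org.apache.spark.sql.functions.col

    val df = spark.sql(
      "select *, from_unixtime(ts, 'MMddH') partition_key from mytable")

    df.repartition(col("partition_key"))   // one shuffle, then one writer task per key value
      .write
      .partitionBy("partition_key")
      .orc("/partitioned_table")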

Re: [Spark][issue]Writing Hive Partitioned table

2016-10-19 Thread ayan guha
Hi Group, Sorry to rekindle this thread. Using Spark 1.6.0 on CDH 5.7. Any idea? Best, Ayan On Fri, Oct 7, 2016 at 5:08 PM, Mich Talebzadeh wrote: > Hi Ayan, > > It depends on the version of Spark you are using. > > Have you tried updating stats in Hive? > > ANALYZE

Re: Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
Hello Michael, Thank you for looking into this query. In my case there seems to be an issue when I union a Parquet file read from disk with another dataframe that I construct in memory. The only difference I see is the containsNull = true. In fact, I do not see any errors with union on the

Re: How does Spark determine in-memory partition count when reading Parquet ~files?

2016-10-19 Thread Michael Armbrust
In Spark 2.0 we bin-pack small files into a single task to avoid overloading the scheduler. If you want a specific number of partitions, you should repartition. If you want to disable this optimization, you can set the file open cost very high: spark.sql.files.openCostInBytes On Tue, Oct 18, 2016
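
A short sketch of the two knobs mentioned above, assuming spark is a SparkSession; the concrete values are illustrative assumptions, not recommendations from the thread.

    // Make small files look expensive so they are no longer packed into one task,
    // effectively disabling the bin-packing optimization.
    spark.conf.set("spark.sql.files.openCostInBytes", 512L * 1024 * 1024)

    // Or simply ask for the partition count you actually want after reading.
    val df = spark.read.parquet("/path/to/parquet")   // hypothetical path
    val repartitioned = df.repartition(200)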

Re: Dataframe schema...

2016-10-19 Thread Michael Armbrust
Setting nullable = false is just a hint to the optimizer that it's impossible for there to be a null value in this column, so that it can avoid generating code for null-checks. When in doubt, we set nullable = true since it is always safer to check. Why in particular are you trying to change the nullability of the
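
A commonly used workaround for forcing a different nullability, offered as a sketch under the assumption that rebuilding the DataFrame is acceptable (it is not an official knob for this):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Rebuild the DataFrame against a copy of its schema with every field's
    // nullable flag set to the requested value.
    def withNullability(df: DataFrame, nullable: Boolean): DataFrame = {
      val schema = StructType(df.schema.map(_.copy(nullable = nullable)))
      df.sparkSession.createDataFrame(df.rdd, schema)
    }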

Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
Hello there, I am trying to understand how and when a DataFrame (or Dataset) sets nullable = true vs. false on a schema. Here is my observation from some sample code I tried... scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2",
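
A small self-contained illustration of the behaviour being asked about (the schema output is indicative, written from memory of the encoder rules): primitive columns such as Int and Double come out as nullable = false, while String, a reference type, comes out as nullable = true.

    import spark.implicits._

    val df = spark
      .createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d)))
      .toDF("col1", "col2", "col3")

    df.printSchema()
    // root
    //  |-- col1: integer (nullable = false)
    //  |-- col2: string (nullable = true)
    //  |-- col3: double (nullable = false)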

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Wangjianfei
yeah, the design is mainly because of hdfs. -- From: "Jakob Odersky"; Sent: 2016-10-20 4:46 To:

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-19 Thread Yang
In my case the training data is fairly small (100k training samples), though the feature count is roughly 100k populated out of 10 million possible features. In this case it does not help me to distribute the training process, since the data size is so small. I just need a good core solver to train the

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Jakob Odersky
Another reason I could imagine is that files are often read from HDFS, which by default uses line terminators to separate records. It is possible to implement your own HDFS delimiter finder; however, for arbitrary JSON data, finding that delimiter would require stateful parsing of the file and
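
For completeness, a hedged sketch of the usual workaround when a file holds one large, pretty-printed JSON document rather than one object per line: read whole files and feed the strings to the JSON reader. Each file then becomes a single unsplittable record, so this only suits reasonably small files; the path is a placeholder.

    import org.apache.spark.rdd.RDD

    val wholeDocs: RDD[String] = spark.sparkContext
      .wholeTextFiles("/data/multiline-json/*.json")   // hypothetical path
      .map { case (_, content) => content }            // drop the file name, keep the document

    val df = spark.read.json(wholeDocs)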

Re: Code review / sqlContext Scope

2016-10-19 Thread Ajay Chander
Can someone please shed some light on this? I wrote the below code in Scala 2.10.5; can someone please tell me if this is the right way of doing it? import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.{DataFrame, Row} import org.apache.spark.sql.functions._ import

ApacheCon is now less than a month away!

2016-10-19 Thread Rich Bowen
Dear Apache Enthusiast, ApacheCon Sevilla is now less than a month out, and we need your help getting the word out. Please tell your colleagues, your friends, and members of related technical communities, about this event. Rates go up November 3rd, so register today! ApacheCon, and Apache Big

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-19 Thread Cody Koeninger
60 seconds for a batch is above the default settings in kafka related to heartbeat timeouts, so that might be related. Have you tried tweaking session.timeout.ms, heartbeat.interval.ms, or related configs? On Wed, Oct 19, 2016 at 12:22 PM, Srikanth wrote: > Bringing this
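
Purely illustrative values for the consumer settings Cody mentions (assumptions, not numbers from the thread); the general Kafka constraint is heartbeat.interval.ms < session.timeout.ms < request.timeout.ms, with the session timeout comfortably above the 60-second batch interval.

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"     -> "broker1:9092,broker2:9092",   // hypothetical brokers
      "key.deserializer"      -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer"    -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "group.id"              -> "app2",
      "heartbeat.interval.ms" -> "20000",
      "session.timeout.ms"    -> "90000",
      "request.timeout.ms"    -> "120000"
    )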

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-19 Thread Srikanth
Bringing this thread back as I'm seeing this exception on a production Kafka cluster. I have two Spark Streaming apps reading the same topic. App1 has a batch interval of 2 secs and app2 has 60 secs. Both apps are running on the same cluster on similar hardware. I see this exception only in app2 and

Re: How to make Mesos Cluster Dispatcher of Spark 1.6.1 load my config files?

2016-10-19 Thread Michael Gummelt
See https://issues.apache.org/jira/browse/SPARK-13258 for an explanation and workaround. On Wed, Oct 19, 2016 at 1:35 AM, Chanh Le wrote: > Thank you Daniel, > Actually I tried this before but this way is still not flexible way if you > are running multiple jobs at the time

Re: Deep learning libraries for scala

2016-10-19 Thread janardhan shetty
Agreed. But as it states, deeper integration with Scala is yet to be developed. Any thoughts on how to use TensorFlow with Scala? We would need to write wrappers, I think. On Oct 19, 2016 7:56 AM, "Benjamin Kim" wrote: > On that note, here is an article that Databricks made

Re: Deep learning libraries for scala

2016-10-19 Thread Benjamin Kim
On that note, here is an article that Databricks made regarding using Tensorflow in conjunction with Spark. https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Cheers, Ben > On Oct 19, 2016, at 3:09 AM, Gourav Sengupta >

Re: LDA and Maximum Iterations

2016-10-19 Thread Richard Garris
Hi Frank, Two suggestions: 1. I would recommend caching the corpus prior to running LDA. 2. If you are using EM, I would tweak the sample size using the setMiniBatchFraction parameter to decrease the sample per iteration. -Richard On Tue, Sep 20, 2016 at 10:27 AM, Frank Zhang <
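
A sketch along the lines of the two suggestions, with placeholder hyper-parameters; note that in MLlib setMiniBatchFraction is a setting on OnlineLDAOptimizer, so the second point applies when the online optimizer is used.

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // corpus: RDD[(Long, Vector)] of (document id, term-count vector) is assumed to exist.
    def trainLda(corpus: RDD[(Long, Vector)]) = {
      corpus.cache()                               // suggestion 1: cache before iterating
      new LDA()
        .setK(20)
        .setMaxIterations(100)
        .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))  // suggestion 2
        .run(corpus)
    }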

Re: K-Mean retrieving Cluster Members

2016-10-19 Thread Robin East
or alternatively this should work (assuming parsedData is an RDD[Vector]): clusters.predict(parsedData) > On 18 Oct 2016, at 00:35, Reth RM wrote: > > I think I got it > > parsedData.foreach( > new VoidFunction() { > @Override
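
Building on that, a short sketch of actually retrieving the members of each cluster (assuming clusters is a KMeansModel and parsedData an RDD[Vector], as in the thread): zip each point with its predicted cluster id and group by that id.

    val assignments = clusters.predict(parsedData)        // RDD[Int], one id per input point
    val membersByCluster = parsedData
      .zip(assignments)                                   // RDD[(Vector, Int)], same ordering
      .map { case (point, clusterId) => (clusterId, point) }
      .groupByKey()                                       // RDD[(Int, Iterable[Vector])]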

Joins of typed datasets

2016-10-19 Thread daunnc
Hi! I work with the new Spark 2 Datasets API. PR: https://github.com/geotrellis/geotrellis/pull/1675 The idea is to use Dataset[(K, V)] and, for example, to join by a key of type K. The first problem was that there are no Encoders for custom types (not Products), so the workaround was to use Kryo:
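
A hedged sketch of the Kryo workaround referred to above: a generic implicit Encoder for arbitrary (non-Product) classes. The trade-off is that Kryo-encoded values are stored as a single binary column, so Catalyst cannot see individual key fields.

    import scala.reflect.ClassTag
    import org.apache.spark.sql.{Encoder, Encoders}

    // Any class with a ClassTag gets a (binary) Kryo-backed encoder.
    implicit def kryoEncoder[A](implicit ct: ClassTag[A]): Encoder[A] =
      Encoders.kryo[A](ct)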

how to see spark class variable values on variable explorer of spyder for python?

2016-10-19 Thread muhammet pakyürek
Is there any way to see Spark class variable values in the variable explorer of Spyder for Python?

Re: spark with kerberos

2016-10-19 Thread Steve Loughran
On 19 Oct 2016, at 00:18, Michael Segel wrote: (Sorry, sent reply via wrong account.) Steve, Kinda hijacking the thread, but I promise it's still on topic to the OP's issue. ;-) Usually you will end up having a local Kerberos set up

Re: About Error while reading large JSON file in Spark

2016-10-19 Thread Steve Loughran
On 18 Oct 2016, at 10:58, Chetan Khatri wrote: Dear Xi Shen, Thank you for getting back to the question. The approach I am following is as below: I have MSSQL Server as the enterprise data lake. 1. run Java jobs and generate JSON files,

Re: Deep learning libraries for scala

2016-10-19 Thread Gourav Sengupta
While using deep learning you might want to stay as close to TensorFlow as possible. There is very little translation loss, you get access to stable, scalable and tested libraries from the best brains in the industry, and as far as Scala goes, it helps a lot to think about using the language as a

Re: How to make Mesos Cluster Dispatcher of Spark 1.6.1 load my config files?

2016-10-19 Thread Chanh Le
Thank you Daniel, Actually I tried this before, but this way is still not a flexible way if you are running multiple jobs at the same time that may have different dependencies between each job configuration, so I gave up. Another simple solution is to set the command below as a service, and that is what I am using. >

Re: Substitute Certain Rows a data Frame using SparkR

2016-10-19 Thread Felix Cheung
It's a bit less concise but this works: > a <- as.DataFrame(cars) > head(a) speed dist 1 4 2 2 4 10 3 7 4 4 7 22 5 8 16 6 9 10 > b <- withColumn(a, "speed", ifelse(a$speed > 15, a$speed, 3)) > head(b) speed dist 1 3 2 2 3 10 3 3 4 4 3 22 5 3 16 6 3 10 I think your example could be something

Re: How to make Mesos Cluster Dispatcher of Spark 1.6.1 load my config files?

2016-10-19 Thread Daniel Carroza
Hi Chanh, I found a workaround that works for me: http://stackoverflow.com/questions/29552799/spark-unable-to-find-jdbc-driver/40114125#40114125 Regards, Daniel On Thu, 6 Oct 2016 at 6:26, Chanh Le () wrote: > Hi everyone, > I have the same config in both mode and I

Re: question about the new Dataset API

2016-10-19 Thread Yang
I even added a fake groupByKey on the entire DataSet: scala> a_ds.groupByKey(k=>1).agg(typed.count[(Long,Long)](_._1)).show +-----+------------------------+ |value|TypedCount(scala.Tuple2)| +-----+------------------------+ | 1| 2| +-----+------------------------+ On
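
For reference, a typed aggregate over the whole Dataset does not need the placeholder grouping: Dataset.select accepts a TypedColumn directly. A sketch using the a_ds from the original message in this thread:

    import org.apache.spark.sql.expressions.scalalang.typed

    // Whole-dataset typed count, no groupByKey needed.
    val total = a_ds.select(typed.count[(Long, Long)](_._1))
    total.show()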

question about the new Dataset API

2016-10-19 Thread Yang
scala> val a = sc.parallelize(Array((1,2),(3,4))) a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at parallelize at <console>:38 scala> val a_ds = hc.di.createDataFrame(a).as[(Long,Long)] a_ds: org.apache.spark.sql.Dataset[(Long, Long)] = [_1: int, _2: int] scala>

Equivalent Parquet File Repartitioning Benefits for Join/Shuffle?

2016-10-19 Thread adam kramer
Hello All, I’m trying to improve join efficiency within (self-join) and across data sets loaded from different parquet files primarily due to a multi-stage data ingestion environment. Are there specific benefits to shuffling efficiency (e.g. no network transmission) if the parquet files are
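
One relevant Spark 2.0 feature, offered as a hedged sketch rather than an answer from the thread: bucketing both data sets by the join key when writing, so later equi-joins on that key can avoid a full shuffle. Table names, column names, bucket counts, and the dfA/dfB DataFrames are placeholders.

    // Write both sides bucketed and sorted on the join key (bucketBy requires saveAsTable).
    dfA.write.bucketBy(64, "id").sortBy("id").saveAsTable("table_a")
    dfB.write.bucketBy(64, "id").sortBy("id").saveAsTable("table_b")

    // With matching bucket counts, this equi-join can skip the exchange step.
    val joined = spark.table("table_a").join(spark.table("table_b"), "id")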