Spark streaming - S3 textFileStream - How to get the current file name?

2016-04-01 Thread Natu Lauchande
Hi, I am using Spark Streaming with the input strategy of watching for files in S3 directories, using the textFileStream method on the streaming context. The filename contains information relevant to my pipeline manipulation; I wonder if there is a more robust way to get this name other than

partitioned parquet tables

2016-04-01 Thread Imran Akbar
Hi, I'm reading in a CSV file, and I would like to write it back as a permanent table, but with partitioning by year, etc. Currently I do this: from pyspark.sql import HiveContext sqlContext = HiveContext(sc) df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',

Re: Problem with jackson lib running on spark

2016-04-01 Thread Ted Yu
Thanks for sharing the workaround. Probably send a PR on tranquilizer github :-) On Fri, Apr 1, 2016 at 12:50 PM, Marcelo Oikawa wrote: > Hi, list. > > Just to close the thread. Unfortunately, I didn't solve the jackson lib > problem but I did a workaround that

Introducing Spark User Group in Korea & Question on creating non-software goods (stickers)

2016-04-01 Thread Kevin (Sangwoo) Kim
Hi all! I'm Kevin, one of the contributors to Spark, and I'm organizing the Spark User Group in Korea. We have 2,500 members in the community, and it's growing even faster today. https://www.facebook.com/groups/sparkkoreauser/ -

spark-shell with different username

2016-04-01 Thread Matt Tenenbaum
Hello all — tl;dr: I’m having an issue running spark-shell from my laptop (or other non-cluster-affiliated machine), and I think the issue boils down to usernames. Can I convince spark/scala that I’m someone other than $USER? A bit of background: our cluster is CDH 5.4.8, installed with Cloudera

Re: [Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Michael Armbrust
What error are you getting? Here is an example . External types are documented here:
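
As a minimal illustration (the column name "values" is an assumption): Spark exposes an ArrayType(DoubleType) column to Scala UDFs as Seq[Double].

    import org.apache.spark.sql.functions.udf

    // arrays in a DataFrame arrive on the Scala side as Seq[Double]
    val arraySum = udf((xs: Seq[Double]) => xs.sum)

    df.select(arraySum(df("values")).as("sum_of_values"))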

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
So I think ramdisk is a simple way to try. Besides, I think Reynold's suggestion is quite valid: with such a high-end machine, putting everything in memory might not improve the performance as much as assumed, since the bottleneck will be shifted to things like memory bandwidth, NUMA, CPU efficiency

Re: Spark streaming issue

2016-04-01 Thread Mich Talebzadeh
Ok I managed to make this work. All I am interested in is receiving messages from the topic every minute. No filtering yet, just the full text. import _root_.kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka.KafkaUtils

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Yes we see it on final write. Our preference is to eliminate this. On Fri, Apr 1, 2016, 7:25 PM Saisai Shao wrote: > Hi Michael, shuffle data (mapper output) have to be materialized into disk > finally, no matter how large memory you have, it is the design purpose of >

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
Hi Michael, shuffle data (mapper output) has to be materialized to disk eventually, no matter how much memory you have; it is the design purpose of Spark. In your scenario, since you have a big memory, shuffle spill should not happen frequently; most of the disk IO you see might be final shuffle

Re: [SQL] A bug with withColumn?

2016-04-01 Thread Jacek Laskowski
On Thu, Mar 31, 2016 at 5:47 PM, Jacek Laskowski wrote: > It means that it's not only possible to rename a column using > withColumnRenamed, but also replace the content of a column (in one > shot) using withColumn with an existing column name. I can live with > that :) Hi,
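
A tiny illustration of that behaviour, with a made-up column name:

    // "price" already exists in df, so this overwrites it in one shot
    // instead of adding a second column
    val updated = df.withColumn("price", df("price") * 1.1)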

Re: Spark Metrics : Why is the Sink class declared private[spark] ?

2016-04-01 Thread Saisai Shao
There's a JIRA (https://issues.apache.org/jira/browse/SPARK-14151) about it, please take a look. Thanks Saisai On Sat, Apr 2, 2016 at 6:48 AM, Walid Lezzar wrote: > Hi, > > I looked into the spark code at how spark report metrics using the > MetricsSystem class. I've seen

Spark Metrics : Why is the Sink class declared private[spark] ?

2016-04-01 Thread Walid Lezzar
Hi, I looked into the Spark code at how Spark reports metrics using the MetricsSystem class. I've seen that the Spark MetricsSystem class, when instantiated, parses the metrics.properties file, tries to find the sink class names and loads them dynamically. It would be great to implement my own

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Shuffling a 1tb set of keys and values (aka sort by key) results in about 500gb of io to disk if compression is enabled. Is there any way to eliminate shuffling causing io? On Fri, Apr 1, 2016, 6:32 PM Reynold Xin wrote: > Michael - I'm not sure if you actually read my

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
It's spark.local.dir. On Fri, Apr 1, 2016 at 3:37 PM, Yong Zhang wrote: > Is there a configuration in the Spark of location of "shuffle spilling"? I > didn't recall ever see that one. Can you share it out? > > It will be good for a test writing to RAM Disk if that
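
A possible spark-defaults.conf sketch (the mount point is an assumption; on YARN/Mesos the cluster manager's SPARK_LOCAL_DIRS / LOCAL_DIRS settings take precedence over this property):

    # point shuffle/spill scratch space at a tmpfs/ramdisk mount
    spark.local.dir    /mnt/ramdisk/spark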

RE: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Yong Zhang
Is there a configuration in Spark for the location of "shuffle spilling"? I don't recall ever seeing that one. Can you share it? It will be good for a test writing to RAM Disk if that configuration is available. Thanks Yong From: r...@databricks.com Date: Fri, 1 Apr 2016 15:32:23 -0700

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Shishir Anshuman
When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0", *it should include spark-core_2.10-1.6.1-tests.jar. Why do I need to use the jar file explicitly? And how do I use the jars for compiling with *sbt* and running the tests on spark? On Sat, Apr 2, 2016 at 3:46 AM, Ted Yu

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
Michael - I'm not sure if you actually read my email, but spill has nothing to do with the shuffle files on disk. It was for the partitioning (i.e. sorting) process. If that flag is off, Spark will just run out of memory when data doesn't fit in memory. On Fri, Apr 1, 2016 at 3:28 PM, Michael

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
RAMdisk is a fine interim step, but there are a lot of layers eliminated by keeping things in memory unless there is a need for spillover. At one time there was support for turning off spilling. That was eliminated. Why? On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan wrote:

Re: Spark streaming issue

2016-04-01 Thread Mich Talebzadeh
I adopted this approach scala> val conf = new SparkConf(). | setAppName("StreamTest"). | setMaster("local[12]"). | set("spark.driver.allowMultipleContexts", "true"). | set("spark.hadoop.validateOutputSpecs", "false") conf:

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
You need to include the following jars: jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite 1787 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class 1780 Thu Mar 03 09:06:14 PST 2016
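
One way to pull that artifact in with sbt, assuming the tests classifier is published for your Spark version:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1",
      // the test classes (e.g. SparkFunSuite) live in the "tests" classifier jar
      "org.apache.spark" %% "spark-core" % "1.6.1" % "test" classifier "tests"
    )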

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Mridul Muralidharan
I think Reynold's suggestion of using ram disk would be a good way to test if these are the bottlenecks or something else is. For most practical purposes, pointing local dir to ramdisk should effectively give you 'similar' performance as shuffling from memory. Are there concerns with taking that

Re: Spark streaming issue

2016-04-01 Thread Mich Talebzadeh
yes I noticed that scala> val kafkaStream = KafkaUtils.createStream(ssc, "rhes564:2181", "rhes564:9092", "newtopic", 1) :52: error: overloaded method value createStream with alternatives: (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,zkQuorum: String,groupId: String,topics:

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
You're not passing valid Scala values. rhes564:2181 without quotes isn't a valid literal, newtopic isn't a list of strings, etc. On Fri, Apr 1, 2016 at 4:04 PM, Mich Talebzadeh wrote: > Thanks Cody. > > Can I use Receiver-based Approach here? > > I have created the
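
A corrected sketch of the receiver-based call, assuming ssc is the existing StreamingContext (the consumer group name is an assumption):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "rhes564:2181",        // zkQuorum must be a quoted String
      "test-consumer-group", // groupId (assumed name)
      Map("newtopic" -> 1))  // topics: Map of topic -> number of receiver threads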

Spark Text Streaming Does not Recognize Folder using RegEx

2016-04-01 Thread Rachana Srivastava
Hello All, I have written a simple program to get data from JavaDStream textStream = jssc.textFileStream(); JavaDStream ceRDD = textStream.map( new Function() { public String call(String ceData) throws Exception { System.out.println(ceData); } }); } My code works fine when we

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
I totally disagree that it’s not a problem. - Network fetch throughput on 40G Ethernet exceeds the throughput of NVME drives. - What Spark is depending on is Linux’s IO cache as an effective buffer pool. This is fine for small jobs but not for jobs with datasets in the TB/node range. - On

Re: Spark streaming issue

2016-04-01 Thread Mich Talebzadeh
Thanks Cody. Can I use Receiver-based Approach here? I have created the topic newtopic as below ${KAFKA_HOME}/bin/kafka-topics.sh --create --zookeeper rhes564:2181 --replication-factor 1 --partitions 1 --topic newtopic This is basically what I am doing the Spark val lines =

[Spark SQL]: UDF with Array[Double] as input

2016-04-01 Thread Jerry Lam
Hi spark users and developers, Has anyone tried to pass an Array[Double] as an input to a UDF? I tried for many hours reading the Spark SQL code but I still couldn't figure out a way to do this. Best Regards, Jerry

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Reynold Xin
spark.shuffle.spill actually has nothing to do with whether we write shuffle files to disk. Currently it is not possible to not write shuffle files to disk, and typically it is not a problem because the network fetch throughput is lower than what disks can sustain. In most cases, especially with

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Holden Karau
You can also look at spark-testing-base, which works with both ScalaTest and JUnit, and see if that works for your use case. On Friday, April 1, 2016, Ted Yu wrote: > Assuming your code is written in Scala, I would suggest using ScalaTest. > > Please take a look at the

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
Assuming your code is written in Scala, I would suggest using ScalaTest. Please take a look at the XXSuite.scala files under mllib/ On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman wrote: > Hello, > > I have a code written in scala using Mllib. I want to perform unit
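
A minimal ScalaTest sketch with a local SparkContext (suite and test names are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class WordCountSuite extends FunSuite with BeforeAndAfterAll {
      private var sc: SparkContext = _

      override def beforeAll(): Unit = {
        // a small local context is enough for unit tests
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
      }

      override def afterAll(): Unit = {
        sc.stop()
      }

      test("reduceByKey counts words") {
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map((_, 1)).reduceByKey(_ + _).collectAsMap()
        assert(counts("a") === 2)
      }
    }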

Re: Support for time column type?

2016-04-01 Thread Michael Armbrust
There is also CalendarIntervalType. Is that what you are looking for? On Fri, Apr 1, 2016 at 1:11 PM, Philip Weaver wrote: > Hi, I don't see any mention of a time type in the documentation (there is > DateType and TimestampType, but not TimeType), and have been unable

Scala: Perform Unit Testing in spark

2016-04-01 Thread Shishir Anshuman
Hello, I have code written in Scala using MLlib. I want to perform unit testing on it. I can't decide between JUnit 4 and ScalaTest. I am new to Spark. Please guide me on how to proceed with the testing. Thank you.

Re: Spark streaming issue

2016-04-01 Thread Cody Koeninger
It looks like you're using a plain socket stream to connect to a zookeeper port, which won't work. Look at spark.apache.org/docs/latest/streaming-kafka-integration.html On Fri, Apr 1, 2016 at 3:03 PM, Mich Talebzadeh wrote: > > Hi, > > I am just testing Spark

Support for time column type?

2016-04-01 Thread Philip Weaver
Hi, I don't see any mention of a time type in the documentation (there is DateType and TimestampType, but not TimeType), and have been unable to find any documentation about whether this will be supported in the future. Does anyone know if this is currently supported or will be supported in the

Spark streaming issue

2016-04-01 Thread Mich Talebzadeh
Hi, I am just testing Spark streaming with Kafka. Basically I am broadcasting topic every minute to Host:port -> rhes564:2181. This is sending few lines through a shell script as follows: cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list rhes564:9092 --topic newtopic

Re: Problem with jackson lib running on spark

2016-04-01 Thread Marcelo Oikawa
Hi, list. Just to close the thread. Unfortunately, I didn't solve the jackson lib problem, but I did a workaround that works fine for me. Perhaps this helps someone else. The problem arose from this line when I try to create the tranquilizer object (used to connect to Druid) using this utility

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
Please read https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties w.r.t. spark-defaults.conf On Fri, Apr 1, 2016 at 12:06 PM, Max Schmidt wrote: > Yes but doc doesn't say any word for which variable the configs are valid, > so do I have

Re: Spark Metrics Framework?

2016-04-01 Thread Yiannis Gkoufas
Hi Mike, I am forwarding you a mail I sent a while ago regarding some related work I did; hope you find it useful. Hi all, I recently sent a mail to the dev mailing list about this contribution, but I thought it might be useful to post it here, since I have seen a lot of people asking about OS-level

In-Memory Only Spark Shuffle

2016-04-01 Thread slavitch
Hello; I’m working on spark with very large memory systems (2TB+) and notice that Spark spills to disk in shuffle. Is there a way to force spark to stay exclusively in memory when doing shuffle operations? The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x)

Re: OutOfMemory with wide (289 column) dataframe

2016-04-01 Thread Ted Yu
bq. This was a big help! The email (maybe only addressed to you) didn't come with your latest reply. Do you mind sharing it ? Thanks On Fri, Apr 1, 2016 at 11:37 AM, ludflu wrote: > This was a big help! For the benefit of my fellow travelers running spark > on > EMR: > > I

writing partitioned parquet files

2016-04-01 Thread Imran Akbar
Hi, I'm reading in a CSV file, and I would like to write it back as a permanent table, but with particular partitioning by year, etc. Currently I do this: from pyspark.sql import HiveContext sqlContext = HiveContext(sc) df =
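
A Scala sketch of the equivalent write path (path, table and column names are assumptions); partitionBy on the writer produces the year=... directory layout:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/input.csv")

    df.write
      .format("parquet")
      .partitionBy("year")          // one sub-directory per year value
      .saveAsTable("sales_by_year") // persisted as a permanent table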

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Max Schmidt
Yes, but the doc doesn't say for which component the configs are valid, so do I have to set them for the history server? The daemon? The workers? And what if I use the Java API instead of spark-submit for the jobs? I guess that spark-defaults.conf is obsolete for the Java API? Am

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-04-01 Thread Abhishek Anand
Hi Ted, Any thoughts on this? I am getting the same kind of error when I kill a worker on one of the machines. Even after killing the worker using the kill -9 command, the executor shows up on the Spark UI with negative active tasks. All the tasks on that worker start to fail with the following

Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Michael Slavitch
Hello; I’m working on spark with very large memory systems (2TB+) and notice that Spark spills to disk in shuffle. Is there a way to force spark to stay in memory when doing shuffle operations? The goal is to keep the shuffle data either in the heap or in off-heap memory (in 1.6.x) and

Re: OutOfMemory with wide (289 column) dataframe

2016-04-01 Thread ludflu
This was a big help! For the benefit of my fellow travelers running spark on EMR: I made a json file with the following: [ { "Classification": "yarn-site", "Properties": { "yarn.nodemanager.pmem-check-enabled": "false", "yarn.nodemanager.vmem-check-enabled": "false" } } ] and then I created my

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features). On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote: > In my experience, 20K is a lot but often doable; 2K is easy; 200 is > small. Communication
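
A small sketch of lowering maxBins on the spark.ml estimator (the other parameter values are arbitrary):

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier()
      .setMaxBins(16)   // default is 32; fewer bins means less data shipped per split search
      .setMaxDepth(10)
      .setNumTrees(100)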

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features. On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote: > Joseph, > > Correction, there 20k features. Is it still a lot? > What number

SocketTimeoutException

2016-04-01 Thread Sergey
Hi! I get a SocketTimeoutException when I execute this piece of code the first time. When I re-run it, it works fine. The code just reads a csv file and transforms it to a dataframe. Any ideas about the reason? import pyspark_csv as pycsv plaintext_rdd = sc.textFile(r'file:///c:\data\sample.csv') dataframe =

Re: Spark Metrics Framework?

2016-04-01 Thread Mike Sukmanowsky
Thanks Silvio, JIRA submitted https://issues.apache.org/jira/browse/SPARK-14332. On Fri, 25 Mar 2016 at 12:46 Silvio Fiorito wrote: > Hi Mike, > > Sorry got swamped with work and didn’t get a chance to reply. > > I misunderstood what you were trying to do. I

strange behavior of pyspark RDD zip

2016-04-01 Thread Sergey
Hi! I'm on Spark 1.6.1 in local mode on Windows. And I have an issue with zipping of two RDDs of __equal__ size and __equal__ number of partitions (I also tried to repartition both RDDs to one partition). I get such an exception when I do rdd1.zip(rdd2).count(): File

can spark-csv package accept strings instead of files?

2016-04-01 Thread Benjamin Kim
Does anyone know if this is possible? I have an RDD loaded with rows of CSV data strings. Each string represents the header row plus multiple rows of data along with delimiters. I would like to feed each through a CSV parser to convert the data into a dataframe and, ultimately, UPSERT a
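
One option is to skip the spark-csv package and parse the RDD[String] by hand; a rough sketch, assuming a plain comma delimiter with no quoting or escaping (which a real CSV parser would handle):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // csvLines: RDD[String] where the first line is the header
    val header = csvLines.first()
    val schema = StructType(header.split(",").map(StructField(_, StringType, nullable = true)))

    val rows = csvLines.filter(_ != header)
      .map(line => Row.fromSeq(line.split(",", -1).toSeq))

    val df = sqlContext.createDataFrame(rows, schema)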

Logistic regression throwing errors

2016-04-01 Thread James Hammerton
Hi, On a particular .csv data set - which I can use in WEKA's logistic regression implementation without any trouble, I'm getting errors like the following: 16/04/01 18:04:18 ERROR LBFGS: Failure! Resetting history: > breeze.optimize.FirstOrderException: Line search failed These errors cause

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
You can set them in spark-defaults.conf See also https://spark.apache.org/docs/latest/configuration.html#spark-ui On Fri, Apr 1, 2016 at 8:26 AM, Max Schmidt wrote: > Can somebody tell me the interaction between the properties: > > spark.ui.retainedJobs >
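
A spark-defaults.conf sketch with arbitrary values; the spark.ui.* settings are read per application by the driver, while spark.history.retainedApplications belongs to the history server's configuration:

    spark.ui.retainedJobs                 500
    spark.ui.retainedStages               500
    spark.history.retainedApplications    50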

Re: Thread-safety of a SparkListener

2016-04-01 Thread Marcelo Vanzin
On Fri, Apr 1, 2016 at 9:23 AM, Truong Duc Kien wrote: > I need to gather some metrics using a SparkListener. Do the callback > methods need to be thread-safe, or are they always called from the same thread? The callbacks are all fired on the same thread. Just be careful
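
A minimal listener sketch (the counters are made up); since the listener bus delivers events on a single thread, plain fields suffice:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    class MetricsListener extends SparkListener {
      private var tasksCompleted = 0L
      private var stagesCompleted = 0L

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        tasksCompleted += 1
      }

      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        stagesCompleted += 1
      }
    }

    // register on the SparkContext so the listener bus delivers events to it
    sc.addSparkListener(new MetricsListener)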

Re: Thread-safety of a SparkListener

2016-04-01 Thread Truong Duc Kien
I need to process the events related to Task, Stage and Executor. Regards, Kien Truong On Fri, Apr 1, 2016 at 5:34 PM, Ted Yu wrote: > In general, you should implement thread-safety in your code. > > Which set of events are you interested in ? > > Cheers > > On Fri, Apr 1,

Re: Thread-safety of a SparkListener

2016-04-01 Thread Ted Yu
In general, you should implement thread-safety in your code. Which set of events are you interested in ? Cheers On Fri, Apr 1, 2016 at 9:23 AM, Truong Duc Kien wrote: > Hi, > > I need to gather some metrics using a SparkListener. Does the callback > methods need to

Thread-safety of a SparkListener

2016-04-01 Thread Truong Duc Kien
Hi, I need to gather some metrics using a SparkListener. Do the callback methods need to be thread-safe, or are they always called from the same thread? Thanks, Kien Truong

Re: Spark process creating and writing to a Hive ORC table

2016-04-01 Thread Mich Talebzadeh
Yes, this is feasible. You can use the Databricks jar file to load csv files from a staging directory. This is pretty standard: val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://xx:9000/data/stg/") You can then create an
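
A possible continuation with a made-up table name; with a HiveContext the DataFrame can be persisted straight into an ORC table:

    df.write
      .format("orc")
      .saveAsTable("orders_orc")   // creates and populates the Hive ORC table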

Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Max Schmidt
Can somebody tell me the interaction between the properties: spark.ui.retainedJobs spark.ui.retainedStages spark.history.retainedApplications I know from the bugtracker, that the last one describes the number of applications the history-server holds in memory. Can I set the properties in the

Re: Execution error during ALS execution in spark

2016-04-01 Thread pankajrawat
Thanks for the suggestion, but our application is still crashing. *Description: * flatMap at MatrixFactorizationModel.scala:278 *Failure Reason: * Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 116, dev.local):

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread ashokkumar rajendran
I agree with Hemant's comment. But it does not give good results for simple use cases like 2 OR conditions. Ultimately we need good results from Spark for end users. Shall we consider this as a request to support SQL hints then? Is there any plan to support SQL hints in Spark in an upcoming release?

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Robin East
Yes and even today CBO (e.g. in Oracle) will still require hints in some cases so I think it is more like: RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet significant numbers of corner cases where CBO plans simply don’t do what you would want. I don’t know enough about Spark

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Pheonix both

2016-04-01 Thread Divya Gehlot
Forgot to mention that I am using the DataFrame API for all the operations instead of SQL. -- Forwarded message -- From: Divya Gehlot Date: 1 April 2016 at 18:35 Subject: [Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Pheonix both To: "user @spark"

[Spark-1.5.2]Spark Memory Issue while Saving to HDFS and Pheonix both

2016-04-01 Thread Divya Gehlot
Hi, I have a Hadoop Hortonworks 3-node cluster on EC2 with *Hadoop *version 2.7.x *Spark *version - 1.5.2 *Phoenix *version - 4.4 *Hbase *version 1.1.x *Cluster Statistics * Data Node 1 OS: redhat7 (x86_64) Cores (CPU): 2 (2) Disk: 20.69GB/99.99GB (20.69% used) Memory: 7.39GB Date

Join FetchFailedException

2016-04-01 Thread nihed mbarek
Hi, I have a big dataframe (100 GB) that I need to join with 3 other dataframes. For the first join, it's ok. For the second, it's ok. But for the third, just after the big shuffle, before the execution of the stage, I have an exception org.apache.spark.shuffle.FetchFailedException:

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Hemant Bhanawat
As Mich has already noticed, Spark defaults to NL join if there are more than one condition. Oracle is probably doing cost-based optimizations in this scenario. You can call it a bug but in my opinion it is an area where Spark is still evolving. >> Hemant has mentioned the nested loop time will

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread ashokkumar rajendran
Hi Mich, Thanks for the input. Yes, it seems to be a bug. Is it possible to fix this in next release? Regards Ashok On Fri, Apr 1, 2016 at 2:06 PM, Mich Talebzadeh wrote: > hm. > > Sounds like it ends up in Nested Loop Join (NLJ) as opposed to Hash Join > (HJ) when

Relation between number of partitions and cores.

2016-04-01 Thread vaibhavrtk
As per Spark programming guide, it says "we should have 2-4 partitions for each CPU in your cluster.". In this case how does 1 CPU core process 2-4 partitions at the same time? Does it do context switching between tasks or run them in parallel? If it does context switching how is it efficient

Spatial Spark Library on 1.6

2016-04-01 Thread Jorge Machado
Hi Guys, does someone know a good library for geospatial operations? Magellan and Spatial Spark are broken or do not work properly on 1.6. Regards Jorge Machado www.jmachado.me

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Mich Talebzadeh
hm. Sounds like it ends up in a Nested Loop Join (NLJ) as opposed to a Hash Join (HJ) when OR is used for more than one predicate comparison. Below I have a table dummy created as ORC with 1 billion rows. I just created another one called dummy1 with 60K rows. A simple join results in Hash Join
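
A workaround not mentioned in this thread, sketched with made-up DataFrames and column names: split the OR into two equi-joins the planner can hash-join, then union the results.

    // join with OR between two equality predicates tends to fall back to a nested loop
    val slow = big.join(small, big("k1") === small("k1") || big("k2") === small("k2"))

    // rewrite as two hash-joinable equi-joins; distinct() drops rows matched by both
    // predicates, which only matches the original semantics if rows are unique
    val fast = big.join(small, big("k1") === small("k1"))
      .unionAll(big.join(small, big("k2") === small("k2")))
      .distinct()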