Shuffle Write Size

2015-12-24 Thread gsvic
Is there any formula with which I could determine the Shuffle Write size before execution? For example, in a Sort Merge join, in the stage in which the first table is being loaded, the shuffle write is 429.2 MB. The table is 5.5 GB in HDFS with a block size of 128 MB. Consequently, it is being loaded in 45
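As a rough sketch of the partition-count arithmetic the question refers to (one input partition per HDFS block; the 429.2 MB shuffle-write figure would additionally depend on serialization and compression, which this sketch does not model):

```python
import math

# Rough estimate of the number of input partitions (= map tasks) when an
# HDFS-resident table is scanned: one partition per HDFS block.
def input_partitions(file_size_mb: float, block_size_mb: float) -> int:
    return math.ceil(file_size_mb / block_size_mb)

# 5.5 GB with a 128 MB block size is exactly 44 blocks; the 45 partitions
# reported in the thread suggest the file is slightly larger than a round 5.5 GB.
print(input_partitions(5.5 * 1024, 128))  # 44
```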

Hash Partitioning & Sort Merge Join

2015-11-18 Thread gsvic
In the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan
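A minimal sketch of the two quantities in play here (the key string is hypothetical; Spark's actual HashPartitioner uses the key's `hashCode` with a non-negative modulo, for which Python's `hash()` is used as a stand-in):

```python
# Sketch of hash partitioning during a shuffle (exchange):
# partition = nonNegativeMod(key.hashCode, numPartitions).
def hash_partition(key, num_partitions: int) -> int:
    # Python's % on a positive modulus is already non-negative.
    return hash(key) % num_partitions

# A 640 MB file on HDFS with a 64 MB block size is scanned as 10 partitions.
scan_partitions = 640 // 64
print(scan_partitions)  # 10

# Rows of both relations with the same join key land in the same shuffle
# partition, which is what makes the subsequent sort-merge possible.
p = hash_partition("customer_42", 200)
assert p == hash_partition("customer_42", 200)
```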

Map Tasks - Disk Spill (?)

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Are map tasks spilling data to disk?

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Map Tasks - Disk I/O

2015-11-11 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin
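For context on what BroadcastHashJoin does (a plain-Python sketch, not Spark's implementation; relation and key names are made up): the small relation is shipped to every task and turned into a hash table, so the large relation is probed map-side without a shuffle, and hence without the map-side disk writes a shuffle would require.

```python
from collections import defaultdict

# Sketch of a broadcast hash join: build a hash table from the small
# (broadcast) relation, then stream the large relation past it in one pass.
def broadcast_hash_join(small, large, key_small, key_large):
    table = defaultdict(list)
    for row in small:                      # build phase: small side only
        table[row[key_small]].append(row)
    out = []
    for row in large:                      # probe phase: no shuffle needed
        for match in table.get(row[key_large], []):
            out.append({**match, **row})
    return out

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
large = [{"id": 1, "v": 10}, {"id": 2, "v": 20}, {"id": 3, "v": 30}]
print(broadcast_hash_join(small, large, "id", "id"))
```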

Task Execution

2015-09-30 Thread gsvic
Concerning task execution, does a worker execute its assigned tasks in parallel or sequentially? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Task-Execution-tp14411.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
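The gist of the answer can be sketched in plain Python (a stand-in, not Spark code): an executor runs its assigned tasks concurrently, up to a fixed number of task slots (cf. `spark.executor.cores`), and any extra tasks queue until a slot frees up.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: a worker (executor) runs tasks in parallel up to a fixed number
# of slots, not strictly sequentially; surplus tasks wait for a free slot.
def run_tasks(tasks, slots: int):
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # pool.map preserves the submission order of results.
        return list(pool.map(lambda t: t(), tasks))

tasks = [lambda i=i: i * i for i in range(8)]
print(run_tasks(tasks, slots=4))  # [0, 1, 4, 9, 16, 25, 36, 49]
```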

Re: RDD: Execution and Scheduling

2015-09-20 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine a node to be a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and by which criteria? Could you please also include links to the source files for the two questions above?

RDD: Execution and Scheduling

2015-09-17 Thread gsvic
After reading some parts of the Spark source code, I would like to ask some questions about RDD execution and scheduling. First, please correct me if I am wrong about the following: 1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
? Does it only contain a single line? On Wed, Aug 26, 2015 at 6:47 AM, gsvic wrote: Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame

SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame, using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPC-H benchmark. After

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without a newline between records. I've just changed that and now it works fine. Thank you very much for your support.
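The fix described here is the one-JSON-object-per-line layout that sqlContext.read.json expects. A minimal sketch of writing records that way (record contents and file name are made up for illustration):

```python
import json
import os
import tempfile

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write one complete JSON object per line -- appending records without a
# newline produces a single unparseable line, which is what caused the
# IOException above.
path = os.path.join(tempfile.mkdtemp(), "table.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line now parses independently, as a line-oriented JSON reader requires.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(parsed)
```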