Shuffle Write Size

2015-12-24 Thread gsvic
Is there any formula with which I could determine the Shuffle Write size before execution? For example, in a Sort Merge join, in the stage in which the first table is being loaded, the shuffle write is 429.2 MB. The table is 5.5 GB in HDFS with a block size of 128 MB. Consequently, it is being loaded in 45
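As a rough sketch of the partition-count arithmetic the question refers to (one input partition per HDFS block; the 429.2 MB shuffle-write figure would additionally depend on serialization and compression, which this sketch does not model):

```python
import math

# Rough estimate of the number of input partitions (= map tasks) when an
# HDFS-resident table is scanned: one partition per HDFS block.
def input_partitions(file_size_mb: float, block_size_mb: float) -> int:
    return math.ceil(file_size_mb / block_size_mb)

# 5.5 GB with a 128 MB block size is exactly 44 blocks; the 45 partitions
# reported in the thread suggest the file is slightly larger than a round 5.5 GB.
print(input_partitions(5.5 * 1024, 128))  # 44
```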

Hash Partitioning & Sort Merge Join

2015-11-18 Thread gsvic
In the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan
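A minimal sketch of the two quantities in play here (the key string is hypothetical; Spark's actual HashPartitioner uses the key's `hashCode` with a non-negative modulo, for which Python's `hash()` is used as a stand-in):

```python
# Sketch of hash partitioning during a shuffle (exchange):
# partition = nonNegativeMod(key.hashCode, numPartitions).
def hash_partition(key, num_partitions: int) -> int:
    # Python's % on a positive modulus is already non-negative.
    return hash(key) % num_partitions

# A 640 MB file on HDFS with a 64 MB block size is scanned as 10 partitions.
scan_partitions = 640 // 64
print(scan_partitions)  # 10

# Rows of both relations with the same join key land in the same shuffle
# partition, which is what makes the subsequent sort-merge possible.
p = hash_partition("customer_42", 200)
assert p == hash_partition("customer_42", 200)
```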

Map Tasks - Disk Spill (?)

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Are map tasks spilling data to disk?

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Map Tasks - Disk I/O

2015-11-11 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin
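For context on what BroadcastHashJoin does (a plain-Python sketch, not Spark's implementation; relation and key names are made up): the small relation is shipped to every task and turned into a hash table, so the large relation is probed map-side without a shuffle, and hence without the map-side disk writes a shuffle would require.

```python
from collections import defaultdict

# Sketch of a broadcast hash join: build a hash table from the small
# (broadcast) relation, then stream the large relation past it in one pass.
def broadcast_hash_join(small, large, key_small, key_large):
    table = defaultdict(list)
    for row in small:                      # build phase: small side only
        table[row[key_small]].append(row)
    out = []
    for row in large:                      # probe phase: no shuffle needed
        for match in table.get(row[key_large], []):
            out.append({**match, **row})
    return out

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
large = [{"id": 1, "v": 10}, {"id": 2, "v": 20}, {"id": 3, "v": 30}]
print(broadcast_hash_join(small, large, "id", "id"))
```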

Task Execution

2015-09-30 Thread gsvic
Concerning task execution, does a worker execute its assigned tasks in parallel or sequentially? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Task-Execution-tp14411.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
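The gist of the answer can be sketched in plain Python (a stand-in, not Spark code): an executor runs its assigned tasks concurrently, up to a fixed number of task slots (cf. `spark.executor.cores`), and any extra tasks queue until a slot frees up.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: a worker (executor) runs tasks in parallel up to a fixed number
# of slots, not strictly sequentially; surplus tasks wait for a free slot.
def run_tasks(tasks, slots: int):
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # pool.map preserves the submission order of results.
        return list(pool.map(lambda t: t(), tasks))

tasks = [lambda i=i: i * i for i in range(8)]
print(run_tasks(tasks, slots=4))  # [0, 1, 4, 9, 16, 25, 36, 49]
```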

Re: RDD: Execution and Scheduling

2015-09-20 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine a node to be a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and by which criteria? Could you please also include links to the source files for the two questions above?

RDD: Execution and Scheduling

2015-09-17 Thread gsvic
After reading some parts of the Spark source code, I would like to ask some questions about RDD execution and scheduling. First, please correct me if I am wrong about the following: 1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
? Does it only contain a single line? On Wed, Aug 26, 2015 at 6:47 AM, gsvic wrote: Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame

SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame, using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPC-H benchmark. After

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without a newline between records. I've just changed that and now it works fine. Thank you very much for your support.
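The fix described here is the one-JSON-object-per-line layout that sqlContext.read.json expects. A minimal sketch of writing records that way (record contents and file name are made up for illustration):

```python
import json
import os
import tempfile

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write one complete JSON object per line -- appending records without a
# newline produces a single unparseable line, which is what caused the
# IOException above.
path = os.path.join(tempfile.mkdtemp(), "table.json")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line now parses independently, as a line-oriented JSON reader requires.
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(parsed)
```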