Why is repartitionAndSortWithinPartitions slower than MapReduce

2018-08-20 Thread 周浥尘
Hi team,

I found that the Spark method *repartitionAndSortWithinPartitions* spends twice
as much time as MapReduce in some cases.
I want to repartition the dataset according to split keys and save it to
files in ascending order. As the doc says, repartitionAndSortWithinPartitions “is
more efficient than calling `repartition` and then sorting within each
partition because it can push the sorting down into the shuffle machinery.”
I thought it might be faster than MR, but actually it is much slower. I
also adjusted several Spark configurations, but that didn't help. (Both
Spark and MapReduce run on a three-node cluster and use the same number
of partitions.)
Can this behaviour be explained, or is there any way to improve the
performance of Spark?
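
For reference, a minimal sketch of the pattern in question, assuming an RDD of
key/value pairs (the key extraction, partitioner and paths below are
illustrative placeholders, not the actual job):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sort-sketch"))

    // Pair RDD of (split key, record); the real job derives the key from its own input.
    val pairs = sc.textFile("hdfs:///tmp/input")
      .map(line => (line.split("\t")(0), line))

    // Partition by key and sort each partition during the shuffle itself,
    // rather than repartition() followed by a separate per-partition sort.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(360))

    sorted.saveAsTextFile("hdfs:///tmp/output")
    sc.stop()
  }
}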

Thanks & Regards,
Yichen


Re: Two different Hive instances running

2018-08-20 Thread Vaibhav Kulkarni
You can pass hive-site.xml to the spark-submit command with the --files
option to make sure that the Spark job refers to the Hive metastore
you are interested in:

spark-submit --files /path/to/hive-site.xml
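
Alternatively, the session can be pointed at a specific metastore
programmatically. A minimal sketch, assuming the thrift URI and warehouse
path below are replaced with the values from the hive-site.xml that
HiveServer2 uses:

import org.apache.spark.sql.SparkSession

// Hive-enabled session that talks to the same metastore as beeline/HiveServer2.
val spark = SparkSession.builder()
  .appName("shared-metastore-example")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")  // placeholder URI
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")      // placeholder path
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()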



On Sat, Aug 18, 2018 at 1:59 AM Patrick Alwell 
wrote:

> You probably need to take a look at your hive-site.xml and see what the
> location is for the Hive metastore. As for beeline, you can explicitly use
> a particular HiveServer2 instance by passing its JDBC URL when you launch
> the client, e.g. beeline -u "jdbc:hive2://example.com:10000"
>
>
>
> Try taking a look at this
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hive-metastore.html
>
>
>
> There should be conf settings you can update to make sure you are using
> the same metastore as the instance of HiveServer.
>
>
>
> Hive Wiki is a great resource as well ☺
>
>
>
> *From: *Fabio Wada 
> *Date: *Friday, August 17, 2018 at 11:22 AM
> *To: *"user@spark.apache.org" 
> *Subject: *Two different Hive instances running
>
>
>
> Hi,
>
>
>
> I am executing an insert into a Hive table using SparkSession in Java. When I
> execute a select via beeline, I don't see the inserted data. And when I
> insert data using beeline, I don't see it from my program using SparkSession.
>
>
>
> It looks like there are two different Hive instances running.
>
>
>
> How can I point both SparkSession and beeline to the same Hive instance?
>
>
>
> Thanks
>
>


Re: Why is repartitionAndSortWithinPartitions slower than MapReduce

2018-08-20 Thread 周浥尘
In addition to my previous email,
Environment: spark 2.1.2, hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

周浥尘 wrote on Mon, Aug 20, 2018 at 8:52 PM:

> Hi team,
>
> I found that the Spark method *repartitionAndSortWithinPartitions* spends
> twice as much time as MapReduce in some cases.
> I want to repartition the dataset according to split keys and save it
> to files in ascending order. As the doc says,
> repartitionAndSortWithinPartitions “is more efficient than calling
> `repartition` and then sorting within each partition because it can push
> the sorting down into the shuffle machinery.” I thought it might be faster
> than MR, but actually it is much slower. I also adjusted several
> Spark configurations, but that didn't help. (Both Spark and MapReduce
> run on a three-node cluster and use the same number of partitions.)
> Can this behaviour be explained, or is there any way to improve the
> performance of Spark?
>
> Thanks & Regards,
> Yichen
>


Re: Why is repartitionAndSortWithinPartitions slower than MapReduce

2018-08-20 Thread Koert Kuipers
I assume you are using RDDs? What are you doing after the repartitioning +
sorting, if anything?


On Aug 20, 2018 11:22, "周浥尘"  wrote:

In addition to my previous email,
Environment: spark 2.1.2, hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

周浥尘 wrote on Mon, Aug 20, 2018 at 8:52 PM:

> Hi team,
>
> I found that the Spark method *repartitionAndSortWithinPartitions* spends
> twice as much time as MapReduce in some cases.
> I want to repartition the dataset according to split keys and save it
> to files in ascending order. As the doc says, repartitionAndSortWithinPartitions
> “is more efficient than calling `repartition` and then sorting within each
> partition because it can push the sorting down into the shuffle machinery.”
> I thought it might be faster than MR, but actually it is much slower. I
> also adjusted several Spark configurations, but that didn't help. (Both
> Spark and MapReduce run on a three-node cluster and use the same number
> of partitions.)
> Can this behaviour be explained, or is there any way to improve the
> performance of Spark?
>
> Thanks & Regards,
> Yichen
>


No space left on device

2018-08-20 Thread Steve Lewis
We are trying to run a job that previously ran on Spark 1.3 on a
different cluster. The job has been converted to Spark 2.3 and this is a
new cluster.

The job dies after completing about half a dozen stages with

java.io.IOException: No space left on device


   It appears that the nodes are using local storage for tmp.


I could use help diagnosing the issue and how to fix it.


Here are the Spark conf properties:

Spark Conf Properties
spark.driver.extraJavaOptions=-Djava.io.tmpdir=/scratch/home/int/eva/zorzan/sparktmp/
spark.master=spark://10.141.0.34:7077
spark.mesos.executor.memoryOverhead=3128
spark.shuffle.consolidateFiles=true
spark.shuffle.spill=false
spark.app.name=Anonymous
spark.shuffle.manager=sort
spark.storage.memoryFraction=0.3
spark.jars=file:/home/int/eva/zorzan/bin/SparkHydraV2-master/HydraSparkBuilt.jar
spark.ui.killEnabled=true
spark.shuffle.spill.compress=true
spark.shuffle.sort.bypassMergeThreshold=100
com.lordjoe.distributed.marker_property=spark_property_set
spark.executor.memory=12g
spark.mesos.coarse=true
spark.shuffle.memoryFraction=0.4
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=com.lordjoe.distributed.hydra.HydraKryoSerializer
spark.default.parallelism=360
spark.io.compression.codec=lz4
spark.reducer.maxMbInFlight=128
spark.hadoop.validateOutputSpecs=false
spark.submit.deployMode=client
spark.local.dir=/scratch/home/int/eva/zorzan/sparktmp
spark.shuffle.file.buffer.kb=1024
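
For reference, shuffle and spill files land in spark.local.dir (and, on
standalone workers, SPARK_LOCAL_DIRS overrides that setting if it is set), so
that volume needs enough free space for the largest shuffles. A sketch of
pointing it at a bigger disk, assuming a path such as /mnt/bigdisk exists on
every node (the path is a placeholder, not a recommendation):

spark.local.dir=/mnt/bigdisk/spark-local
spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mnt/bigdisk/tmp

# or, on each standalone worker, before starting it:
export SPARK_LOCAL_DIRS=/mnt/bigdisk/spark-local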



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Unsubscribe

2018-08-20 Thread Happy??????


Spark with Scala: understanding closures, or the best way to take UDF registration code out of main and put it in utils

2018-08-20 Thread aastha
This is more of a Scala concept question than a Spark one. I have this Spark
initialization code:

object EntryPoint {
  val spark = SparkFactory.createSparkSession(...
  val funcsSingleton = ContextSingleton[CustomFunctions] { new CustomFunctions(Some(hashConf)) }
  lazy val funcs = funcsSingleton.get
  // this part I want moved to another place since there are many, many UDFs
  spark.udf.register("funcName", udf { funcName _ })
}
The other class, CustomFunctions, looks like this:

class CustomFunctions(val hashConfig: Option[HashConfig], spark: Option[SparkSession] = None) {
  val funcUdf = udf { funcName _ }
  def funcName(colValue: String) = withDefinedOpt(hashConfig) { c =>
    ... }
}
The class is made serializable by wrapping it in ContextSingleton, which is
defined like so:

class ContextSingleton[T: ClassTag](constructor: => T) extends AnyRef with Serializable {
  val uuid = UUID.randomUUID.toString
  @transient private lazy val instance = ContextSingleton.pool.synchronized {
    ContextSingleton.pool.getOrElseUpdate(uuid, constructor)
  }
  def get = instance.asInstanceOf[T]
}

object ContextSingleton {
  private val pool = new TrieMap[String, Any]()
  def apply[T: ClassTag](constructor: => T): ContextSingleton[T] = new ContextSingleton[T](constructor)
  def poolSize: Int = pool.size
  def poolClear(): Unit = pool.clear()
}

Now to my problem: I don't want to have to explicitly register the UDFs as is
done in the EntryPoint app. I create all the UDFs I need in my CustomFunctions
class and then want to register dynamically only the ones I read from
user-provided config. What would be the best way to achieve that? Also, when I
try to register the required UDFs outside the main app, I get the infamous
`TaskNotSerializable` exception. Serializing the big CustomFunctions class is
not a good idea, hence I wrapped it in ContextSingleton, but that does not
solve the problem of registering UDFs outside the main object. Please suggest
the right approach.
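
One pattern that can work here (a sketch only; the hashing logic, names and
config handling below are stand-ins, not the real CustomFunctions/HashConfig):
build each UserDefinedFunction so its closure captures only the plain values it
needs rather than the enclosing instance, keep a name -> UDF map, and register
from that map on the driver based on the user config. Recent 2.x releases
accept an already-built UserDefinedFunction in spark.udf.register(name, udf).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

class CustomFunctions(hashSalt: Option[String]) {

  // Copy the needed value into a local val so the UDF closure captures only
  // that String, not the enclosing (non-serializable) class instance.
  val hashName: UserDefinedFunction = {
    val salt = hashSalt.getOrElse("")
    udf { (value: String) => (salt + value).hashCode.toString }
  }

  val upperName: UserDefinedFunction = udf { (value: String) => value.toUpperCase }

  // Name -> UDF lookup table used for config-driven registration.
  val byName: Map[String, UserDefinedFunction] = Map(
    "hashName"  -> hashName,
    "upperName" -> upperName
  )
}

object UdfRegistry {
  // Runs on the driver: register only the UDF names listed in the user's config.
  def registerFromConfig(spark: SparkSession,
                         funcs: CustomFunctions,
                         wanted: Seq[String]): Unit =
    wanted.foreach { name =>
      funcs.byName.get(name).foreach(u => spark.udf.register(name, u))
    }
}

The main object then reduces to something like
UdfRegistry.registerFromConfig(spark, funcs, namesFromConfig), and nothing
forces CustomFunctions itself to be serialized.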
 








-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org